Magicoder: Empowering Code Generation with OSS-INSTRUCT

Yuxiang Wei 1   Zhe Wang 2†   Jiawei Liu 1   Yifeng Ding 1   Lingming Zhang 1
Figure 1: Overview of OSS-INSTRUCT and the pass@1 results of different LLMs on HumanEval (+).

OSS-INSTRUCT leverages a powerful LLM to automatically generate new coding problems by drawing inspiration from any random code snippets collected from the open source. In this example, the LLM gets inspired by two incomplete code fragments from different functions and manages to relate them and craft a realistic machine learning problem. Thanks to the "infinite" real-world open-source code, OSS-INSTRUCT can directly produce diverse, realistic, and controllable code instructions by providing distinct seed code snippets. In the end, we generate 75K synthetic data to finetune CODELLAMA-PYTHON-7B, resulting in Magicoder-CL. While being simple and effective, OSS-INSTRUCT is orthogonal to existing data generation methods, and they can be combined to further boost the models' coding capabilities. Therefore, we continually finetune Magicoder-CL on an open-source Evol-Instruct dataset with 110K entries, producing MagicoderS-CL.

We evaluate Magicoder and MagicoderS on a wide range of coding tasks, including HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) for Python text-to-code generation, MultiPL-E (Cassano et al., 2022) for multilingual code completion, and DS-1000 (Lai et al., 2022) for solving data science problems. We further adopt EvalPlus (Liu et al., 2023b), which includes the augmented HumanEval+ and MBPP+ datasets for more rigorous model evaluation. Both Magicoder-CL and MagicoderS-CL substantially boost the base CODELLAMA-PYTHON-7B. Additionally, Magicoder-CL even outperforms WizardCoder-CL-7B, WizardCoder-SC-15B, and all studied SOTA LLMs with less than or equal to 16B parameters on all the benchmarks we tested. Also, the pass@1 result of the enhanced MagicoderS-CL is on par with ChatGPT on HumanEval (70.7 vs. 72.6) and surpasses it on the more rigorous HumanEval+ (66.5 vs. 65.9), indicating that MagicoderS-CL can generate more robust code. It also achieves SOTA results among all code models at the same scale.

Additionally, we notice a very recent advancement in the development of the DeepSeek-Coder series (Guo et al., 2024), which has shown exceptional coding performance. However, due to the limited technical details disclosed, we only briefly discuss them in §3.4. Despite this, we applied OSS-INSTRUCT on DeepSeek-Coder-Base 6.7B, resulting in the creation of Magicoder-DS and MagicoderS-DS. In addition to the consistent findings on the previous results with CODELLAMA-PYTHON-7B as the base model, Magicoder-DS and MagicoderS-DS benefit from the more powerful DeepSeek-Coder-Base-6.7B. This advantage is demonstrated by MagicoderS-DS, which achieves a remarkable 76.8 pass@1 on HumanEval. MagicoderS-DS also outperforms DeepSeek-Coder-Instruct-6.7B on HumanEval (+) and MBPP (+) with 8× fewer finetuning tokens.
To justify the design of OSS-INSTRUCT, i.e., generating instruction-tuning data from open-source references rather than using the references directly, we demonstrate that finetuning the base models with semantically relevant comment-function pairs extracted from open-source projects even negatively impacts the model performance (§4.2).

In general, we make the following contributions:

• We introduce OSS-INSTRUCT, a pioneering approach to enlightening LLMs with open-source code snippets to generate more diverse, realistic, and controllable coding instruction data, which can be leveraged to substantially boost the performance of various LLMs via instruction tuning. It opens a new dimension for creating low-bias and diverse instruction-tuning data from the abundance of open-source references.

• We build the Magicoder series trained with OSS-INSTRUCT and the MagicoderS series trained on a combination of OSS-INSTRUCT and Evol-Instruct. Our evaluation across 6 benchmarks shows that all Magicoders significantly improve the base LLMs. Notably, both MagicoderS-CL and MagicoderS-DS outperform ChatGPT on HumanEval+ with only 7B parameters.

• We fully open source the model weights, training data, and source code at https://github.com/ise-uiuc/magicoder to facilitate future research.

2. OSS-INSTRUCT: Instruction Tuning from Open Source

In this section, we elaborate on our OSS-INSTRUCT approach. From a high level, as shown in Figure 1, OSS-INSTRUCT works by prompting an LLM (e.g., ChatGPT) to generate a coding problem and its solution according to some seed code snippet collected from the wild (e.g., from GitHub). The seed snippet offers controllability of the generation and encourages the LLM to create diverse coding problems that can reflect real-world programming scenarios.

2.1. Generating Coding Problems

OSS-INSTRUCT is powered by seed code snippets that can be easily collected from open source. In this work, we directly adopt starcoderdata as our seed corpus, a filtered version of The Stack (Kocetkov et al., 2022) dataset that StarCoder is trained on, containing permissively licensed source code documents in various programming languages. We chose starcoderdata because it is widely adopted, includes massive high-quality code snippets, and is even post-processed for data decontamination (Li et al., 2023; Allal et al., 2023). For each code document from the corpus, we randomly extract 1–15 consecutive lines as the seed snippet for the model to gain inspiration from and produce coding problems. In total, we collected 80K initial seed snippets from 80K code documents: 40K from Python, and 5K from each of C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift. Then, each collected seed code snippet is applied to the prompt template shown in Appendix A.1, which a teacher model takes as input and outputs both a coding problem and its solution.
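To make this procedure concrete, the following minimal Python sketch illustrates how a seed snippet of 1–15 consecutive lines could be drawn from a code document and spliced into a prompt. The prompt text here is an abbreviated placeholder rather than the actual template from Appendix A.1, and extract_seed/build_prompt are hypothetical helper names, not part of the released pipeline.

    import random

    # Abbreviated stand-in for the OSS-INSTRUCT prompt template in Appendix A.1.
    PROMPT_TEMPLATE = (
        "Please gain inspiration from the following random code snippet to create a "
        "high-quality programming problem and its self-contained solution.\n\n"
        "Code snippet for inspiration:\n{seed}\n"
    )

    def extract_seed(document: str, min_lines: int = 1, max_lines: int = 15) -> str:
        # Randomly pick 1-15 consecutive lines from a (non-empty) code document.
        lines = document.splitlines()
        n = random.randint(min_lines, min(max_lines, len(lines)))
        start = random.randint(0, len(lines) - n)
        return "\n".join(lines[start:start + n])

    def build_prompt(document: str) -> str:
        return PROMPT_TEMPLATE.format(seed=extract_seed(document))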
2.2. Data Cleaning and Decontamination

We perform data cleaning by excluding samples that are identical or share the same seed code snippet. While there exist other sorts of noisiness (e.g., the solution is incomplete) in the generated data, inspired by Honovich et al. (2023), they are not removed as we believe they still contain valuable information for LLMs to learn. More experimental details can be found in Appendix C.3. Finally, we apply the same logic as StarCoder (Li et al., 2023) to decontaminate our training data by removing coding problems that contain docstrings or solutions from HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), docstrings from APPS (Hendrycks et al., 2021), prompts from DS-1000 (Lai et al., 2022), or questions from GSM8K (Cobbe et al., 2021). As part of our analysis, the decontamination procedure only filters out 9 additional samples. Since the seed corpus starcoderdata has already gone through rigorous data decontamination, this observation suggests that OSS-INSTRUCT is unlikely to introduce additional data leakage beyond the seeds. The eventual OSS-INSTRUCT dataset contains about 75K entries. An overview of the dataset statistics can be found in Appendix A.3.
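The deduplication rule above (drop samples that are exact duplicates or that reuse a seed snippet) can be pictured with the small sketch below; the dictionary keys 'seed', 'problem', and 'solution' are illustrative assumptions rather than the released data schema.

    def dedup_samples(samples):
        # Drop samples that are identical or that share the same seed code snippet.
        seen_seeds, seen_pairs, kept = set(), set(), []
        for s in samples:
            pair = (s["problem"], s["solution"])
            if s["seed"] in seen_seeds or pair in seen_pairs:
                continue
            seen_seeds.add(s["seed"])
            seen_pairs.add(pair)
            kept.append(s)
        return kept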
2.3. Qualitative Examples of OSS-INSTRUCT

Figure 2 shows some qualitative examples of how OSS-INSTRUCT can help an LLM get inspiration from a seed code snippet to create new coding problems and solutions. For example, the shell script example shows how an LLM crafts a Python coding problem with just one line of shell script. The library imports example demonstrates how an LLM can create a realistic machine learning problem using just a few import statements. Meanwhile, the class signature instance illustrates the ability of an LLM to draw inspiration from an incomplete class definition featuring annotations like SpringBootApplication and keywords such as bank. From this, the LLM generates a problem that requires implementing a complete banking system based on Spring Boot. Overall, OSS-INSTRUCT can inspire an LLM with distinct code structures and semantics to create diverse coding tasks, including algorithmic challenges, realistic issues, single-function code generation, library-based program completion, whole-program development, and even whole-application construction.

Figure 2: Examples showing how OSS-INSTRUCT generates problems and solutions from seed code snippets. Detailed problem requirements, implementations, and explanations are omitted for brevity. More examples can be found in Appendix A.2.

Similarity with HumanEval  To study whether our data generation process produces more HumanEval-like problems or solutions that contribute to high performance, we pair each sample from our 75K dataset with each of the 164 HumanEval (Chen et al., 2021) samples and compute their cosine similarity using TF-IDF (Sparck Jones, 1972) embeddings. We then associate each OSS-INSTRUCT sample with the HumanEval sample with the highest similarity score. We also compare our dataset against Code Alpaca, a 20K dataset applying SELF-INSTRUCT to code, and evol-codealpaca-v1 (theblackcat102, 2023), an open-source reproduction of Evol-Instruct containing 110K coding instructions. We resort to the open-source implementation because the official Code Evol-Instruct (Luo et al., 2023b) dataset is not released. We decontaminate all the datasets beforehand in the same way as discussed in §2.2.

Figure 3: Cosine similarities between HumanEval and synthetic data generated by different methods.

Figure 3 shows that OSS-INSTRUCT exhibits the lowest average similarity among all the studied data generation techniques, while SELF-INSTRUCT shows the highest average similarity.
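A minimal sketch of this similarity computation using scikit-learn is shown below, assuming the generated samples and the 164 HumanEval tasks are available as plain-text lists; it is an illustration of the TF-IDF pairing described above, not the exact analysis script.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def max_similarity_to_humaneval(dataset_texts, humaneval_texts):
        # For each generated sample, return its highest cosine similarity
        # to any HumanEval task, using TF-IDF embeddings fit on both sets.
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(dataset_texts + humaneval_texts)
        data_vecs = matrix[: len(dataset_texts)]
        he_vecs = matrix[len(dataset_texts):]
        sims = cosine_similarity(data_vecs, he_vecs)  # shape: (num samples, 164)
        return sims.max(axis=1)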
3. Evaluation

We choose CODELLAMA-PYTHON-7B and DeepSeek-Coder-Base 6.7B as the base LLMs. To derive the Magicoder series, we first finetune them on the 75K synthetic data generated through OSS-INSTRUCT. We then obtain MagicoderS by continuing finetuning Magicoder with the evol-codealpaca-v1 dataset, an open-source Evol-Instruct implementation containing about 110K samples. More implementation details and additional evaluation results are listed in Appendices B and C. We also present interesting use cases that reflect the effectiveness of instruction tuning in Appendix D and demonstrate Magicoder's capability to generate complex programs in Appendix E.

Table 1: Pass@1 (%) results of different LLMs on HumanEval (+) and MBPP (+) computed with greedy decoding. The abbreviations "CL" and "SC" refer to the base models CODELLAMA-PYTHON and StarCoder, respectively. We report the results consistently from the EvalPlus (Liu et al., 2023b) leaderboard.

Model               Release Date   Size   HumanEval (+)   MBPP (+)      Open-Source Weight   Open-Source Data
GPT-3.5 Turbo       Nov 2023       -      72.6 (65.9)     81.7 (69.4)   #                    #
GPT-4 Turbo         Nov 2023       -      85.4 (81.7)     83.0 (70.7)   #                    #
CODELLAMA-PYTHON    Aug 2023       34B    51.8 (42.7)     67.2 (52.9)                        #
WizardCoder-CL      Sep 2023       34B    73.2 (64.6)     73.2 (59.9)                        #
CodeT5+             May 2023       16B    31.7 (26.2)     54.6 (44.4)
CodeGen-Mono        Mar 2022       16B    32.9 (27.4)     52.6 (43.6)
StarCoder           May 2023       15B    34.1 (29.3)     55.1 (46.1)
CODELLAMA-PYTHON    Aug 2023       13B    42.7 (36.6)     61.2 (50.9)                        #
WizardCoder-SC      Sep 2023       15B    51.9 (45.1)     61.9 (50.6)                        #
StarCoder           May 2023       7B     24.4 (20.7)     33.1 (28.8)
Mistral             Oct 2023       7B     28.7 (23.2)     50.1 (40.9)                        #
CodeT5+             May 2023       6B     29.3 (23.8)     51.9 (40.9)
CodeGen-Mono        Mar 2022       6B     29.3 (25.6)     49.9 (42.1)
CODELLAMA-PYTHON    Aug 2023       7B     37.8 (34.1)     57.6 (45.4)                        #
WizardCoder-CL      Sep 2023       7B     48.2 (40.9)     56.6 (47.1)                        #
Magicoder-CL        Dec 2023       7B     60.4 (55.5)     64.2 (52.6)
MagicoderS-CL       Dec 2023       7B     70.7 (66.5)     68.4 (56.6)

3.1. Python Text-to-Code Generation

HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are two of the most widely used benchmarks for code generation. Each task in these benchmarks includes a task description (e.g., docstring) as the prompt, where LLMs generate corresponding code whose correctness is checked by a handful of test cases. Because tests in these benchmarks can be insufficient, for more rigorous evaluation, we use HumanEval+ and MBPP+, both powered by the EvalPlus framework (Liu et al., 2023b), to obtain 80×/35× more tests. Following prior work (Liu et al., 2023b; Chen et al., 2023), for each task and LLM we use greedy decoding to generate one sample and focus on comparing the pass@1 metric.
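For illustration, a greedy-decoding pass@1 evaluation along these lines could look like the sketch below with a Hugging Face causal LM; check_correctness stands in for the benchmark's test harness (e.g., EvalPlus), and prompt handling is deliberately simplified.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def greedy_pass_at_1(model_name, tasks, check_correctness):
        # tasks: list of dicts with a 'prompt' field; check_correctness(task, completion)
        # is a stand-in for running the benchmark's test cases and returns True/False.
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
        passed = 0
        for task in tasks:
            inputs = tok(task["prompt"], return_tensors="pt")
            out = model.generate(**inputs, do_sample=False, max_new_tokens=512)  # greedy decoding
            completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            passed += check_correctness(task, completion)
        return passed / len(tasks)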
We consider a wide range of baseline models, including CODELLAMA-PYTHON (Rozière et al., 2023), WizardCoder (Luo et al., 2023b), GPT-3.5 Turbo (OpenAI, 2022), GPT-4 Turbo (OpenAI, 2023), StarCoder (Li et al., 2023), CodeT5+ (Wang et al., 2023b), CodeGen-Mono (Nijkamp et al., 2023), and Mistral (Jiang et al., 2023a). All the results are consistently reported from the EvalPlus (Liu et al., 2023b) leaderboard (EvalPlus hash: 1895d2f).

Table 1 shows the pass@1 results of different LLMs on these benchmarks. From the results, we can first observe that Magicoder-CL has a clear improvement over the base CODELLAMA-PYTHON-7B, and outperforms all studied open-source models except CODELLAMA-PYTHON-34B and WizardCoder-CL-34B. Notably, Magicoder-CL surpasses WizardCoder-SC-15B and has a substantial improvement on HumanEval and HumanEval+ over CODELLAMA-PYTHON-34B. MagicoderS-CL demonstrates further improvements by being trained with the orthogonal Evol-Instruct method. MagicoderS-CL outperforms ChatGPT and all other open-source models on HumanEval+. Moreover, although it scores slightly lower than WizardCoder-CL-34B and ChatGPT on HumanEval, it surpasses both of them on the more rigorous HumanEval+ dataset, indicating that MagicoderS-CL may produce more robust code.

3.2. Multilingual Code Generation

In addition to Python, as shown in Table 2, we perform an extensive evaluation on 6 widely used programming languages, i.e., Java, JavaScript, C++, PHP, Swift, and Rust, using the MultiPL-E benchmark (Cassano et al., 2022). We report available results from the WizardCoder paper (Luo et al., 2023b) and evaluate our models consistently through bigcode-evaluation-harness (Ben Allal et al., 2022). We skip proprietary models such as ChatGPT and GPT-4 as they are not supported by the framework. Due to a significant inference latency when running WizardCoder-CL-7B using the harness in our environment, we choose not to include it in our analysis.

The results indicate that Magicoder-CL improves the base CODELLAMA-PYTHON-7B by a large margin across all the studied programming languages. Moreover, Magicoder-CL also achieves better results than the SOTA 15B WizardCoder-SC on half of the programming languages. Additionally, MagicoderS-CL demonstrates further improvement over Magicoder-CL on all programming languages, achieving comparable performance against WizardCoder-CL-34B with only 7B parameters. It is worth noting that Magicoder-CL is only trained with very limited multilingual data but still outperforms other LLMs with similar or even larger sizes.
Also, although the harness evaluates models in the completion format intended for base models, the Magicoder models still show significant improvements despite being only instruction-tuned. This implies that LLMs can learn knowledge from the data beyond its format.

3.3. Code Generation for Data Science

The DS-1000 dataset (Lai et al., 2022) contains 1K distinct data science coding issues spanning 7 popular data science libraries in Python. It evaluates the realistic and practical use case of an LLM and offers unit tests for validating each problem. DS-1000 has both completion and insertion modes, but here we only evaluate completion because the base CODELLAMA-PYTHON does not support infilling. Table 3 shows the evaluation results, where we include the recent INCODER (Fried et al., 2023), CodeGen (Nijkamp et al., 2023), Code-Cushman-001 (Microsoft, 2023a), StarCoder (Li et al., 2023), CODELLAMA-PYTHON (Rozière et al., 2023), and WizardCoder (Luo et al., 2023b). We can see from the table that Magicoder-CL-7B already outperforms all the baselines we evaluate, including state-of-the-art WizardCoder-CL-7B and WizardCoder-SC-15B. MagicoderS-CL-7B further breaks the limit by introducing an 8.3 percentage point absolute improvement over WizardCoder-SC-15B.

3.4. Comparison with DeepSeek-Coder

DeepSeek-Coder (Guo et al., 2024) is a series of models released concurrently with our work that demonstrate superior coding performance. We only briefly discuss it in this section because its data and instruction tuning details are not publicly available at the time of writing. We apply the same finetuning strategy on DeepSeek-Coder-Base-6.7B as we performed on CODELLAMA-PYTHON-7B, leading to Magicoder-DS and MagicoderS-DS. Table 4 shows a similar trend to Table 1: the base model can be significantly improved after applying OSS-INSTRUCT. Remarkably, the MagicoderS-DS variant surpasses DeepSeek-Coder-Instruct-6.7B on all the benchmarks with 8× fewer training tokens, and it also closely matches DeepSeek-Coder-Instruct-33B on these datasets.

4. Ablations of Data Source

4.1. Impact of the Language Distribution

To understand the correlation between the programming languages appearing in the training data and the downstream performance of different languages, we conduct an additional ablation study about the training data. We classify the 75K training data into approximately 43K Python-only and 32K non-Python data according to whether ```python is a substring of the generated data. We do not classify the data based on the seed code snippet because LLMs performing OSS-INSTRUCT may produce code in a different programming language than the seed.
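This classification rule amounts to a one-line substring check; a minimal sketch is given below, where the 'response' field name is an assumption about how each generated sample is stored.

    def split_by_language(samples, key="response"):
        # Partition generated samples into Python and non-Python subsets,
        # based on whether the generated text contains a ```python code block.
        python_data = [s for s in samples if "```python" in s[key]]
        other_data = [s for s in samples if "```python" not in s[key]]
        return python_data, other_data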
Table 5 shows the evaluation results, where we consistently finetune the base CODELLAMA-PYTHON-7B for 2 epochs on different data partitions using the same training hyperparameters explained in Appendix B. From the table, we can see that, as can be imagined, training on Python or non-Python data can substantially boost the performance of the base model in Python or non-Python tasks, respectively. Interestingly, instruction tuning on different programming languages can still boost the overall coding performance, including on out-of-distribution languages. For example, when trained on only non-Python data, Magicoder-CL still achieves a 10.4 percentage point improvement over the base model in the Python-only evaluation. This implies that LLMs can establish correlations between different programming languages and perform transfer learning of deeper code semantics. Finally, we observe a more significant boost in Python evaluation when combining data from both sources, with a slight decrease in multilingual performance compared with only finetuning on multilingual data. We attribute this decrease to the dominant amount of Python data (around 57%) during instruction tuning.

4.2. OSS-INSTRUCT vs. Direct Finetuning

The fact that OSS-INSTRUCT gets an LLM inspired by open-source code snippets may lead to a natural question: why not directly finetune on this open-source code? To answer this question, we follow CodeSearchNet (Husain et al., 2020) to mine semantically relevant comment-function pairs from the same seed document corpus we use to construct the 75K OSS-INSTRUCT dataset. We then train the model to predict the function bodies from the function signatures and comments. We prioritize comment-function pairs that overlap with our 75K seed snippets, resulting in about 11K data points. To align with our 75K samples, we collect the remaining 64K samples using the whole corpus of 75K seed documents. Eventually, we have the same number of comment-function pairs as OSS-INSTRUCT data.

We finetune the base CODELLAMA-PYTHON-7B for 2 epochs using the paired data, following the same training setup discussed in Appendix B. From Table 6, we observe that finetuning on the 75K paired comment-function data even worsens the base model, while OSS-INSTRUCT helps to introduce a substantial boost. We conjecture that the degradation is owing to the substantial noise and inconsistency that exists intrinsically in the data pairs, even though the paired data exhibit a very similar format to HumanEval or MultiPL-E problems. This further shows that data factuality, rather than the format, is essential to code instruction tuning.
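As a rough picture of this direct-finetuning baseline, the sketch below mines docstring-function pairs from Python sources with the ast module; the actual baseline follows CodeSearchNet and is not limited to Python docstrings, so this is only an illustration under simplified assumptions.

    import ast

    def mine_comment_function_pairs(source: str):
        # Extract (docstring, full function) pairs from a Python source file --
        # an illustration of the direct-finetuning baseline, not the exact
        # CodeSearchNet mining recipe used in the paper.
        pairs = []
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return pairs
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                doc = ast.get_docstring(node)
                if doc:  # keep only functions that carry a descriptive comment
                    pairs.append({"comment": doc, "function": ast.unparse(node)})
        return pairs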
Table 2: Pass@1 results of different LLMs on MultiPL-E (Cassano et al., 2022) following the same hyperparameter settings as the WizardCoder paper (Luo et al., 2023b): temperature = 0.2, top_p = 0.95, max_length = 512, and num_samples = 50. We evaluate all 7B models using bigcode-evaluation-harness (Ben Allal et al., 2022) and report other results from WizardCoder.

Model                 Size   Java   JavaScript   C++    PHP    Swift   Rust
CODELLAMA             34B    40.2   41.7         41.4   40.4   35.3    38.7
CODELLAMA-PYTHON      34B    39.5   44.7         39.1   39.8   34.3    39.7
CODELLAMA-INSTRUCT    34B    41.5   45.9         41.5   37.0   37.6    39.3
WizardCoder-CL        34B    44.9   55.3         47.2   47.2   44.3    46.2
StarCoderBase         15B    28.5   31.7         30.6   26.8   16.7    24.5
StarCoder             15B    30.2   30.8         31.6   26.1   22.7    21.8
WizardCoder-SC        15B    35.8   41.9         39.0   39.3   33.7    27.1
CODELLAMA             7B     29.3   31.7         27.0   25.1   25.6    25.5
CODELLAMA-PYTHON      7B     29.1   35.7         30.2   29.0   27.1    27.0
Magicoder-CL          7B     36.4   45.9         36.5   39.5   33.4    30.6
MagicoderS-CL         7B     42.9   57.5         44.4   47.6   44.1    40.3
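Because these multilingual scores are computed from 50 samples per problem rather than a single greedy completion, pass@1 is typically obtained with the unbiased pass@k estimator of Chen et al. (2021); a small sketch follows, with the example counts chosen purely for illustration.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator (Chen et al., 2021):
        # n = samples generated per problem, c = samples that pass the tests.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 50 samples, 12 of them pass the hidden tests.
    score = pass_at_k(n=50, c=12, k=1)  # equals 12/50 for k = 1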
Table 3: Pass@1 results on DS-1000 (completion format) with temperature = 0.2, top_p = 0.5, max_length = 1024, and num_samples = 40, following the same hyperparameter setting used in WizardCoder (Luo et al., 2023b). We evaluate all the 7B models with their preferred prompt formats and report other results from WizardCoder.
Table 4: Pass@1 (greedy decoding) comparison between Magicoder and DeepSeek-Coder (Guo et al., 2024) on HumanEval (+) and MBPP (+). DeepSeek-Coder results are reported from the EvalPlus (Liu et al., 2023b) leaderboard.

Model                      Size   Training Tokens   HumanEval (+)   MBPP (+)      Open-Source Weight   Open-Source Data
DeepSeek-Coder-Base        1.3B   2T                -               55.4 (46.9)                        #
DeepSeek-Coder-Base        6.7B   2T                47.6 (39.6)     70.2 (56.6)                        #
DeepSeek-Coder-Base        33B    2T                51.2 (43.3)     -                                  #
DeepSeek-Coder-Instruct    1.3B   +2B               64.6 (58.5)     63.7 (53.1)                        #
DeepSeek-Coder-Instruct    6.7B   +2B               73.8 (70.1)     72.7 (63.4)                        #
DeepSeek-Coder-Instruct    33B    +2B               78.7 (72.6)     78.7 (66.7)                        #
Magicoder-DS               6.7B   +90M              66.5 (60.4)     75.4 (61.9)
MagicoderS-DS              6.7B   +240M             76.8 (70.7)     75.7 (64.4)
It also indicates the superiority of OSS-INSTRUCT, which can translate these loosely related code fragments into semantically consistent instruction-tuning data.

Table 5: Ablation study of using different programming languages as training data. We show the pass@1 results on HumanEval+ (Liu et al., 2023b) for Python and the average pass@1 results on MultiPL-E (Cassano et al., 2022) for the same set of programming languages used in Table 2 (i.e., Java, JavaScript, C++, PHP, Swift, and Rust). All the variants are finetuned for 2 epochs and evaluated through greedy decoding.

Table 6: Comparison between OSS-INSTRUCT and directly finetuning on comment-function pairs with CODELLAMA-PYTHON-7B as the base model.

Finetuning Data                 HumanEval+   MultiPL-E
Base model w/o finetuning       34.1         29.6
Comment-function pairs (75K)    34.1         24.1
OSS-INSTRUCT (75K)              55.5         37.8

4.3. OSS-INSTRUCT with a Less Powerful Teacher

In this section, we explore the factors contributing to the effectiveness of OSS-INSTRUCT beyond just the distillation of the teacher model. We propose two potential key reasons. First, since the base model is pretrained with comprehensive code data, the distillation process likely activates the model's internal capabilities, leading to improved performance in coding tasks. Second, OSS-INSTRUCT uses seed code snippets to generate problem-solution pairs in one shot. These seed snippets provide valuable context, enabling the model to create better solutions than a plain teacher model lacking such seed information. These enhanced solutions can then be used to train more effective student models. To verify these points, we conduct an additional experiment by generating a subset of 20K OSS-INSTRUCT data using Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024), a state-of-the-art, general-purpose, open-source LLM.

Table 7: Pass@1 on HumanEval+ and MBPP+ when finetuning CODELLAMA-PYTHON-7B for 2 epochs on 20K OSS-INSTRUCT data generated by Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024).

Model                         HumanEval+   MBPP+
Mixtral-8x7B-Instruct-v0.1    39.6         47.4
CODELLAMA-PYTHON-7B           34.1         45.4
Magicoder-CL-Mixtral-7B       55.5         50.4

Table 7 indicates that Magicoder-CL-Mixtral-7B not only significantly improves over the base CODELLAMA-PYTHON, but is also better than Mixtral-8x7B-Instruct-v0.1 (i.e., the teacher model) across HumanEval+ and MBPP+. These results suggest that OSS-INSTRUCT is not simply distilling a teacher model, but also triggering the base model's own capability and effectively leveraging the information encapsulated in seed code snippets.
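A hedged sketch of how such an open-source teacher could be queried with an OSS-INSTRUCT-style prompt through the transformers chat template is shown below; the prompt string is an abbreviated stand-in for the template in Appendix A.1 and the decoding settings are illustrative, not the exact configuration used for the 20K subset.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

    def generate_with_open_teacher(seed_snippet: str) -> str:
        # Abbreviated stand-in for the OSS-INSTRUCT prompt template in Appendix A.1.
        prompt = ("Gain inspiration from the following random code snippet and create a "
                  "high-quality coding problem together with a correct solution.\n\n" + seed_snippet)
        messages = [{"role": "user", "content": prompt}]
        inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
        out = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.95)
        return tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)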
5. Related Work

Foundation models for code  Trained over billions of lines of code, LLMs have demonstrated outstanding performance in a wide range of software engineering tasks, including code generation (Chen et al., 2021; Austin et al., 2021), program repair (Xia & Zhang, 2022; Wei et al., 2023; Xia et al., 2023b; Jiang et al., 2023b; Bouzenia et al., 2024), and software testing (Xia et al., 2023a; Deng et al., 2023; Yuan et al., 2023; Schäfer et al., 2023; Lemieux et al., 2023). In particular, prominent base models, such as CodeGen (Nijkamp et al., 2023), CodeT5 (Wang et al., 2021), StarCoder (Li et al., 2023), and CODELLAMA (Rozière et al., 2023), are pre-trained over huge codebases from scratch, establishing the fundamental ability of general code generation and understanding. More recent code LLMs, such as DeepSeek-Coder (Guo et al., 2024) and StarCoder2 (Lozhkov et al., 2024), additionally organize the pretraining data at the repository level to enhance the model's contextual understanding capabilities. Furthermore, these base models are also finetuned (Luo et al., 2023b) or prompted (Chen et al., 2023) to unlock their true potential to specialize in solving domain-specific coding tasks.
Instruction tuning with synthetic data  Instruction tuning aims to improve pretrained LLMs by finetuning them with a mixture of instructions and corresponding responses (Wei et al., 2022). However, obtaining high-quality instructional data is oftentimes laborious. Hence, researchers are increasingly focusing on the development of methods to generate synthetic instruction data. Wang et al. (2023a) introduce SELF-INSTRUCT, where a foundation LLM (GPT-3 (Brown et al., 2020)) is used to generate synthetic instruction-response pairs with carefully crafted prompts. The same LLM is then instruction-tuned on the synthetic data to distill such self-generated knowledge. This technique has been further extended to create synthetic data with different LLMs. For example, Alpaca (Taori et al., 2023) and Code Alpaca (Chaudhary, 2023) apply SELF-INSTRUCT to finetune LLAMA with ChatGPT-generated instructions. To improve SELF-INSTRUCT, WizardLM (Xu et al., 2023) and WizardCoder (Luo et al., 2023a) propose Evol-Instruct and Code Evol-Instruct by guiding ChatGPT with heuristic prompts to make the synthetic data more complex and diverse. More recently, Gunasekar et al. (2023) show that textbook-quality synthetic data alone can help the model achieve remarkable coding and reasoning capabilities. Orthogonal to all existing methods, our proposed OSS-INSTRUCT allows LLMs to get inspired from real-world code snippets for better controllability, quality, and creativity in coding tasks.

Evaluating LLMs for code  Most code benchmarks evaluate LLMs on generating single-function programs from natural language descriptions. Such benchmarks include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), APPS (Hendrycks et al., 2021), and CodeContests (Li et al., 2022). A handful of manual tests are used to assess the functional correctness of LLM-generated solutions. However, insufficient tests can lead to false negatives. Consequently, the EvalPlus framework (Liu et al., 2023b) produces HumanEval+ and MBPP+ by extending 80×/35× more tests. To address dataset contamination issues, researchers propose LiveCodeBench (Jain et al., 2024), which compiles fresh coding problems not included in model training, and EvoEval (Xia et al., 2024), which strategically leverages LLMs to evolve existing benchmarks into new coding tasks. Meanwhile, there are comprehensive benchmarks evaluating code generation for data science (DS-1000 (Lai et al., 2022)), addressing open-source issues (SWE-bench (Jimenez et al., 2023)), and repository-level code generation (CROSSCODEEVAL (Ding et al., 2023) and RepoEval (Zhang et al., 2023)).

6. Conclusion and Future Work

We propose OSS-INSTRUCT, a novel data generation method using Large Language Models to generate diverse coding challenges from open-source code snippets. This approach enables Magicoder, which significantly improves the base LLM. Despite having no more than 7B parameters, it can outperform all evaluated LLMs with less than or equal to 16B parameters, including the 15B WizardCoder. Combining OSS-INSTRUCT with Evol-Instruct allows us to build the enhanced MagicoderS models. They achieve remarkable results by rivaling leading models like ChatGPT on the HumanEval benchmarks. We fully open source the model weights, training data, and source code to enable future research in LLMs for code. In the near future, we will apply OSS-INSTRUCT to larger base models. We will also continue advancing OSS-INSTRUCT by generating higher-quality data with a strategically designed distribution of the seed code snippets and with more advanced teacher LLMs such as GPT-4.

Acknowledgement

We thank all the reviewers for their insightful comments and suggestions for our paper. This work was partially supported by NSF grant CCF-2131943, as well as Kwai Inc.

Impact Statement

This work aims to boost large language models in terms of their code generation and understanding capabilities through instruction tuning. The proposed OSS-INSTRUCT method leverages the abundance of open source to generate diverse and controllable instruction data. We expect this idea to also foster innovative software solutions tailored to domain-specific needs, particularly in areas where real data is private and scarce, by generating extensive synthetic data. Additionally, our method reinforces the value of community-driven content and knowledge sharing by incorporating open-source code as references.

However, it is essential to recognize the potential for misuse, such as the deliberate generation of vulnerable code that can be exploited for malicious purposes. Ultimately, adhering to ethical guidelines is crucial to ensure the responsible use of this technique.

References

Allal, L. B., Li, R., Kocetkov, D., Mou, C., Akiki, C., Ferrandis, C. M., Muennighoff, N., Mishra, M., Gu, A., Dey, M., Umapathi, L. K., Anderson, C. J., Zi, Y., Poirier, J. L., Schoelkopf, H., Troshin, S., Abulkhanov, D., Romero, M., Lappert, M., Toni, F. D., del Río, B. G., Liu, Q., Bose, S., Bhattacharyya, U., Zhuo, T. Y., Yu, I., Villegas, P., Zocca, M., Mangrulkar, S., Lansky, D., Nguyen, H., Contractor, D., Villa, L., Li, J., Bahdanau, D., Jernite, Y., Hughes, S., Fried, D., Guha, A., de Vries, H., and von Werra, L. Santacoder: don't reach for the stars!, 2023.

Austin, J., Odena, A., Nye, M. I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C. J., Terry, M., Le, Q. V., and Sutton, C. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.

Bouzenia, I., Devanbu, P., and Pradel, M. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Cambronero, J., Gulwani, S., Le, V., Perelman, D., Radhakrishna, A., Simon, C., and Tiwari, A. Flashfill++: Scaling programming by example by cutting to the chase. Proc. ACM Program. Lang., 7(POPL), jan 2023. doi: 10.1145/3571226. URL https://doi.org/10.1145/3571226.

Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022.

Chaudhary, S. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.

Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug, 2023.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021.

Deng, Y., Xia, C. S., Peng, H., Yang, C., and Zhang, L. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023.

Ding, Y., Wang, Z., Ahmad, W. U., Ding, H., Tan, M., Jain, N., Ramanathan, M. K., Nallapati, R., Bhatia, P., Roth, D., and Xiang, B. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=wgDcbBMSfh.

Feng, Y., Martins, R., Bastani, O., and Dillig, I. Program synthesis using conflict-driven learning. SIGPLAN Not., 53(4):420–435, jun 2018. ISSN 0362-1340. doi: 10.1145/3296979.3192382. URL https://doi.org/10.1145/3296979.3192382.

Fried, D., Aghajanyan, A., Lin, J., Wang, S., Wallace, E., Shi, F., Zhong, R., Yih, S., Zettlemoyer, L., and Lewis, M. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.

Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017. ISSN 2325-1107. doi: 10.1561/2500000010. URL http://dx.doi.org/10.1561/2500000010.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y. K., Luo, F., Xiong, Y., and Liang, W. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps, 2021.
Honovich, O., Scialom, T., Levy, O., and Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.

Hugging Face. Hugging face: The ai community building the future. https://huggingface.co/, 2023. Accessed: 2023-12-01.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. Codesearchnet challenge: Evaluating the state of semantic code search, 2020.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023a.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.

Jiang, N., Liu, K., Lutellier, T., and Tan, L. Impact of code language models on automated program repair, 2023b.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues?, 2023.

Kocetkov, D., Li, R., Allal, L. B., Li, J., Mou, C., Ferrandis, C. M., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The stack: 3 tb of permissively licensed source code, 2022.

Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., tau Yih, S. W., Fried, D., Wang, S., and Yu, T. Ds-1000: A natural and reliable benchmark for data science code generation, 2022.

Lemieux, C., Inala, J. P., Lahiri, S. K., and Sen, S. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 919–931. IEEE, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, R., Stillerman, J., Patel, S. S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z., Fahmy, N., Bhattacharyya, U., Yu, W., Singh, S., Luccioni, S., Villegas, P., Kunakov, M., Zhdanov, F., Romero, M., Lee, T., Timor, N., Ding, J., Schlesinger, C., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Robinson, J., Anderson, C. J., Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C. M., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. Starcoder: may the source be with you!, 2023.

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D. J., Sutherland Robson, E., Kohli, P., de Freitas, N., Kavukcuoglu, K., and Vinyals, O. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, December 2022. ISSN 1095-9203. doi: 10.1126/science.abq1158. URL http://dx.doi.org/10.1126/science.abq1158.

Liu, J., Peng, J., Wang, Y., and Zhang, L. Neuri: Diversifying dnn generation via inductive rule inference. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pp. 657–669, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400703270. doi: 10.1145/3611643.3616337. URL https://doi.org/10.1145/3611643.3616337.

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=1qvx610Cu7.
Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T. Y., Zheltonozhskii, E., Dade, N. O. O., Yu, W., Krauß, L., Jain, N., Su, Y., He, X., Dey, M., Abati, E., Chai, Y., Muennighoff, N., Tang, X., Oblokulov, M., Akiki, C., Marone, M., Mou, C., Mishra, M., Gu, A., Hui, B., Dao, T., Zebaze, A., Dehaene, O., Patry, N., Xu, C., McAuley, J., Hu, H., Scholak, T., Paquet, S., Robinson, J., Anderson, C. J., Chapados, N., Patwary, M., Tajbakhsh, N., Jernite, Y., Ferrandis, C. M., Zhang, L., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. Starcoder 2 and the stack v2: The next generation, 2024.

Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023a.

Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. Wizardcoder: Empowering code large language models with evol-instruct, 2023b.

Microsoft. Azure openai service models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models, 2023a.

Microsoft. GitHub Copilot – Your AI pair programmer. https://github.com/features/copilot, 2023b.

Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. Octopack: Instruction tuning code large language models, 2023.

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.

Olausson, T. X., Inala, J. P., Wang, C., Gao, J., and Solar-Lezama, A. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=y0GJXRungR.

OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.

OpenAI. Gpt-4 technical report, 2023.

Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., and Synnaeve, G. Code llama: Open foundation models for code, 2023.

Schäfer, M., Nadi, S., Eghbali, A., and Tip, F. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023.

Services, A. W. AI Code Generator - Amazon CodeWhisperer - AWS. https://aws.amazon.com/codewhisperer/, 2023.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost, 2018.

Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. 28(1):11–21, 1972. doi: 10.1108/eb026526. URL https://doi.org/10.1108/eb026526.

Su, H., Shi, W., Kasai, J., Wang, Y., Hu, Y., Ostendorf, M., Yih, W.-t., Smith, N. A., Zettlemoyer, L., and Yu, T. One embedder, any task: Instruction-finetuned text embeddings. 2022. URL https://arxiv.org/abs/2212.09741.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

theblackcat102. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1, 2023.

Wang, X., Dillig, I., and Singh, R. Program synthesis using abstraction refinement. Proc. ACM Program. Lang., 2(POPL), dec 2017. doi: 10.1145/3158151. URL https://doi.org/10.1145/3158151.

Wang, Y., Wang, W., Joty, S., and Hoi, S. C. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https://aclanthology.org/2021.emnlp-main.685.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
Wang, Y., Le, H., Gotmare, A. D., Bui, N. D. Q., Li, J., and Hoi, S. C. H. Codet5+: Open code large language models for code understanding and generation, 2023b.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners, 2022.
Length distribution  We depict the length distribution for both generated problems and solutions in Figure 7. The x-axis represents the number of tokens in each problem/solution, while the y-axis shows the corresponding number of samples.
B. Implementation Details

B.1. Data Generation

We use gpt-3.5-turbo-1106 as the foundation model to perform OSS-INSTRUCT due to its high cost-effectiveness. We randomly extract 1–15 lines from each selected code document from starcoderdata and let gpt-3.5-turbo-1106 imagine a self-contained coding problem and a correct solution. Given the numerous seed code snippets, we perform greedy decoding to maximize the consistency between the generated problems and solutions.
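A minimal sketch of this generation call with the OpenAI Python client is shown below; the prompt string is an abbreviated stand-in for the full template in Appendix A.1, temperature=0 approximates greedy decoding, and the helper name is hypothetical rather than taken from the released scripts.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_problem_and_solution(seed_snippet: str) -> str:
        # Abbreviated stand-in for the full OSS-INSTRUCT prompt in Appendix A.1.
        prompt = ("Gain inspiration from the following random code snippet and create a "
                  "self-contained coding problem plus a correct solution.\n\n" + seed_snippet)
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # greedy decoding for consistency between problem and solution
        )
        return response.choices[0].message.content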
Problem
Your task is to complete the `render` method to generate the rendered shape as a string...
Code
class ShapeRenderer {
  constructor(vertices) {
    this.vertices = vertices;
  }
  render() {
    let renderedShape = "";
    for (let i = 0; i < this.vertices.length; i++) {
      const vertex = this.vertices[i];
      renderedShape += `(${vertex.x}, ${vertex.y})`;
      if (i < this.vertices.length - 1) {
        renderedShape += " - ";
      }
    }
    return renderedShape;
  }
}

Problem
Create a Python program that generates an error file based on a given dataset...
Code
def generate_error_file(dataset_file, ...):
    error_lines = []
    with open(dataset_file, 'r') as file:
        for line in file:
            ...
    with open(error_file_name, 'w') as error_file:
        for error_line in error_lines:
            error_file.write(error_line + '\n')

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: ...")
    else:
        dataset_file = sys.argv[1]
        dataset_number = sys.argv[2]
        generate_error_file(...)

Figure 5: More examples showing how OSS-INSTRUCT generates problems and solutions from seed code snippets. Detailed problem requirements, implementations, and explanations are omitted for brevity.
Figure 7: Length distribution of the generated problems and solutions (legend: problem, solution; x-axis: Number of Tokens; y-axis: #Count (Thousand)).
B.3. Training

We employ CODELLAMA-PYTHON-7B and DeepSeek-Coder-Base 6.7B as the base LLMs. To obtain the Magicoder series, we first finetune the base models on about 75K synthetic data generated through OSS-INSTRUCT using the transformers library from Hugging Face (Hugging Face, 2023). We finetune the base models for 2 epochs using two NVIDIA A100-80GB GPUs through the Distributed Data Parallel (DDP) module from PyTorch. We set the initial learning rate at 5e-5 with 15 warmup steps and a linear scheduler. We use Adafactor (Shazeer & Stern, 2018) as our optimizer and choose a batch size of 512 with a sequence truncation length of 1216. To obtain MagicoderS, we continue to finetune the Magicoder models with the evol-codealpaca-v1 dataset, an open-source Evol-Instruct implementation containing about 110K samples. We use the same hyperparameters except for 15 warmup steps and a 1024 maximum sequence length.
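Expressed with the Hugging Face Trainer, the reported hyperparameters could be wired up roughly as in the sketch below; the per-device batch size and gradient accumulation split is an assumed decomposition of the effective batch size of 512 across 2 GPUs, and the model and train_dataset arguments are expected to be a prepared causal LM and a tokenized instruction-response dataset.

    from transformers import Trainer, TrainingArguments

    def build_trainer(model, train_dataset):
        # Assumed decomposition: 2 GPUs x per-device batch size 4 x gradient
        # accumulation 64 gives an effective batch of 512 sequences.
        args = TrainingArguments(
            output_dir="magicoder-ckpt",
            num_train_epochs=2,
            learning_rate=5e-5,
            warmup_steps=15,
            lr_scheduler_type="linear",
            optim="adafactor",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=64,
            bf16=True,
            logging_steps=10,
        )
        return Trainer(model=model, args=args, train_dataset=train_dataset)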
Meanwhile, the DeepSeek-Coder-based MagicoderS-DS achieves the best result among all the evaluated baselines, substantially outperforming the instruction-tuned DeepSeek-Coder-6.7B-Instruct.

Table 8: Pass@1 results on APPS evaluated using greedy decoding in a zero-shot setting.

Table 9: Pass@1 results on DS-1000 (insertion format) with temperature = 0.2, top_p = 0.5, max_length = 1024, and num_samples = 40.

Table 10: Pass@1 on HumanEval+ when finetuning CODELLAMA-PYTHON-7B with and without noisy samples.
Prompt: Write a function that takes an integer a and returns True if this integer is a cube of some integer number. Note: you should check whether the input is valid.

DeepSeek-Coder-6.7B-Base:
def is_cube(a):
    if a <= 0:
        return False
    x = int(a ** (1/3))
    return x**3 == a

Magicoder-DS-6.7B:
def is_cube(a):
    if not isinstance(a, int) or a < 0:
        return False  # Invalid input

    # Check if the number is a perfect cube
    x = int(a ** (1/3))
    return x**3 == a

Prompt: Implement a Scala function that takes a string as input and returns a new string where all the characters are reversed except the first and last characters. Reply only with the function.

DeepSeek-Coder-6.7B-Base:
def reverseExceptFirstLast(str: String): String = {
  val chars = str.toCharArray
  val first = chars(0)
  val last = chars(chars.length - 1)
  val middle = chars.slice(1, chars.length - 1).reverse
  first + middle + last
}

Magicoder-DS-6.7B:
def reverseExceptFirstLast(input: String): String = {
  if (input.length <= 2) {
    input
  } else {
    val reversedMiddle = input.substring(1, input.length - 1).reverse
    input.charAt(0) + reversedMiddle + input.charAt(input.length - 1)
  }
}
Magicoder-DS-6.7B

Response: Here is a simple snake game in Python using the turtle library. This game is a basic version and does not include many features that a full-fledged snake game would have.
delay = 0.1
# Score
score = 0
high_score = 0
# Snake head
head = turtle.Turtle()
head.speed(0)
head.shape("square")
head.color("black")
head.penup()
head.goto(0, 0)
head.direction = "stop"
# Snake food
food = turtle.Turtle()
food.speed(0)
food.shape("circle")
food.color("red")
food.penup()
food.goto(0, 100)
segments = []
# Pen
pen = turtle.Turtle()
pen.speed(0)
pen.shape("square")
pen.color("white")
pen.penup()
pen.hideturtle()
pen.goto(0, 260)
pen.write("Score: 0 High Score: 0", align="center", font=("Courier", 24, "normal"))
# Functions
def go_up():
    if head.direction != "down":
        head.direction = "up"

def go_down():
    if head.direction != "up":
        head.direction = "down"

def go_left():
    if head.direction != "right":
        head.direction = "left"

def go_right():
    if head.direction != "left":
        head.direction = "right"

def move():
    if head.direction == "up":
        y = head.ycor()
        head.sety(y + 20)
    if head.direction == "down":
        y = head.ycor()
        head.sety(y - 20)
    if head.direction == "left":
        x = head.xcor()
        head.setx(x - 20)
    if head.direction == "right":
        x = head.xcor()
        head.setx(x + 20)
# Keyboard bindings
wn.listen()
wn.onkeypress(go_up, "w")
wn.onkeypress(go_down, "s")
wn.onkeypress(go_left, "a")
wn.onkeypress(go_right, "d")
normal"))
# Add a segment
new_segment = turtle.Turtle()
new_segment.speed(0)
new_segment.shape("square")
new_segment.color("grey")
new_segment.penup()
segments.append(new_segment)
pen.clear()
pen.write("Score: {} High Score: {}".format(score, high_score), align="center", font=("Courier", 24, "
normal"))
move()
time.sleep(delay)
wn.mainloop()
This game is a simple version of the classic snake game. The snake starts with a single segment and grows in size when it
eats food. The game ends when the snake hits the border or its own body. The score is increased by 10 points for each
piece of food eaten. The game speed increases as the snake grows.
while (true) {
System.out.println("Player " + currentPlayer + ", enter your move (row column):");
Scanner scanner = new Scanner(System.in);
int row = scanner.nextInt();
int col = scanner.nextInt();
if (isValidMove(row, col)) {
makeMove(row, col);
printBoard();
if (!hasValidMove()) {
System.out.println("Player " + currentPlayer + " has no valid moves. Game over!");
break;
}
currentPlayer = (currentPlayer == BLACK) ? WHITE : BLACK;
} else {
System.out.println("Invalid move. Try again.");
}
}
}
}
}
return false;
}
private static boolean isValidDirection(int row, int col, int rowDir, int colDir) {
int newRow = row + rowDir;
int newCol = col + colDir;
while (newRow >= 0 && newRow < BOARD_SIZE && newCol >= 0 && newCol < BOARD_SIZE) {
if (board[newRow][newCol] == EMPTY) {
return false;
}
if (board[newRow][newCol] == currentPlayer) {
return true;
}
newRow += rowDir;
newCol += colDir;
}
return false;
}
private static void flipDirection(int row, int col, int rowDir, int colDir) {
int newRow = row + rowDir;
int newCol = col + colDir;
while (newRow >= 0 && newRow < BOARD_SIZE && newCol >= 0 && newCol < BOARD_SIZE) {
if (board[newRow][newCol] == EMPTY) {
return;
}
if (board[newRow][newCol] == currentPlayer) {
while (true) {
newRow -= rowDir;
newCol -= colDir;
if (newRow == row && newCol == col) {
break;
}
board[newRow][newCol] = currentPlayer;
}
return;
}
newRow += rowDir;
newCol += colDir;
}
}
This program starts with an 8x8 board with two pieces in the center of the board, one black and one white. The player who goes first is black. The game ends when there are no more valid moves for either player. The player enters their move as two numbers, the row and the column. If the move is valid, the game makes the move and switches the player. If the move is not valid, the player is asked to enter a valid move.
Prompt: Write a gradio (3.48.0) application for the following use case: Take an input image and return a 45 degree clockwise rotated image. You should also add text description under the output showing the rotation degree.

Response:
import gradio as gr
import numpy as np
from PIL import Image

def rotate_image(img):
    img = Image.fromarray(img)
    rotated_img = img.rotate(-45)
    return np.array(rotated_img), "45 degree clockwise rotation"

demo = gr.Interface(
    fn=rotate_image,
    inputs=gr.inputs.Image(type="numpy"),
    outputs=[gr.outputs.Image(type="numpy"), "text"],
    title="Image Rotation",
    description="Upload an image and get a 45 degree clockwise rotated image",
)

demo.launch()
This script defines a function rotate_image that takes an image as input, rotates it 45 degrees clockwise using the PIL library, and returns the rotated image and a text description. The gr.Interface function is then used to create a Gradio interface for this function.

The gr.inputs.Image(type="numpy") line specifies that the input to the function should be an image, and the gr.outputs.Image(type="numpy") line specifies that the output should be an image. The "text" output type is used to display the text description.

Finally, demo.launch() is used to launch the Gradio interface.
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
batch_size = 100
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

model = ConvNet().to(device)

# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)

if (i+1) % 100 == 0:
    print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
          .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

loss_list.append(loss.item())

print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))
This script first loads the MNIST dataset and then defines a Convolutional Neural Network (CNN) with two convolutional
layers. The CNN is trained for 5 epochs using the Adam optimizer and the Cross Entropy Loss function. After training,
the model is tested on the test dataset and the accuracy is printed. Finally, a chart of the training loss is drawn.
F. Limitations

While Magicoders prove to be effective by being instruction-tuned with OSS-INSTRUCT and Evol-Instruct generated data, we apply our techniques to models with no more than 7B parameters due to resource constraints. This limitation hinders our ability to observe the method's impact on larger models. Moreover, the OSS-INSTRUCT prompt is sophisticated and may require a relatively strong model to comprehend the correct intent. Weaker models may fail to produce plausible instruction data. Future research may explore applying OSS-INSTRUCT to models at different capacities in the context of self-training.