Magicoder: Empowering Code Generation with OSS-INSTRUCT

Yuxiang Wei 1   Zhe Wang 2†   Jiawei Liu 1   Yifeng Ding 1   Lingming Zhang 1
Figure 1: Overview of OSS-INSTRUCT and the pass@1 results of different LLMs on HumanEval (+).

OSS-INSTRUCT leverages a powerful LLM to automatically generate new coding problems by drawing inspiration from any random code snippets collected from the open source. In this example, the LLM gets inspired by two incomplete code fragments from different functions and manages to relate them and craft a realistic machine learning problem. Thanks to the "infinite" real-world open-source code, OSS-INSTRUCT can directly produce diverse, realistic, and controllable code instructions by providing distinct seed code snippets. In the end, we generate 75K synthetic data to finetune CODELLAMA-PYTHON-7B, resulting in Magicoder-CL. While being simple and effective, OSS-INSTRUCT is orthogonal to existing data generation methods, and they can be combined to further boost the models' coding capabilities. Therefore, we continually finetune Magicoder-CL on an open-source Evol-Instruct dataset with 110K entries, producing MagicoderS-CL.

We evaluate Magicoder and MagicoderS on a wide range of coding tasks, including HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) for Python text-to-code generation, MultiPL-E (Cassano et al., 2022) for multilingual code completion, and DS-1000 (Lai et al., 2022) for solving data science problems. We further adopt EvalPlus (Liu et al., 2023b), which includes the augmented HumanEval+ and MBPP+ datasets for more rigorous model evaluation. Both Magicoder-CL and MagicoderS-CL substantially boost the base CODELLAMA-PYTHON-7B. Additionally, Magicoder-CL even outperforms WizardCoder-CL-7B, WizardCoder-SC-15B, and all studied SOTA LLMs with less than or equal to 16B parameters on all the benchmarks we tested. Also, the pass@1 result of the enhanced MagicoderS-CL is on par with ChatGPT on HumanEval (70.7 vs. 72.6) and surpasses it on the more rigorous HumanEval+ (66.5 vs. 65.9), indicating that MagicoderS-CL can generate more robust code. It also achieves SOTA results among all code models at the same scale.

Additionally, we notice a very recent advancement in the development of the DeepSeek-Coder series (Guo et al., 2024), which has shown exceptional coding performance. However, due to the limited technical details disclosed, we only briefly discuss them in §3.4. Despite this, we applied OSS-INSTRUCT on DeepSeek-Coder-Base 6.7B, resulting in the creation of Magicoder-DS and MagicoderS-DS. In addition to the consistent findings on the previous results with CODELLAMA-PYTHON-7B as the base model, Magicoder-DS and MagicoderS-DS benefit from the more powerful DeepSeek-Coder-Base-6.7B. This advantage is demonstrated by MagicoderS-DS, which achieves a remarkable 76.8 pass@1 on HumanEval. MagicoderS-DS also outperforms DeepSeek-Coder-Instruct-6.7B on HumanEval (+) and MBPP (+) with 8× fewer finetuning tokens.
To justify the design of OSS-INSTRUCT, i.e., generating instruction-tuning data from open-source references rather than using the references directly, we demonstrate that finetuning the base models with semantically relevant comment-function pairs extracted from open-source projects even negatively impacts the model performance (§4.2).

In general, we make the following contributions:

• We introduce OSS-INSTRUCT, a pioneering approach to enlightening LLMs with open-source code snippets to generate more diverse, realistic, and controllable coding instruction data, which can be leveraged to substantially boost the performance of various LLMs via instruction tuning. It opens a new dimension for creating low-bias and diverse instruction-tuning data from the abundance of open-source references.

• We build the Magicoder series trained with OSS-INSTRUCT and the MagicoderS series trained on a combination of OSS-INSTRUCT and Evol-Instruct. Our evaluation across 6 benchmarks shows that all Magicoders significantly improve the base LLMs. Notably, both MagicoderS-CL and MagicoderS-DS outperform ChatGPT on HumanEval+ with only 7B parameters.

• We fully open source the model weights, training data, and source code at https://github.com/ise-uiuc/magicoder to facilitate future research.

2. OSS-INSTRUCT: Instruction Tuning from Open Source

In this section, we elaborate on our OSS-INSTRUCT approach. From a high level, as shown in Figure 1, OSS-INSTRUCT works by prompting an LLM (e.g., ChatGPT) to generate a coding problem and its solution according to some seed code snippet collected from the wild (e.g., from GitHub). The seed snippet offers controllability of the generation and encourages the LLM to create diverse coding problems that can reflect real-world programming scenarios.

2.1. Generating Coding Problems

OSS-INSTRUCT is powered by seed code snippets that can be easily collected from open source. In this work, we directly adopt starcoderdata as our seed corpus, a filtered version of The Stack (Kocetkov et al., 2022) dataset that StarCoder is trained on, containing permissively licensed source code documents in various programming languages. We chose starcoderdata because it is widely adopted, includes massive high-quality code snippets, and is even post-processed for data decontamination (Li et al., 2023; Allal et al., 2023). For each code document from the corpus, we randomly extract 1–15 consecutive lines as the seed snippet for the model to gain inspiration from and produce coding problems. In total, we collected 80K initial seed snippets from 80K code documents: 40K from Python, and 5K from each of C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift. Then, each collected seed code snippet is applied to the prompt template shown in Appendix A.1, which a teacher model takes as input and outputs both a coding problem and its solution.
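To make this procedure concrete, the following minimal Python sketch illustrates how a seed snippet of 1–15 consecutive lines could be drawn from a code document and spliced into a prompt. The prompt text here is an abbreviated placeholder rather than the actual template from Appendix A.1, and extract_seed/build_prompt are hypothetical helper names, not part of the released pipeline.

    import random

    # Abbreviated stand-in for the OSS-INSTRUCT prompt template in Appendix A.1.
    PROMPT_TEMPLATE = (
        "Please gain inspiration from the following random code snippet to create a "
        "high-quality programming problem and its self-contained solution.\n\n"
        "Code snippet for inspiration:\n{seed}\n"
    )

    def extract_seed(document: str, min_lines: int = 1, max_lines: int = 15) -> str:
        # Randomly pick 1-15 consecutive lines from a (non-empty) code document.
        lines = document.splitlines()
        n = random.randint(min_lines, min(max_lines, len(lines)))
        start = random.randint(0, len(lines) - n)
        return "\n".join(lines[start:start + n])

    def build_prompt(document: str) -> str:
        return PROMPT_TEMPLATE.format(seed=extract_seed(document))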
2.2. Data Cleaning and Decontamination

We perform data cleaning by excluding samples that are identical or share the same seed code snippet. While there exist other sorts of noisiness (e.g., the solution is incomplete) in the generated data, inspired by Honovich et al. (2023), they are not removed as we believe they still contain valuable information for LLMs to learn. More experimental details can be found in Appendix C.3. Finally, we apply the same logic as StarCoder (Li et al., 2023) to decontaminate our training data by removing coding problems that contain docstrings or solutions from HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), docstrings from APPS (Hendrycks et al., 2021), prompts from DS-1000 (Lai et al., 2022), or questions from GSM8K (Cobbe et al., 2021). As part of our analysis, the decontamination procedure only filters out 9 additional samples. Since the seed corpus starcoderdata has already gone through rigorous data decontamination, this observation suggests that OSS-INSTRUCT is unlikely to introduce additional data leakage beyond the seeds. The eventual OSS-INSTRUCT dataset contains about 75K entries. An overview of the dataset statistics can be found in Appendix A.3.
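The deduplication rule above (drop samples that are exact duplicates or that reuse a seed snippet) can be pictured with the small sketch below; the dictionary keys 'seed', 'problem', and 'solution' are illustrative assumptions rather than the released data schema.

    def dedup_samples(samples):
        # Drop samples that are identical or that share the same seed code snippet.
        seen_seeds, seen_pairs, kept = set(), set(), []
        for s in samples:
            pair = (s["problem"], s["solution"])
            if s["seed"] in seen_seeds or pair in seen_pairs:
                continue
            seen_seeds.add(s["seed"])
            seen_pairs.add(pair)
            kept.append(s)
        return kept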
2.3. Qualitative Examples of OSS-INSTRUCT

Figure 2 shows some qualitative examples of how OSS-INSTRUCT can help an LLM get inspiration from a seed code snippet to create new coding problems and solutions. For example, the shell script example shows how an LLM crafts a Python coding problem with just one line of shell script. The library imports example demonstrates how an LLM can create a realistic machine learning problem using just a few import statements. Meanwhile, the class signature instance illustrates the ability of an LLM to draw inspiration from an incomplete class definition featuring annotations like SpringBootApplication and keywords such as bank. From this, the LLM generates a problem that requires implementing a complete banking system based on Spring Boot. Overall, OSS-INSTRUCT can inspire an LLM with distinct code structures and semantics to create diverse coding tasks, including algorithmic challenges, realistic issues, single-function code generation, library-based program completion, whole-program development, and even whole-application construction.

Figure 2: Examples showing how OSS-INSTRUCT generates problems and solutions from seed code snippets. Detailed problem requirements, implementations, and explanations are omitted for brevity. More examples can be found in Appendix A.2.

Similarity with HumanEval  To study whether our data generation process produces more HumanEval-like problems or solutions that contribute to high performance, we pair each sample from our 75K dataset with each of the 164 HumanEval (Chen et al., 2021) samples and compute their cosine similarity using TF-IDF (Sparck Jones, 1972) embeddings. We then associate each OSS-INSTRUCT sample with the HumanEval sample with the highest similarity score. We also compare our dataset against Code Alpaca, a 20K dataset applying SELF-INSTRUCT to code, and evol-codealpaca-v1 (theblackcat102, 2023), an open-source reproduction of Evol-Instruct containing 110K coding instructions. We resort to the open-source implementation because the official Code Evol-Instruct (Luo et al., 2023b) dataset is not released. We decontaminate all the datasets beforehand in the same way as discussed in §2.2.

Figure 3: Cosine similarities between HumanEval and synthetic data generated by different methods.

Figure 3 shows that OSS-INSTRUCT exhibits the lowest average similarity among all the studied data generation techniques, while SELF-INSTRUCT shows the highest average similarity.
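A minimal sketch of this similarity computation using scikit-learn is shown below, assuming the generated samples and the 164 HumanEval tasks are available as plain-text lists; it is an illustration of the TF-IDF pairing described above, not the exact analysis script.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def max_similarity_to_humaneval(dataset_texts, humaneval_texts):
        # For each generated sample, return its highest cosine similarity
        # to any HumanEval task, using TF-IDF embeddings fit on both sets.
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(dataset_texts + humaneval_texts)
        data_vecs = matrix[: len(dataset_texts)]
        he_vecs = matrix[len(dataset_texts):]
        sims = cosine_similarity(data_vecs, he_vecs)  # shape: (num samples, 164)
        return sims.max(axis=1)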
3. Evaluation

We choose CODELLAMA-PYTHON-7B and DeepSeek-Coder-Base 6.7B as the base LLMs. To derive the Magicoder series, we first finetune them on the 75K synthetic data generated through OSS-INSTRUCT. We then obtain MagicoderS by continuing finetuning Magicoder with the evol-codealpaca-v1 dataset, an open-source Evol-Instruct implementation containing about 110K samples. More implementation details and additional evaluation results are listed in Appendices B and C. We also present interesting use cases that reflect the effectiveness of instruction tuning in Appendix D and demonstrate Magicoder's capability to generate complex programs in Appendix E.

Table 1: Pass@1 (%) results of different LLMs on HumanEval (+) and MBPP (+) computed with greedy decoding. The abbreviations "CL" and "SC" refer to the base models CODELLAMA-PYTHON and StarCoder, respectively. We report the results consistently from the EvalPlus (Liu et al., 2023b) leaderboard.

Model               Release Date   Size   HumanEval (+)   MBPP (+)      Open-Source Weight   Open-Source Data
GPT-3.5 Turbo       Nov 2023       -      72.6 (65.9)     81.7 (69.4)   #                    #
GPT-4 Turbo         Nov 2023       -      85.4 (81.7)     83.0 (70.7)   #                    #
CODELLAMA-PYTHON    Aug 2023       34B    51.8 (42.7)     67.2 (52.9)                        #
WizardCoder-CL      Sep 2023       34B    73.2 (64.6)     73.2 (59.9)                        #
CodeT5+             May 2023       16B    31.7 (26.2)     54.6 (44.4)
CodeGen-Mono        Mar 2022       16B    32.9 (27.4)     52.6 (43.6)
StarCoder           May 2023       15B    34.1 (29.3)     55.1 (46.1)
CODELLAMA-PYTHON    Aug 2023       13B    42.7 (36.6)     61.2 (50.9)                        #
WizardCoder-SC      Sep 2023       15B    51.9 (45.1)     61.9 (50.6)                        #
StarCoder           May 2023       7B     24.4 (20.7)     33.1 (28.8)
Mistral             Oct 2023       7B     28.7 (23.2)     50.1 (40.9)                        #
CodeT5+             May 2023       6B     29.3 (23.8)     51.9 (40.9)
CodeGen-Mono        Mar 2022       6B     29.3 (25.6)     49.9 (42.1)
CODELLAMA-PYTHON    Aug 2023       7B     37.8 (34.1)     57.6 (45.4)                        #
WizardCoder-CL      Sep 2023       7B     48.2 (40.9)     56.6 (47.1)                        #
Magicoder-CL        Dec 2023       7B     60.4 (55.5)     64.2 (52.6)
MagicoderS-CL       Dec 2023       7B     70.7 (66.5)     68.4 (56.6)

3.1. Python Text-to-Code Generation

HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are two of the most widely used benchmarks for code generation. Each task in these benchmarks includes a task description (e.g., docstring) as the prompt, where LLMs generate corresponding code whose correctness is checked by a handful of test cases. Because tests in these benchmarks can be insufficient, for more rigorous evaluation, we use HumanEval+ and MBPP+, both powered by the EvalPlus framework (Liu et al., 2023b), to obtain 80×/35× more tests. Following prior work (Liu et al., 2023b; Chen et al., 2023), for each task and LLM we use greedy decoding to generate one sample and focus on comparing the pass@1 metric.
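For illustration, a greedy-decoding pass@1 evaluation along these lines could look like the sketch below with a Hugging Face causal LM; check_correctness stands in for the benchmark's test harness (e.g., EvalPlus), and prompt handling is deliberately simplified.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def greedy_pass_at_1(model_name, tasks, check_correctness):
        # tasks: list of dicts with a 'prompt' field; check_correctness(task, completion)
        # is a stand-in for running the benchmark's test cases and returns True/False.
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
        passed = 0
        for task in tasks:
            inputs = tok(task["prompt"], return_tensors="pt")
            out = model.generate(**inputs, do_sample=False, max_new_tokens=512)  # greedy decoding
            completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            passed += check_correctness(task, completion)
        return passed / len(tasks)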
We consider a wide range of baseline models, including CODELLAMA-PYTHON (Rozière et al., 2023), WizardCoder (Luo et al., 2023b), GPT-3.5 Turbo (OpenAI, 2022), GPT-4 Turbo (OpenAI, 2023), StarCoder (Li et al., 2023), CodeT5+ (Wang et al., 2023b), CodeGen-Mono (Nijkamp et al., 2023), and Mistral (Jiang et al., 2023a). All the results are consistently reported from the EvalPlus (Liu et al., 2023b) leaderboard (EvalPlus hash: 1895d2f).

Table 1 shows the pass@1 results of different LLMs on these benchmarks. From the results, we can first observe that Magicoder-CL has a clear improvement over the base CODELLAMA-PYTHON-7B, and outperforms all studied open-source models except CODELLAMA-PYTHON-34B and WizardCoder-CL-34B. Notably, Magicoder-CL surpasses WizardCoder-SC-15B and has a substantial improvement on HumanEval and HumanEval+ over CODELLAMA-PYTHON-34B. MagicoderS-CL demonstrates further improvements by being trained with the orthogonal Evol-Instruct method. MagicoderS-CL outperforms ChatGPT and all other open-source models on HumanEval+. Moreover, although it scores slightly lower than WizardCoder-CL-34B and ChatGPT on HumanEval, it surpasses both of them on the more rigorous HumanEval+ dataset, indicating that MagicoderS-CL may produce more robust code.

3.2. Multilingual Code Generation

In addition to Python, as shown in Table 2, we perform an extensive evaluation on 6 widely used programming languages, i.e., Java, JavaScript, C++, PHP, Swift, and Rust, using the MultiPL-E benchmark (Cassano et al., 2022). We report available results from the WizardCoder paper (Luo et al., 2023b) and evaluate our models consistently through bigcode-evaluation-harness (Ben Allal et al., 2022). We skip proprietary models such as ChatGPT and GPT-4 as they are not supported by the framework. Due to a significant inference latency when running WizardCoder-CL-7B using the harness in our environment, we choose not to include it in our analysis.

The results indicate that Magicoder-CL improves the base CODELLAMA-PYTHON-7B by a large margin across all the studied programming languages. Moreover, Magicoder-CL also achieves better results than the SOTA 15B WizardCoder-SC on half of the programming languages. Additionally, MagicoderS-CL demonstrates further improvement over Magicoder-CL on all programming languages, achieving comparable performance against WizardCoder-CL-34B with only 7B parameters. It is worth noting that Magicoder-CL is only trained with very limited multilingual data but still outperforms other LLMs with similar or even larger sizes.
Also, although the harness evaluates models in the completion format intended for base models, the Magicoder models still show significant improvements despite being only instruction-tuned. This implies that LLMs can learn knowledge from the data beyond its format.

3.3. Code Generation for Data Science

The DS-1000 dataset (Lai et al., 2022) contains 1K distinct data science coding issues spanning 7 popular data science libraries in Python. It evaluates the realistic and practical use case of an LLM and offers unit tests for validating each problem. DS-1000 has both completion and insertion modes, but here we only evaluate completion because the base CODELLAMA-PYTHON does not support infilling. Table 3 shows the evaluation results, where we include the recent INCODER (Fried et al., 2023), CodeGen (Nijkamp et al., 2023), Code-Cushman-001 (Microsoft, 2023a), StarCoder (Li et al., 2023), CODELLAMA-PYTHON (Rozière et al., 2023), and WizardCoder (Luo et al., 2023b). We can see from the table that Magicoder-CL-7B already outperforms all the baselines we evaluate, including state-of-the-art WizardCoder-CL-7B and WizardCoder-SC-15B. MagicoderS-CL-7B further breaks the limit by introducing an 8.3 percentage point absolute improvement over WizardCoder-SC-15B.

3.4. Comparison with DeepSeek-Coder

DeepSeek-Coder (Guo et al., 2024) is a series of models released concurrently with our work that demonstrate superior coding performance. We only briefly discuss it in this section because its data and instruction tuning details are not publicly available at the time of writing. We apply the same finetuning strategy on DeepSeek-Coder-Base-6.7B as we performed on CODELLAMA-PYTHON-7B, leading to Magicoder-DS and MagicoderS-DS. Table 4 shows a similar trend to Table 1: the base model can be significantly improved after applying OSS-INSTRUCT. Remarkably, the MagicoderS-DS variant surpasses DeepSeek-Coder-Instruct-6.7B on all the benchmarks with 8× fewer training tokens, and it also closely matches DeepSeek-Coder-Instruct-33B on these datasets.

4. Ablations of Data Source

4.1. Impact of the Language Distribution

To understand the correlation between the programming languages appearing in the training data and the downstream performance of different languages, we conduct an additional ablation study about the training data. We classify the 75K training data into approximately 43K Python-only and 32K non-Python data according to whether ```python is a substring of the generated data. We do not classify the data based on the seed code snippet because LLMs performing OSS-INSTRUCT may produce code in a different programming language than the seed.
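This classification rule amounts to a one-line substring check; a minimal sketch is given below, where the 'response' field name is an assumption about how each generated sample is stored.

    def split_by_language(samples, key="response"):
        # Partition generated samples into Python and non-Python subsets,
        # based on whether the generated text contains a ```python code block.
        python_data = [s for s in samples if "```python" in s[key]]
        other_data = [s for s in samples if "```python" not in s[key]]
        return python_data, other_data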
Table 5 shows the evaluation results, where we consistently finetune the base CODELLAMA-PYTHON-7B for 2 epochs on different data partitions using the same training hyperparameters explained in Appendix B. From the table, we can see that, as can be imagined, training on Python or non-Python data can substantially boost the performance of the base model in Python or non-Python tasks, respectively. Interestingly, instruction tuning on different programming languages can still boost the overall coding performance, including on out-of-distribution languages. For example, when trained on only non-Python data, Magicoder-CL still achieves a 10.4 percentage point improvement over the base model in the Python-only evaluation. This implies that LLMs can establish correlations between different programming languages and perform transfer learning of deeper code semantics. Finally, we observe a more significant boost in Python evaluation when combining data from both sources, with a slight decrease in multilingual performance compared with only finetuning on multilingual data. We attribute this decrease to the dominant amount of Python data (around 57%) during instruction tuning.

4.2. OSS-INSTRUCT vs. Direct Finetuning

The fact that OSS-INSTRUCT gets an LLM inspired by open-source code snippets may lead to a natural question: why not directly finetune on this open-source code? To answer this question, we follow CodeSearchNet (Husain et al., 2020) to mine semantically relevant comment-function pairs from the same seed document corpus we use to construct the 75K OSS-INSTRUCT dataset. We then train the model to predict the function bodies from the function signatures and comments. We prioritize comment-function pairs that overlap with our 75K seed snippets, resulting in about 11K data points. To align with our 75K samples, we collect the remaining 64K samples using the whole corpus of 75K seed documents. Eventually, we have the same number of comment-function pairs as OSS-INSTRUCT data.

We finetune the base CODELLAMA-PYTHON-7B for 2 epochs using the paired data, following the same training setup discussed in Appendix B. From Table 6, we observe that finetuning on the 75K paired comment-function data even worsens the base model, while OSS-INSTRUCT helps to introduce a substantial boost. We conjecture that the degradation is owing to the substantial noise and inconsistency that exists intrinsically in the data pairs, even though the paired data exhibit a very similar format to HumanEval or MultiPL-E problems. This further shows that data factuality, rather than the format, is essential to code instruction tuning.
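As a rough picture of this direct-finetuning baseline, the sketch below mines docstring-function pairs from Python sources with the ast module; the actual baseline follows CodeSearchNet and is not limited to Python docstrings, so this is only an illustration under simplified assumptions.

    import ast

    def mine_comment_function_pairs(source: str):
        # Extract (docstring, full function) pairs from a Python source file --
        # an illustration of the direct-finetuning baseline, not the exact
        # CodeSearchNet mining recipe used in the paper.
        pairs = []
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return pairs
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                doc = ast.get_docstring(node)
                if doc:  # keep only functions that carry a descriptive comment
                    pairs.append({"comment": doc, "function": ast.unparse(node)})
        return pairs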
Table 2: Pass@1 results of different LLMs on MultiPL-E (Cassano et al., 2022) following the same hyperparameter settings as the WizardCoder paper (Luo et al., 2023b): temperature = 0.2, top_p = 0.95, max_length = 512, and num_samples = 50. We evaluate all 7B models using bigcode-evaluation-harness (Ben Allal et al., 2022) and report other results from WizardCoder.

Model                 Size   Java   JavaScript   C++    PHP    Swift   Rust
CODELLAMA             34B    40.2   41.7         41.4   40.4   35.3    38.7
CODELLAMA-PYTHON      34B    39.5   44.7         39.1   39.8   34.3    39.7
CODELLAMA-INSTRUCT    34B    41.5   45.9         41.5   37.0   37.6    39.3
WizardCoder-CL        34B    44.9   55.3         47.2   47.2   44.3    46.2
StarCoderBase         15B    28.5   31.7         30.6   26.8   16.7    24.5
StarCoder             15B    30.2   30.8         31.6   26.1   22.7    21.8
WizardCoder-SC        15B    35.8   41.9         39.0   39.3   33.7    27.1
CODELLAMA             7B     29.3   31.7         27.0   25.1   25.6    25.5
CODELLAMA-PYTHON      7B     29.1   35.7         30.2   29.0   27.1    27.0
Magicoder-CL          7B     36.4   45.9         36.5   39.5   33.4    30.6
MagicoderS-CL         7B     42.9   57.5         44.4   47.6   44.1    40.3
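Because these multilingual scores are computed from 50 samples per problem rather than a single greedy completion, pass@1 is typically obtained with the unbiased pass@k estimator of Chen et al. (2021); a small sketch follows, with the example counts chosen purely for illustration.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator (Chen et al., 2021):
        # n = samples generated per problem, c = samples that pass the tests.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 50 samples, 12 of them pass the hidden tests.
    score = pass_at_k(n=50, c=12, k=1)  # equals 12/50 for k = 1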
Table 3: Pass@1 results on DS-1000 (completion format) with temperature = 0.2, top_p = 0.5, max_length = 1024, and num_samples = 40, following the same hyperparameter setting used in WizardCoder (Luo et al., 2023b). We evaluate all the 7B models with their preferred prompt formats and report other results from WizardCoder.
Table 4: Pass@1 (greedy decoding) comparison between Magicoder and DeepSeek-Coder (Guo et al., 2024) on HumanEval (+) and MBPP (+). DeepSeek-Coder results are reported from the EvalPlus (Liu et al., 2023b) leaderboard.

Model                      Size   Training Tokens   HumanEval (+)   MBPP (+)      Open-Source Weight   Open-Source Data
DeepSeek-Coder-Base        1.3B   2T                -               55.4 (46.9)                        #
DeepSeek-Coder-Base        6.7B   2T                47.6 (39.6)     70.2 (56.6)                        #
DeepSeek-Coder-Base        33B    2T                51.2 (43.3)     -                                  #
DeepSeek-Coder-Instruct    1.3B   +2B               64.6 (58.5)     63.7 (53.1)                        #
DeepSeek-Coder-Instruct    6.7B   +2B               73.8 (70.1)     72.7 (63.4)                        #
DeepSeek-Coder-Instruct    33B    +2B               78.7 (72.6)     78.7 (66.7)                        #
Magicoder-DS               6.7B   +90M              66.5 (60.4)     75.4 (61.9)
MagicoderS-DS              6.7B   +240M             76.8 (70.7)     75.7 (64.4)
It also indicates the superiority of OSS-INSTRUCT, which can translate these loosely related code fragments into semantically consistent instruction-tuning data.

Table 5: Ablation study of using different programming languages as training data. We show the pass@1 results on HumanEval+ (Liu et al., 2023b) for Python and the average pass@1 results on MultiPL-E (Cassano et al., 2022) for the same set of programming languages used in Table 2 (i.e., Java, JavaScript, C++, PHP, Swift, and Rust). All the variants are finetuned for 2 epochs and evaluated through greedy decoding.

Table 6: Comparison between OSS-INSTRUCT and directly finetuning on comment-function pairs with CODELLAMA-PYTHON-7B as the base model.

Finetuning Data                 HumanEval+   MultiPL-E
Base model w/o finetuning       34.1         29.6
Comment-function pairs (75K)    34.1         24.1
OSS-INSTRUCT (75K)              55.5         37.8

4.3. OSS-INSTRUCT with a Less Powerful Teacher

In this section, we explore the factors contributing to the effectiveness of OSS-INSTRUCT beyond just the distillation of the teacher model. We propose two potential key reasons. First, since the base model is pretrained with comprehensive code data, the distillation process likely activates the model's internal capabilities, leading to improved performance in coding tasks. Second, OSS-INSTRUCT uses seed code snippets to generate problem-solution pairs in one shot. These seed snippets provide valuable context, enabling the model to create better solutions than a plain teacher model lacking such seed information. These enhanced solutions can then be used to train more effective student models. To verify these points, we conduct an additional experiment by generating a subset of 20K OSS-INSTRUCT data using Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024), a state-of-the-art, general-purpose, open-source LLM.

Table 7: Pass@1 on HumanEval+ and MBPP+ when finetuning CODELLAMA-PYTHON-7B for 2 epochs on 20K OSS-INSTRUCT data generated by Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024).

Model                         HumanEval+   MBPP+
Mixtral-8x7B-Instruct-v0.1    39.6         47.4
CODELLAMA-PYTHON-7B           34.1         45.4
Magicoder-CL-Mixtral-7B       55.5         50.4

Table 7 indicates that Magicoder-CL-Mixtral-7B not only significantly improves over the base CODELLAMA-PYTHON, but is also better than Mixtral-8x7B-Instruct-v0.1 (i.e., the teacher model) across HumanEval+ and MBPP+. These results suggest that OSS-INSTRUCT is not simply distilling a teacher model, but also triggering the base model's own capability and effectively leveraging the information encapsulated in seed code snippets.
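A hedged sketch of how such an open-source teacher could be queried with an OSS-INSTRUCT-style prompt through the transformers chat template is shown below; the prompt string is an abbreviated stand-in for the template in Appendix A.1 and the decoding settings are illustrative, not the exact configuration used for the 20K subset.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

    def generate_with_open_teacher(seed_snippet: str) -> str:
        # Abbreviated stand-in for the OSS-INSTRUCT prompt template in Appendix A.1.
        prompt = ("Gain inspiration from the following random code snippet and create a "
                  "high-quality coding problem together with a correct solution.\n\n" + seed_snippet)
        messages = [{"role": "user", "content": prompt}]
        inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
        out = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.95)
        return tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)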
5. Related Work

Foundation models for code  Trained over billions of lines of code, LLMs have demonstrated outstanding performance in a wide range of software engineering tasks, including code generation (Chen et al., 2021; Austin et al., 2021), program repair (Xia & Zhang, 2022; Wei et al., 2023; Xia et al., 2023b; Jiang et al., 2023b; Bouzenia et al., 2024), and software testing (Xia et al., 2023a; Deng et al., 2023; Yuan et al., 2023; Schäfer et al., 2023; Lemieux et al., 2023). In particular, prominent base models, such as CodeGen (Nijkamp et al., 2023), CodeT5 (Wang et al., 2021), StarCoder (Li et al., 2023), and CODELLAMA (Rozière et al., 2023), are pre-trained over huge codebases from scratch, establishing the fundamental ability of general code generation and understanding. More recent code LLMs, such as DeepSeek-Coder (Guo et al., 2024) and StarCoder2 (Lozhkov et al., 2024), additionally organize the pretraining data at the repository level to enhance the model's contextual understanding capabilities. Furthermore, these base models are also finetuned (Luo et al., 2023b) or prompted (Chen et al., 2023) to unlock their true potential to specialize in solving domain-specific coding tasks.
Instruction tuning with synthetic data  Instruction tuning aims to improve pretrained LLMs by finetuning them with a mixture of instructions and corresponding responses (Wei et al., 2022). However, obtaining high-quality instructional data is oftentimes laborious. Hence, researchers are increasingly focusing on the development of methods to generate synthetic instruction data. Wang et al. (2023a) introduce SELF-INSTRUCT, where a foundation LLM (GPT-3 (Brown et al., 2020)) is used to generate synthetic instruction-response pairs with carefully crafted prompts. The same LLM is then instruction-tuned on the synthetic data to distill such self-generated knowledge. This technique has been further extended to create synthetic data with different LLMs. For example, Alpaca (Taori et al., 2023) and Code Alpaca (Chaudhary, 2023) apply SELF-INSTRUCT to finetune LLAMA with ChatGPT-generated instructions. To improve SELF-INSTRUCT, WizardLM (Xu et al., 2023) and WizardCoder (Luo et al., 2023a) propose Evol-Instruct and Code Evol-Instruct by guiding ChatGPT with heuristic prompts to make the synthetic data more complex and diverse. More recently, Gunasekar et al. (2023) show that textbook-quality synthetic data alone can help the model achieve remarkable coding and reasoning capabilities. Orthogonal to all existing methods, our proposed OSS-INSTRUCT allows LLMs to get inspired from real-world code snippets for better controllability, quality, and creativity in coding tasks.

Evaluating LLMs for code  Most code benchmarks evaluate LLMs on generating single-function programs from natural language descriptions. Such benchmarks include HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), APPS (Hendrycks et al., 2021), and CodeContests (Li et al., 2022). A handful of manual tests are used to assess the functional correctness of LLM-generated solutions. However, insufficient tests can lead to false negatives. Consequently, the EvalPlus framework (Liu et al., 2023b) produces HumanEval+ and MBPP+ by extending 80×/35× more tests. To address dataset contamination issues, researchers propose LiveCodeBench (Jain et al., 2024), which compiles fresh coding problems not included in model training, and EvoEval (Xia et al., 2024), which strategically leverages LLMs to evolve existing benchmarks into new coding tasks. Meanwhile, there are comprehensive benchmarks evaluating code generation for data science (DS-1000 (Lai et al., 2022)), addressing open-source issues (SWE-bench (Jimenez et al., 2023)), and repository-level code generation (CROSSCODEEVAL (Ding et al., 2023) and RepoEval (Zhang et al., 2023)).

6. Conclusion and Future Work

We propose OSS-INSTRUCT, a novel data generation method using Large Language Models to generate diverse coding challenges from open-source code snippets. This approach enables Magicoder, which significantly improves the base LLM. Despite having no more than 7B parameters, it can outperform all evaluated LLMs with less than or equal to 16B parameters, including the 15B WizardCoder. Combining OSS-INSTRUCT with Evol-Instruct allows us to build the enhanced MagicoderS models. They achieve remarkable results by rivaling leading models like ChatGPT on the HumanEval benchmarks. We fully open source the model weights, training data, and source code to enable future research in LLMs for code. In the near future, we will apply OSS-INSTRUCT to larger base models. We will also continue advancing OSS-INSTRUCT by generating higher-quality data with a strategically designed distribution of the seed code snippets and with more advanced teacher LLMs such as GPT-4.

Acknowledgement

We thank all the reviewers for their insightful comments and suggestions for our paper. This work was partially supported by NSF grant CCF-2131943, as well as Kwai Inc.

Impact Statement

This work aims to boost large language models in terms of their code generation and understanding capabilities through instruction tuning. The proposed OSS-INSTRUCT method leverages the abundance of open source to generate diverse and controllable instruction data. We expect this idea to also foster innovative software solutions tailored to domain-specific needs, particularly in areas where real data is private and scarce, by generating extensive synthetic data. Additionally, our method reinforces the value of community-driven content and knowledge sharing by incorporating open-source code as references.

However, it is essential to recognize the potential for misuse, such as the deliberate generation of vulnerable code that can be exploited for malicious purposes. Ultimately, adhering to ethical guidelines is crucial to ensure the responsible use of this technique.

References

Allal, L. B., Li, R., Kocetkov, D., Mou, C., Akiki, C., Ferrandis, C. M., Muennighoff, N., Mishra, M., Gu, A., Dey, M., Umapathi, L. K., Anderson, C. J., Zi, Y., Poirier, J. L., Schoelkopf, H., Troshin, S., Abulkhanov, D., Romero, M., Lappert, M., Toni, F. D., del Río, B. G., Liu, Q., Bose, S., Bhattacharyya, U., Zhuo, T. Y., Yu, I., Villegas, P., Zocca, M., Mangrulkar, S., Lansky, D., Nguyen, H., Contractor, D., Villa, L., Li, J., Bahdanau, D., Jernite, Y., Hughes, S., Fried, D., Guha, A., de Vries, H., and von Werra, L. Santacoder: don't reach for the stars!, 2023.

Austin, J., Odena, A., Nye, M. I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C. J., Terry, M., Le, Q. V., and Sutton, C. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.

Bouzenia, I., Devanbu, P., and Pradel, M. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Cambronero, J., Gulwani, S., Le, V., Perelman, D., Radhakrishna, A., Simon, C., and Tiwari, A. Flashfill++: Scaling programming by example by cutting to the chase. Proc. ACM Program. Lang., 7(POPL), jan 2023. doi: 10.1145/3571226. URL https://doi.org/10.1145/3571226.

Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022.

Chaudhary, S. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.

Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug, 2023.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021.

Deng, Y., Xia, C. S., Peng, H., Yang, C., and Zhang, L. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023.

Ding, Y., Wang, Z., Ahmad, W. U., Ding, H., Tan, M., Jain, N., Ramanathan, M. K., Nallapati, R., Bhatia, P., Roth, D., and Xiang, B. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=wgDcbBMSfh.

Feng, Y., Martins, R., Bastani, O., and Dillig, I. Program synthesis using conflict-driven learning. SIGPLAN Not., 53(4):420–435, jun 2018. ISSN 0362-1340. doi: 10.1145/3296979.3192382. URL https://doi.org/10.1145/3296979.3192382.

Fried, D., Aghajanyan, A., Lin, J., Wang, S., Wallace, E., Shi, F., Zhong, R., Yih, S., Zettlemoyer, L., and Lewis, M. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.

Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017. ISSN 2325-1107. doi: 10.1561/2500000010. URL http://dx.doi.org/10.1561/2500000010.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y. K., Luo, F., Xiong, Y., and Liang, W. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps, 2021.
Honovich, O., Scialom, T., Levy, O., and Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.

Hugging Face. Hugging face: The ai community building the future. https://huggingface.co/, 2023. Accessed: 2023-12-01.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. Codesearchnet challenge: Evaluating the state of semantic code search, 2020.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023a.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.

Jiang, N., Liu, K., Lutellier, T., and Tan, L. Impact of code language models on automated program repair, 2023b.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues?, 2023.

Kocetkov, D., Li, R., Allal, L. B., Li, J., Mou, C., Ferrandis, C. M., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., and de Vries, H. The stack: 3 tb of permissively licensed source code, 2022.

Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., tau Yih, S. W., Fried, D., Wang, S., and Yu, T. Ds-1000: A natural and reliable benchmark for data science code generation, 2022.

Lemieux, C., Inala, J. P., Lahiri, S. K., and Sen, S. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 919–931. IEEE, 2023.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, R., Stillerman, J., Patel, S. S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z., Fahmy, N., Bhattacharyya, U., Yu, W., Singh, S., Luccioni, S., Villegas, P., Kunakov, M., Zhdanov, F., Romero, M., Lee, T., Timor, N., Ding, J., Schlesinger, C., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Robinson, J., Anderson, C. J., Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C. M., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. Starcoder: may the source be with you!, 2023.

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D. J., Sutherland Robson, E., Kohli, P., de Freitas, N., Kavukcuoglu, K., and Vinyals, O. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, December 2022. ISSN 1095-9203. doi: 10.1126/science.abq1158. URL http://dx.doi.org/10.1126/science.abq1158.

Liu, J., Peng, J., Wang, Y., and Zhang, L. Neuri: Diversifying dnn generation via inductive rule inference. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pp. 657–669, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400703270. doi: 10.1145/3611643.3616337. URL https://doi.org/10.1145/3611643.3616337.

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=1qvx610Cu7.
Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T. Y., Zheltonozhskii, E., Dade, N. O. O., Yu, W., Krauß, L., Jain, N., Su, Y., He, X., Dey, M., Abati, E., Chai, Y., Muennighoff, N., Tang, X., Oblokulov, M., Akiki, C., Marone, M., Mou, C., Mishra, M., Gu, A., Hui, B., Dao, T., Zebaze, A., Dehaene, O., Patry, N., Xu, C., McAuley, J., Hu, H., Scholak, T., Paquet, S., Robinson, J., Anderson, C. J., Chapados, N., Patwary, M., Tajbakhsh, N., Jernite, Y., Ferrandis, C. M., Zhang, L., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. Starcoder 2 and the stack v2: The next generation, 2024.

Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023a.

Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. Wizardcoder: Empowering code large language models with evol-instruct, 2023b.

Microsoft. Azure openai service models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models, 2023a.

Microsoft. GitHub Copilot – Your AI pair programmer. https://github.com/features/copilot, 2023b.

Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. Octopack: Instruction tuning code large language models, 2023.

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.

Olausson, T. X., Inala, J. P., Wang, C., Gao, J., and Solar-Lezama, A. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=y0GJXRungR.

OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.

OpenAI. Gpt-4 technical report, 2023.

Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., and Synnaeve, G. Code llama: Open foundation models for code, 2023.

Schäfer, M., Nadi, S., Eghbali, A., and Tip, F. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023.

Services, A. W. AI Code Generator - Amazon CodeWhisperer - AWS. https://aws.amazon.com/codewhisperer/, 2023.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost, 2018.

Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. 28(1):11–21, 1972. doi: 10.1108/eb026526. URL https://doi.org/10.1108/eb026526.

Su, H., Shi, W., Kasai, J., Wang, Y., Hu, Y., Ostendorf, M., Yih, W.-t., Smith, N. A., Zettlemoyer, L., and Yu, T. One embedder, any task: Instruction-finetuned text embeddings. 2022. URL https://arxiv.org/abs/2212.09741.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

theblackcat102. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1, 2023.

Wang, X., Dillig, I., and Singh, R. Program synthesis using abstraction refinement. Proc. ACM Program. Lang., 2(POPL), dec 2017. doi: 10.1145/3158151. URL https://doi.org/10.1145/3158151.

Wang, Y., Wang, W., Joty, S., and Hoi, S. C. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https://aclanthology.org/2021.emnlp-main.685.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
Wang, Y., Le, H., Gotmare, A. D., Bui, N. D. Q., Li, J., and Hoi, S. C. H. Codet5+: Open code large language models for code understanding and generation, 2023b.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners, 2022.
Length distribution  We depict the length distribution for both generated problems and solutions in Figure 7. The x-axis represents the number of tokens in each problem/solution, while the y-axis shows the corresponding number of samples.
B. Implementation Details

B.1. Data Generation

We use gpt-3.5-turbo-1106 as the foundation model to perform OSS-INSTRUCT due to its high cost-effectiveness. We randomly extract 1–15 lines from each selected code document from starcoderdata and let gpt-3.5-turbo-1106 imagine a self-contained coding problem and a correct solution. Given the numerous seed code snippets, we perform greedy decoding to maximize the consistency between the generated problems and solutions.
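A minimal sketch of this generation call with the OpenAI Python client is shown below; the prompt string is an abbreviated stand-in for the full template in Appendix A.1, temperature=0 approximates greedy decoding, and the helper name is hypothetical rather than taken from the released scripts.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_problem_and_solution(seed_snippet: str) -> str:
        # Abbreviated stand-in for the full OSS-INSTRUCT prompt in Appendix A.1.
        prompt = ("Gain inspiration from the following random code snippet and create a "
                  "self-contained coding problem plus a correct solution.\n\n" + seed_snippet)
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # greedy decoding for consistency between problem and solution
        )
        return response.choices[0].message.content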
Problem
Your task is to complete the `render` method to generate the rendered shape as a string...
Code
class ShapeRenderer {
  constructor(vertices) {
    this.vertices = vertices;
  }
  render() {
    let renderedShape = "";
    for (let i = 0; i < this.vertices.length; i++) {
      const vertex = this.vertices[i];
      renderedShape += `(${vertex.x}, ${vertex.y})`;
      if (i < this.vertices.length - 1) {
        renderedShape += " - ";
      }
    }
    return renderedShape;
  }
}

Problem
Create a Python program that generates an error file based on a given dataset...
Code
def generate_error_file(dataset_file, ...):
    error_lines = []
    with open(dataset_file, 'r') as file:
        for line in file:
            ...
    with open(error_file_name, 'w') as error_file:
        for error_line in error_lines:
            error_file.write(error_line + '\n')

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: ...")
    else:
        dataset_file = sys.argv[1]
        dataset_number = sys.argv[2]
        generate_error_file(...)

Figure 5: More examples showing how OSS-INSTRUCT generates problems and solutions from seed code snippets. Detailed problem requirements, implementations, and explanations are omitted for brevity.
Figure 7: Length distribution of the generated problems and solutions (legend: problem, solution; x-axis: Number of Tokens; y-axis: #Count (Thousand)).
B.3. Training

We employ CODELLAMA-PYTHON-7B and DeepSeek-Coder-Base 6.7B as the base LLMs. To obtain the Magicoder series, we first finetune the base models on about 75K synthetic data generated through OSS-INSTRUCT using the transformers library from Hugging Face (Hugging Face, 2023). We finetune the base models for 2 epochs using two NVIDIA A100-80GB GPUs through the Distributed Data Parallel (DDP) module from PyTorch. We set the initial learning rate at 5e-5 with 15 warmup steps and a linear scheduler. We use Adafactor (Shazeer & Stern, 2018) as our optimizer and choose a batch size of 512 with a sequence truncation length of 1216. To obtain MagicoderS, we continue to finetune the Magicoder models with the evol-codealpaca-v1 dataset, an open-source Evol-Instruct implementation containing about 110K samples. We use the same hyperparameters except for 15 warmup steps and a 1024 maximum sequence length.
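Expressed with the Hugging Face Trainer, the reported hyperparameters could be wired up roughly as in the sketch below; the per-device batch size and gradient accumulation split is an assumed decomposition of the effective batch size of 512 across 2 GPUs, and the model and train_dataset arguments are expected to be a prepared causal LM and a tokenized instruction-response dataset.

    from transformers import Trainer, TrainingArguments

    def build_trainer(model, train_dataset):
        # Assumed decomposition: 2 GPUs x per-device batch size 4 x gradient
        # accumulation 64 gives an effective batch of 512 sequences.
        args = TrainingArguments(
            output_dir="magicoder-ckpt",
            num_train_epochs=2,
            learning_rate=5e-5,
            warmup_steps=15,
            lr_scheduler_type="linear",
            optim="adafactor",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=64,
            bf16=True,
            logging_steps=10,
        )
        return Trainer(model=model, args=args, train_dataset=train_dataset)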
Meanwhile, the DeepSeek-Coder-based MagicoderS-DS achieves the best result among all the evaluated baselines, substantially outperforming the instruction-tuned DeepSeek-Coder-6.7B-Instruct.

Table 8: Pass@1 results on APPS evaluated using greedy decoding in a zero-shot setting.

Table 9: Pass@1 results on DS-1000 (insertion format) with temperature = 0.2, top_p = 0.5, max_length = 1024, and num_samples = 40.

Table 10: Pass@1 on HumanEval+ when finetuning CODELLAMA-PYTHON-7B with and without noisy samples.
Prompt: Write a function that takes an integer a and returns True if this integer is a cube of some integer number. Note: you should check whether the input is valid.

DeepSeek-Coder-6.7B-Base:
def is_cube(a):
    if a <= 0:
        return False
    x = int(a ** (1/3))
    return x**3 == a

Magicoder-DS-6.7B:
def is_cube(a):
    if not isinstance(a, int) or a < 0:
        return False  # Invalid input

    # Check if the number is a perfect cube
    x = int(a ** (1/3))
    return x**3 == a

Prompt: Implement a Scala function that takes a string as input and returns a new string where all the characters are reversed except the first and last characters. Reply only with the function.

DeepSeek-Coder-6.7B-Base:
def reverseExceptFirstLast(str: String): String = {
  val chars = str.toCharArray
  val first = chars(0)
  val last = chars(chars.length - 1)
  val middle = chars.slice(1, chars.length - 1).reverse
  first + middle + last
}

Magicoder-DS-6.7B:
def reverseExceptFirstLast(input: String): String = {
  if (input.length <= 2) {
    input
  } else {
    val reversedMiddle = input.substring(1, input.length - 1).reverse
    input.charAt(0) + reversedMiddle + input.charAt(input.length - 1)
  }
}
Magicoder-DS-6.7B

Response: Here is a simple snake game in Python using the turtle library. This game is a basic version and does not include many features that a full-fledged snake game would have.
delay = 0.1
# Score
score = 0
high_score = 0
# Snake head
head = turtle.Turtle()
head.speed(0)
head.shape("square")
head.color("black")
head.penup()
head.goto(0, 0)
head.direction = "stop"
# Snake food
food = turtle.Turtle()
food.speed(0)
food.shape("circle")
food.color("red")
food.penup()
food.goto(0, 100)
segments = []
# Pen
pen = turtle.Turtle()
pen.speed(0)
pen.shape("square")
pen.color("white")
pen.penup()
pen.hideturtle()
pen.goto(0, 260)
pen.write("Score: 0 High Score: 0", align="center", font=("Courier", 24, "normal"))
# Functions
def go_up():
    if head.direction != "down":
        head.direction = "up"

def go_down():
    if head.direction != "up":
        head.direction = "down"

def go_left():
    if head.direction != "right":
        head.direction = "left"

def go_right():
    if head.direction != "left":
        head.direction = "right"

def move():
    if head.direction == "up":
        y = head.ycor()
        head.sety(y + 20)
    if head.direction == "down":
        y = head.ycor()
        head.sety(y - 20)
    if head.direction == "left":
        x = head.xcor()
        head.setx(x - 20)
    if head.direction == "right":
        x = head.xcor()
        head.setx(x + 20)
# Keyboard bindings
wn.listen()
wn.onkeypress(go_up, "w")
wn.onkeypress(go_down, "s")
wn.onkeypress(go_left, "a")
wn.onkeypress(go_right, "d")
normal"))
# Add a segment
new_segment = turtle.Turtle()
new_segment.speed(0)
new_segment.shape("square")
new_segment.color("grey")
new_segment.penup()
segments.append(new_segment)
pen.clear()
pen.write("Score: {} High Score: {}".format(score, high_score), align="center", font=("Courier", 24, "
normal"))
move()
time.sleep(delay)
wn.mainloop()
This game is a simple version of the classic snake game. The snake starts with a single segment and grows in size when it
eats food. The game ends when the snake hits the border or its own body. The score is increased by 10 points for each
piece of food eaten. The game speed increases as the snake grows.
while (true) {
System.out.println("Player " + currentPlayer + ", enter your move (row column):");
Scanner scanner = new Scanner(System.in);
int row = scanner.nextInt();
int col = scanner.nextInt();
if (isValidMove(row, col)) {
makeMove(row, col);
printBoard();
if (!hasValidMove()) {
System.out.println("Player " + currentPlayer + " has no valid moves. Game over!");
break;
}
currentPlayer = (currentPlayer == BLACK) ? WHITE : BLACK;
} else {
System.out.println("Invalid move. Try again.");
}
}
}
}
}
return false;
}
private static boolean isValidDirection(int row, int col, int rowDir, int colDir) {
int newRow = row + rowDir;
int newCol = col + colDir;
while (newRow >= 0 && newRow < BOARD_SIZE && newCol >= 0 && newCol < BOARD_SIZE) {
if (board[newRow][newCol] == EMPTY) {
return false;
}
if (board[newRow][newCol] == currentPlayer) {
return true;
}
newRow += rowDir;
newCol += colDir;
}
return false;
}
private static void flipDirection(int row, int col, int rowDir, int colDir) {
int newRow = row + rowDir;
int newCol = col + colDir;
while (newRow >= 0 && newRow < BOARD_SIZE && newCol >= 0 && newCol < BOARD_SIZE) {
if (board[newRow][newCol] == EMPTY) {
return;
}
if (board[newRow][newCol] == currentPlayer) {
while (true) {
newRow -= rowDir;
newCol -= colDir;
if (newRow == row && newCol == col) {
break;
}
board[newRow][newCol] = currentPlayer;
}
return;
}
newRow += rowDir;
newCol += colDir;
}
}
This program starts with an 8x8 board with two pieces in the center of the board, one black and one white. The player who goes first is black. The game ends when there are no more valid moves for either player. The player enters their move as two numbers, the row and the column. If the move is valid, the game makes the move and switches the player. If the move is not valid, the player is asked to enter a valid move.
Prompt: Write a gradio (3.48.0) application for the following use case: Take an input image and return a 45 degree clockwise rotated image. You should also add text description under the output showing the rotation degree.

Response:
import gradio as gr
import numpy as np
from PIL import Image

def rotate_image(img):
    img = Image.fromarray(img)
    rotated_img = img.rotate(-45)
    return np.array(rotated_img), "45 degree clockwise rotation"

demo = gr.Interface(
    fn=rotate_image,
    inputs=gr.inputs.Image(type="numpy"),
    outputs=[gr.outputs.Image(type="numpy"), "text"],
    title="Image Rotation",
    description="Upload an image and get a 45 degree clockwise rotated image",
)

demo.launch()
This script defines a function rotate_image that takes an image as input, rotates it 45 degrees clockwise using the PIL library, and returns the rotated image and a text description. The gr.Interface function is then used to create a Gradio interface for this function.

The gr.inputs.Image(type="numpy") line specifies that the input to the function should be an image, and the gr.outputs.Image(type="numpy") line specifies that the output should be an image. The "text" output type is used to display the text description.

Finally, demo.launch() is used to launch the Gradio interface.
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
batch_size = 100
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

model = ConvNet().to(device)

# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)

if (i+1) % 100 == 0:
    print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
          .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

loss_list.append(loss.item())

print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))
This script first loads the MNIST dataset and then defines a Convolutional Neural Network (CNN) with two convolutional
layers. The CNN is trained for 5 epochs using the Adam optimizer and the Cross Entropy Loss function. After training,
the model is tested on the test dataset and the accuracy is printed. Finally, a chart of the training loss is drawn.
F. Limitations

While Magicoders prove to be effective by being instruction-tuned with OSS-INSTRUCT and Evol-Instruct generated data, we apply our techniques to models with no more than 7B parameters due to resource constraints. This limitation hinders our ability to observe the method's impact on larger models. Moreover, the OSS-INSTRUCT prompt is sophisticated and may require a relatively strong model to comprehend the correct intent. Weaker models may fail to produce plausible instruction data. Future research may explore applying OSS-INSTRUCT to models at different capacities in the context of self-training.