
Data Management For Training Large Language Models: A Survey

Zige Wang1,2∗ Wanjun Zhong2† Yufei Wang2 Qi Zhu2 Fei Mi2 Baojun Wang2
Lifeng Shang2 Xin Jiang2 Qun Liu2
1 School of Computer Science, Peking University
2 Huawei Noah's Ark Lab
[email protected]
{zhongwanjun1, wangyufei44, zhuqi41, mifei2, puking.w}@huawei.com
{Shang.Lifeng, Jiang.Xin, qun.liu}@huawei.com

arXiv:2312.01700v3 [cs.CL] 2 Aug 2024

Abstract

Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly in formulating a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted more and more attention in the research community. This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices.

1 Introduction

Large Language Models (LLMs) have shocked the natural language processing (NLP) community with their strong performance and emergent abilities (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022). According to previous studies (Kaplan et al., 2020; Hoffmann et al., 2022), LLMs' achievements depend heavily on self-supervised pretraining over vast volumes of processed text data. Recent research (Zhou et al., 2023a; Ouyang et al., 2022) further enhances LLMs' instruction-following ability and performance on downstream tasks through Supervised Fine-Tuning (SFT) on deliberately curated instruction datasets.

To construct suitable training datasets, data management is vitally important and challenging in both the pretraining and SFT stages of LLMs, which we define as follows:

Data management: the process of organizing a well-suited training dataset with collected data, including the data selection, combination and utilization strategies, and the evaluation of the chosen strategies.

In the pretraining stage, constructing datasets with high-quality data is essential for efficient training (Jain et al., 2020; Gupta et al., 2021). To equip LLMs with diverse and comprehensive abilities, heterogeneous dataset composition with mixtures of domains is also required (Gao et al., 2020; Longpre et al., 2023b; Shen et al., 2023). However, many prominent LLMs do not disclose (Anil et al., 2023; OpenAI, 2023) or only briefly document (Brown et al., 2020; Le Scao et al., 2023; Touvron et al., 2023a) the techniques used to construct their pretraining datasets, leaving the reasons for and effects of choosing specific data management strategies unclear. In the SFT stage, LLMs' performance and instruction-following abilities are primarily evoked by carefully constructed instruction datasets (Sanh et al., 2022; Ouyang et al., 2022). Although a handful of instruction datasets and benchmarks have been proposed (Wang et al., 2022, 2023c; Taori et al., 2023; Anand et al., 2023), practitioners still find the effects of instruction datasets on the performance of fine-tuned LLMs confusing, leading to difficulties in choosing proper data management strategies in LLM SFT practices. To address the sparsity of existing data, collecting data from multimodal sources (Zhang et al., 2023a; Yang et al., 2023b) and model synthesis (Maini et al., 2024; Li et al., 2024a) rise as new trends.

∗ Work done during Zige Wang's internship at Huawei Noah's Ark Lab.
† Corresponding author ([email protected])
To address these challenges, researchers try to discover and explore the underlying principles of data management. As more and more works are proposed to address different aspects, it is necessary to conduct a systematic discussion that considers the whole picture. This survey aims to provide a comprehensive overview of current research in LLM data management and a guiding resource for practitioners attempting to build powerful LLMs with efficient data management practices.

In Sections 2 and 3, we respectively discuss current research in the pretraining and SFT stages of LLMs, covering multiple aspects of data management such as domain/task composition, data quality, and data quantity, as shown in Figure 3. However, there still lacks a well-established and acknowledged general data management pipeline, and we hope our work can inspire future research to establish and analyze such general pipelines. With the vision that the development of data management should keep pace with that of LLMs' abilities, we present existing challenges and promising future directions in Section 4.

2 Pretraining of LLMs

Data management is found to be important in the pretraining stage of many prominent LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022). In this section, we discuss works exploring data management in the pretraining stage of LLMs, including domain composition, data quantity and data quality, as shown in Figure 1(a). Strategies adopted by prominent pretrained models are listed in Table 1.

2.1 Domain Composition

Publicly available pretraining datasets, like the Pile (Gao et al., 2020), usually contain mixtures of data collected from multiple sources and domains. Many prominent models (Du et al., 2022; Gao et al., 2023; Zhang et al., 2023a) are also trained on a mixture of data from different domains. Figure 2 summarizes the revealed domain mixture ratios in the pretraining datasets of prominent models.

Early pretraining corpora mostly contain data with high diversity (Web and Wiki). With the recent emphasis on data quality and the requirement for advanced abilities, high-quality text (Books and academic text) is integrated. Most recently, with the growing importance of coding LLMs and the essential finding that code-based pretraining can enhance the reasoning capability of LLMs (Liang et al., 2022; Guo et al., 2024), domain data like code and math take up a higher ratio of the total pretraining data. The overall trend is that more and more domains are included to pretrain LLMs with more varied and powerful abilities. The benefits of multi-domain composition are also studied in a recent study (Longpre et al., 2023b).

A proper domain mixture ratio is also important in the pretraining of LLMs. Early attempts usually found the ratio through elaborate experiments and intuition (Gao et al., 2020; Du et al., 2022; Thoppilan et al., 2022). Recently, domain generalization techniques have been leveraged to automatically assign domain weights that form a suitable target distribution, such as importance resampling (Xie et al., 2023b) and Group Distributionally Robust Optimization (Xie et al., 2023a). The contribution of each domain, measured via gradients, is also adopted to reweight domains (Fan et al., 2023). Xia et al. (2023) assign batch-level weights dynamically based on varying losses. Ye et al. (2024) propose data mixing laws to predict model performance with different mixing ratios.
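To make the idea of domain-weighted data composition concrete, the following is a minimal sketch of sampling pretraining batches from several domains according to a fixed weight vector. It is a generic illustration, not any specific published method (such as importance resampling or gradient-based reweighting); the domain names, weights, and toy corpora are placeholders.

```python
import random

# Hypothetical domain weights; in practice these would come from heuristics,
# importance resampling, or gradient-based reweighting as surveyed above.
DOMAIN_WEIGHTS = {"web": 0.55, "wiki": 0.05, "books": 0.15, "code": 0.15, "academic": 0.10}

def sample_batch(domain_pools: dict, batch_size: int = 8) -> list:
    """Draw a pretraining batch whose expected domain proportions follow DOMAIN_WEIGHTS."""
    domains = list(DOMAIN_WEIGHTS)
    weights = [DOMAIN_WEIGHTS[d] for d in domains]
    picked = random.choices(domains, weights=weights, k=batch_size)
    return [random.choice(domain_pools[d]) for d in picked]

# Toy corpora standing in for real domain shards.
pools = {d: [f"{d} document {i}" for i in range(100)] for d in DOMAIN_WEIGHTS}
print(sample_batch(pools))
```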
Although a proper domain composition is broadly acknowledged as beneficial in the pretraining of LLMs, as discussed previously, some empirical analyses arrive at different conclusions and leave open questions for future research. For example, Longpre et al. (2023b) claim that the inclusion of diverse web domains may perform better than specific mixtures on certain tasks. Nijkamp et al. (2023) study programming and natural language mixtures and find that models trained with mixtures do not perform better than, but close to, domain-matched models given the same computing budget.

2.2 Data Quantity

It is recognized that the pretraining of LLMs needs large amounts of data. Scaling laws have been proposed to depict the relationship between data quantity and model size. Repeatedly training on data is also studied due to data exhaustion.

2.2.1 Scaling Laws

Before the popularization of LLMs, the relationship between training dataset size and the performance of Transformer-based language models (Vaswani et al., 2017) had already attracted researchers' attention. Kaplan et al. (2020) find that the language model loss has a power-law relationship with training dataset size or model size, respectively, when not bottlenecked by each other or by the training computing budget.
[Figure 1 overview: (a) the pretraining-stage pipeline covers domain composition (Sec. 2.1), data quantity (Sec. 2.2: scaling laws, data repetition), and quality control (Sec. 2.3: quality filtering via heuristics, classifiers, metric thresholding and clustering; deduplication via n-gram-and-hashing, neural models and semantic clustering; toxicity filtering; and data diversity and age). (b) The supervised fine-tuning pipeline covers task composition (Sec. 3.1), quality control (Sec. 3.2: instruction quality, diversity, complexity), quantity control (Sec. 3.3), and dynamic data-efficient learning (Sec. 3.4).]

Figure 1: Data management pipelines for the pretraining and supervised fine-tuning of Large Language Models.

They further depict the dependence between model size and training dataset size as:

L(N, D) = (N_c / N)^{\alpha_N} + (D_c / D)^{\alpha_D}    (1)

where L is the language model test loss, D is the number of training tokens, N is the number of model parameters, \alpha_N and \alpha_D are the power-law exponents for the scaling of N and D, respectively, and N_c and D_c are constants.¹ Fitting Equation 1, they conclude that model loss decreases predictably as long as model size and training dataset size are scaled up simultaneously, but overfitting occurs if either of them is fixed while the other increases. Given a fixed computing budget C, they analyze the optimal allocation as D_opt ∼ C^{0.27} and N_opt ∼ C^{0.73}, showing that the model size should increase faster than the training dataset size.

¹ The precise numerical values of D_c and N_c depend on vocabulary size and tokenization and do not have fundamental meaning.

Following Kaplan et al. (2020), Hoffmann et al. (2022) conduct experiments on much larger language models and arrive at a new scaling law, usually called the Chinchilla Scaling Law:

L(N, D) = E + A / N^{\alpha} + B / D^{\beta}    (2)

where they empirically fit E = 1.69, A = 406.4, B = 410.7, \alpha = 0.34 and \beta = 0.28. The optimal allocations are analyzed as D_opt ∼ C^{0.54} and N_opt ∼ C^{0.46}. Hence, they draw a different conclusion: model and training dataset sizes should scale at roughly the same rate under a larger computing budget. Su et al. (2024) dig deeper into Kaplan's scaling laws and provide more detailed instructions for fitting the constants.

2.2.2 Data Repetition

While Kaplan et al. (2020) and Hoffmann et al. (2022) both focus on scaling laws with unique data trained for only one epoch, Hernandez et al. (2022) study scaling laws with a small fraction of repeated data in the training dataset and find that text overlap may be harmful to model performance, causing a divergence from Kaplan's scaling law for models larger than 100M parameters.
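As a worked example of Equation 2, the sketch below plugs in the fitted constants quoted above and numerically searches for the loss-minimizing split of a compute budget between parameters and tokens, using the common C ≈ 6ND approximation. The compute budgets and the grid of candidate model sizes are arbitrary illustrative choices.

```python
import numpy as np

# Fitted Chinchilla constants quoted in Section 2.2.1 (Hoffmann et al., 2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N parameters and D training tokens (Eq. 2)."""
    return E + A / N**alpha + B / D**beta

def compute_optimal_allocation(C, n_grid=10_000):
    """Brute-force the loss-minimizing (N, D) split under the usual C ~ 6*N*D budget."""
    N = np.logspace(7, 13, n_grid)   # candidate model sizes: 10M .. 10T parameters
    D = C / (6.0 * N)                # tokens implied by the compute budget
    losses = chinchilla_loss(N, D)
    i = np.argmin(losses)
    return N[i], D[i], losses[i]

if __name__ == "__main__":
    for C in (1e21, 1e23, 1e25):     # hypothetical compute budgets in FLOPs
        N_opt, D_opt, L_opt = compute_optimal_allocation(C)
        print(f"C={C:.0e}: N_opt~{N_opt:.2e} params, D_opt~{D_opt:.2e} tokens, loss~{L_opt:.3f}")
```

Running the scan reproduces the qualitative conclusion in the text: the optimal parameter count and token count grow at roughly the same rate as the budget increases.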
[Figure 2 reports, for each model, the percentage of pretraining data drawn from the Web, Wiki, Books, Dialog, Code, and Academic domains, covering T5 (2019), GPT-3 (2020), GLaM (2021), LaMDA (2022), Chinchilla (2022), BLOOM (2022), PaLM (2022), LLaMA (2023), AlphaCode (2022), and phi-1/1.5/2 (2023).]

Figure 2: The domain composition of prominent Large Language Models.

As models grow larger and larger, data becomes more and more demanding, raising concerns about the exhaustion of high-quality training data (Villalobos et al., 2022; Hoffmann et al., 2022). Addressing these concerns, several works study the consequences of repeatedly pretraining on whole datasets for multiple epochs. A scaling law for repeated training has been proposed to depict the diminishing returns with more repetition and larger model sizes (Muennighoff et al., 2023), and a multi-epoch degradation phenomenon has been observed (Xue et al., 2023). Further analysis finds that dataset size, model parameters, and training objectives are the key factors behind this phenomenon, and that classic regularization techniques may not be helpful, except for dropout (Xue et al., 2023).

There are still positive results in the research on data repetition. Muennighoff et al. (2023) find that repeatedly training on the whole dataset for up to 4 epochs causes only trivial harm to test loss compared to training on unique new data. Instead of simply repeating over the whole dataset, Tirumala et al. (2023) show that repeatedly training on carefully selected data can outperform training on randomly selected new data, suggesting a feasible way of repeating over intelligently selected data.

Recently, pretraining with mixed real and synthesized data has been adopted to meet the data exhaustion challenge (Javaheripi and Bubeck, 2023; Meta, 2024). It is gaining more and more attention and is developing into a new trend of data synthesis.

2.3 Data Quality

In the pretraining of LLMs, quality control techniques for the pretraining datasets usually follow an order (Rae et al., 2021; Nguyen et al., 2023; Tirumala et al., 2023; Gan et al., 2023), namely quality filtering, deduplication and toxicity filtering. Data diversity and age are also explored.

2.3.1 Quality Filtering

Public datasets like Common Crawl² and multilingual datasets (Kreutzer et al., 2022) usually contain low-quality data that hampers the training of LLMs. Hence, existing works usually perform quality filtering using hand-crafted heuristics (Yang et al., 2019; Raffel et al., 2020; Nijkamp et al., 2022), a trained classifier (Brown et al., 2020; Gao et al., 2020; Du et al., 2022; Touvron et al., 2023a; Wettig et al., 2024), metric thresholding (Wenzek et al., 2020; Muennighoff et al., 2023), or combinations of these techniques. Besides instance-level filtering, embedding clustering is also adopted to filter one cluster at a time (Kaddour, 2023).

Despite the reduction in data quantity, quality filtering is usually proven to be beneficial for model performance (Longpre et al., 2023b). Several carefully filtered high-quality datasets have been proposed to train lightweight LLMs with outstanding performance (Gunasekar et al., 2023; Li et al., 2023d; Javaheripi and Bubeck, 2023; Penedo et al., 2023). However, Gao (2021) finds that aggressive filtering might lead to performance degradation on a wide range of tasks for GPT-like LLMs due to the poor representativity of the filtering proxy objectives. To address this issue, Marion et al. (2023) comprehensively examine different data quality estimators and find that pruning datasets based on perplexity performs better than more complicated techniques like memorization. Gan et al. (2023) develop data-centric scaling laws and show that improving semantic and grammatical quality is more effective. However, there still lacks a well-established and theoretically efficient filtering strategy, leaving room for further exploration.

² https://commoncrawl.org/, a large text corpus containing raw web page data, metadata extracts, and text extracts.
2.3.2 Deduplication

Deduplication is a necessary step in many LLMs' pretraining data management procedures and in the preprocessing of many publicly available datasets (Brown et al., 2020; Le Scao et al., 2023; Touvron et al., 2023a; Raffel et al., 2020). Lee et al. (2021) find that deduplication is beneficial for memorization mitigation, train-test overlap avoidance, and training efficiency, while preserving model perplexity. Kandpal et al. (2022) also show that deduplication can considerably lower the success rate of privacy attacks targeting model memorization.

Among deduplication practices, N-gram-and-hashing is the most commonly adopted technique (Lee et al., 2021; Borgeaud et al., 2022; Rae et al., 2021). It can operate at the line level (Touvron et al., 2023a), the document level (Hoffmann et al., 2022; Li et al., 2022b), or combinations of them. Recently, neural models have been experimentally shown to outperform traditional N-gram-and-hashing methods (Silcock et al., 2022). Addressing semantic deduplication, Abbas et al. (2023) propose SemDeDup to remove semantic duplicates that lie close together in the pretrained model's embedding space, applying clustering to reduce the search computation.
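As an illustration of the N-gram-and-hashing family of methods, here is a minimal document-level near-duplicate filter based on hashed word 5-grams and Jaccard similarity. The n-gram size and similarity threshold are arbitrary example values, and production pipelines typically rely on MinHash/LSH signatures to avoid the quadratic pairwise comparison used here.

```python
import hashlib

def ngram_hashes(text: str, n: int = 5) -> set:
    """Hash every n-gram of whitespace tokens; a cheap stand-in for MinHash signatures."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))]
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def dedup(docs: list, threshold: float = 0.8) -> list:
    """Greedy document-level dedup: drop a doc if it is a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = ngram_hashes(doc)
        if all(jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

docs = ["the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumps over the lazy dog today again",
        "a completely different sentence about pretraining data"]
print(len(dedup(docs)))  # -> 2: the second document is dropped as a near-duplicate
```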
2.3.3 Toxicity Filtering

Toxicity refers to text content that is "rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion" (Gehman et al., 2020; Welbl et al., 2021). As raw text corpora usually contain toxic text (Luccioni and Viviano, 2021; Longpre et al., 2023b), toxicity filtering aims to remove undesirable toxic text from the pretraining datasets, further preventing LLMs from generating toxic utterances. Similar to quality filtering, heuristic and rule-based filtering (Lees et al., 2022; Gargee et al., 2022; Friedl, 2023) and N-gram classifiers (Raffel et al., 2020) are usually adopted as toxicity filters.

Although effective for model detoxification, Longpre et al. (2023b) discover that toxicity filtering reduces the risk of toxic generation at the cost of model generalization and toxicity identification ability. Moreover, Xu et al. (2021) and Welbl et al. (2021) find that training dataset detoxification leads to the marginalization of minority groups, such as dialects and minority identity mentions, posing challenges for building unbiased LLMs.

2.3.4 Data Diversity

Some works focus on other aspects of data management in the pretraining stage of LLMs. Lee et al. (2023a) show that the format diversity of publicly available pretraining datasets is high when measured by the Task2Vec diversity coefficient (Miranda et al., 2022). Maharana et al. (2023) propose D2 Pruning to balance data diversity and difficulty in data selection by representing datasets as undirected graphs and adopting a forward-and-reverse message passing strategy to select a subgraph enveloping both diverse and difficult data samples.

2.3.5 Data Age

In current practice, more recent LLMs are usually pretrained on newer data³. Some knowledge learned by pretrained LLMs could also be time-sensitive. Longpre et al. (2023b) study the impact of data age and find that the temporal shift between evaluation and pretraining data leads to inaccurate performance estimation. This temporal misalignment might not be overcome by fine-tuning, especially for larger models.

³ https://platform.openai.com/docs/models

2.4 Relations Among Domain Composition, Data Quantity and Data Quality

Recently, several scaling laws have been proposed to explore the synergistic effects of different aspects on pretrained model performance, such as bivariate model performance prediction regarding data quantity and domain composition ratio (Ge et al., 2024a), the quality-quantity tradeoff under different computing budgets (Goyal et al., 2024), and the positive correlation between data quality and model scale under the same data quantity (Bi et al., 2024). Moreover, Shen et al. (2023) emphasize global deduplication to remove overlaps among different domains. Longpre et al. (2023b) claim that domains with high quality and diversity are more beneficial than other domains.
3 Supervised Fine-Tuning of LLMs

Based on the general knowledge and capabilities learned in the pretraining stage, supervised fine-tuning (SFT) is proposed to further improve LLMs with instruction-following ability and alignment with human expectations (Wei et al., 2021; Sanh et al., 2022; Ouyang et al., 2022). Although LLMs fine-tuned with existing instruction datasets have achieved remarkable performance on various NLP tasks, the impacts of instruction data management on fine-tuned models are still under debate. The data management process in the SFT stage can be summarized as illustrated in Figure 1(b), including task composition, data quality control, data quantity control and dynamic data-efficient learning. Table 2 summarizes the data management practices of prominent fine-tuned LLMs.

3.1 Task Composition

Since LLMs have shown surprisingly emergent abilities in handling various NLP tasks, multitask fine-tuning appears promising for improving LLMs' generalization performance on unseen tasks. The benefits of increasing the number of tasks in SFT have been experimentally proven on models with sizes ranging from 3B to 540B parameters (Wang et al., 2022; Sanh et al., 2022; Wei et al., 2021; Chung et al., 2022). With the scaling of tasks, the mixture ratio of data targeting different tasks is also found to be critical and is usually decided by experiments and intuition (Iyer et al., 2022; Longpre et al., 2023a). To enable LLMs to solve targeted tasks with specific skills, representation similarity (Ivison et al., 2023; Lee et al., 2024) and gradient similarity (Xia et al., 2024) are proposed to select relevant multitask subsets.

However, conflicts might exist among the many tasks. Dong et al. (2023) focus on task composition among mathematical reasoning, code generation, and general human-aligning abilities. They find that model abilities are improved when the amount of mixed data is small but decreased otherwise. The negative impact of mixing large amounts of data might lie in the degree of similarity of data format and data distribution among different SFT tasks. Wang et al. (2023b) also experimentally show that different instruction datasets may correspond to different specific abilities, and that winning across all evaluations with a single dataset or combination seems challenging.

Diverging from composing multiple tasks, some works claim that an integration of LLMs tuned on single-task data can outperform one LLM tuned on multiple tasks (Jang et al., 2023; Chen et al., 2023b). But fine-tuning more task-specific LLMs also means more resource consumption. How to efficiently equip LLMs with the ability to solve multiple tasks still demands more exploration.

3.2 Data Quality

Data quality is always a focal point in the SFT of LLMs, covering instruction quality, diversity, and complexity. Here, we focus more on managing and analyzing existing instruction data than on the instruction generation methods discussed in previous surveys (Zhang et al., 2023b; Wang et al., 2023e).

3.2.1 Instruction Quality

Many researchers have found that the quality of instruction data is one of the most important factors in improving model performance (Chia et al., 2023; Zhou et al., 2023a; Ding et al., 2023). During the construction of an instruction dataset, there is usually a filtering step to select high-quality instructions generated by models.

Heuristic- and model-based natural language indicators like perplexity and uncertainty are commonly adopted filtering criteria (Wang et al., 2023d; Cao et al., 2023; Bhatt et al., 2024). Moreover, losses (Zhou et al., 2023b; Li et al., 2023b, 2024b) and output probabilities (Li et al., 2023a,e; Chen and Mueller, 2024; He et al., 2024b; Liu et al., 2024) of LLMs are used to compute more complex scores for data selection. Popular searching approaches like BlendSearch (Wang et al., 2020) are also leveraged to find high-quality instructions satisfying the criteria (Cao et al., 2023).

In addition, LLMs are also queried to directly evaluate the quality of instructions. Fine-tuned LLMs are prompted to assign quality scores (Li et al., 2023c) or provide self-feedback (Lu et al., 2023a; Madaan et al., 2023) on their own responses to iteratively improve model predictions. Strong LLMs like ChatGPT (Ye et al., 2023; Chen et al., 2023c; Li et al., 2023a) or reward models (Du et al., 2023) are also adopted as quality judges during instruction data filtering. Recently, a weak-to-strong strategy has been introduced to select high-quality data with smaller and weaker models (Li et al., 2024c; Yang et al., 2024; Mekala et al., 2024).
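The LLM-as-quality-judge idea described above can be sketched as follows. The prompt wording, the judge model name, and the score threshold are illustrative assumptions rather than settings from any specific surveyed paper, and the example assumes an OpenAI API key is available in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Rate the following instruction-response pair for helpfulness and correctness "
    "on a scale from 1 to 10. Reply with a single integer.\n\n"
    "Instruction: {instruction}\n\nResponse: {response}"
)

def judge_score(instruction: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Ask a strong LLM to grade one instruction-response pair."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction, response=response)}],
        temperature=0,
    )
    # Assumes the judge follows the prompt and returns a bare integer.
    return int(completion.choices[0].message.content.strip())

def filter_pairs(pairs, min_score: int = 7):
    """Keep only pairs the judge rates at or above `min_score` (an illustrative threshold)."""
    return [p for p in pairs if judge_score(p["instruction"], p["output"]) >= min_score]
```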
3.2.2 Instruction Diversity

The intention and semantic diversity of instructions is another important factor with positive effects on model performance and robustness (Zhou et al., 2023a; Ding et al., 2023; Taori et al., 2023; Bukharin and Zhao, 2023). However, there is no well-acknowledged measurement to quantitatively indicate the diversity of an instruction dataset. #InsTag (Lu et al., 2023b) proposes to measure instruction diversity using fine-grained tags generated by ChatGPT⁴. Specifically, it quantifies instruction diversity as the unique tag coverage rate over the overall tag set.

To maintain both diversity and data efficiency in instruction datasets, Rouge-L similarity (Wang et al., 2023c), embedding distance (Wu et al., 2023; Bukharin and Zhao, 2023; Huang et al., 2024) and scoring models (Ge et al., 2024b) are proposed to select instructions that differ from each other at the literal, semantic and human-aligning levels.

Due to data constraints, diversity can be challenging in some domain-specific tasks. Thus, Wan et al. (2023) propose to enlarge data coverage by actively searching variations and possibilities of instructions using LLMs.

⁴ https://chatgpt.openai.com/

3.2.3 Instruction Complexity

Instruction complexity is found to be crucial for developing LLMs with complex instruction-following and reasoning abilities (Xu et al., 2023a; Luo et al., 2023b; Mukherjee et al., 2023; He et al., 2024a). Several works endeavor to quantify and evaluate instruction complexity. Using the aforementioned tags, #InsTag (Lu et al., 2023b) quantifies complexity as the average number of tags assigned to each query in a dataset. He et al. (2023) evaluate complex instructions with eight features addressing the length, contents, and formats of input texts and task descriptions.

It has also been empirically shown that complexity enhancement is necessary for performance improvement (Zhao et al., 2023b). To increase instruction complexity in SFT datasets, some works propose to incrementally augment existing instructions by adding nodes to a semantic tree (Zhao et al., 2023b) or by performing operations such as increasing reasoning, adding constraints, in-breadth evolving, deepening, and so on (Xu et al., 2023a; Luo et al., 2023b; Jiang et al., 2023b; Sun et al., 2024a).

3.3 Data Quantity

Unlike the acknowledged scaling laws for pretraining data, explorations of the relationship between instruction data quantity and fine-tuned model performance diverge in two directions. In the earlier stage, researchers followed the observations from LLM pretraining and argued that scaling up the instruction data quantity is crucial for success (Wei et al., 2021; Sanh et al., 2022). Recently, more works claim that data quality is more important than data quantity in the SFT of LLMs and propose to scale down instruction datasets to limited high-quality data (Zhou et al., 2023a; Chen et al., 2023b). However, Zhang et al. (2024) propose a power-based multiplicative joint scaling law, showing that increased fine-tuning data could lead to improved model performance even after good results have been achieved with limited data.

Addressing this conflict, several works attempt to analyze the scaling patterns for different tasks or different model abilities. A consensus of these works is that different abilities have different scaling patterns and develop at different paces. Dong et al. (2023) find that general ability can be enhanced with about 1,000 samples and improves slowly afterwards, while mathematical reasoning and code generation improve consistently with increasing amounts of instruction data. Similarly, Yuan et al. (2023) observe a log-linear relation between instruction data amount and models' mathematical reasoning performance, but stronger pretrained models improve less with more instruction data. Surprisingly, the empirical study of Ji et al. (2023) on 12 major real-world online use cases arrives at exactly the opposite conclusion. Song et al. (2023) also show that some abilities follow completely different patterns from others.

3.4 Dynamic Data-Efficient Learning

While the works discussed above focus on the static management of instruction datasets without interaction with model fine-tuning, some works combine data selection with model fine-tuning, achieving data-efficient learning in a dynamic way.

Training affects data. Some works propose to dynamically change the datasets along with the fine-tuning process. Attendu and Corbeil (2023) propose a dynamic data pruning method that periodically filters out unimportant examples during SFT using extended versions of the EL2N metric (Paul et al., 2021; Fayyaz et al., 2022). AlShikh et al. (2023) predict whether responses are "answer-like or not" with a binary classifier, in order to measure LLMs' instruction-following ability and serve as an early-stopping criterion. Kung et al. (2023) conduct active task searching to select informative tasks based on prompt uncertainty and fine-tune in a loop.
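To make the EL2N-based pruning idea above concrete, here is a minimal sketch of the base EL2N score (the L2 norm of the error between softmax outputs and one-hot labels) together with a simple keep-the-hardest pruning step. It is a generic illustration under these assumptions, not the exact extension used by Attendu and Corbeil (2023); the keep ratio and the toy inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def el2n_scores(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """EL2N score per example: L2 distance between the softmax output and the one-hot label."""
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()
    return (probs - one_hot).norm(dim=-1)

def keep_hardest(scores: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Indices of the highest-scoring (hardest) examples; the ratio is an illustrative choice."""
    k = max(int(len(scores) * keep_ratio), 1)
    return scores.topk(k).indices

# Toy demonstration with random "model outputs" for 8 examples and 5 classes.
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(keep_hardest(el2n_scores(logits, labels)))
```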
Data affects training. Instead of manipulating instruction datasets, some works propose special training strategies to accommodate the datasets. To mitigate forgetting and negative task impact, Yin et al. (2023a) and Wang et al. (2024) treat task selection as a replay strategy in continual learning scenarios; DMT (Dong et al., 2023) learns specialized and general abilities sequentially while keeping a small proportion of specialized data. To efficiently learn mixed-quality data acquired from LLMs with different levels of ability, OpenChat (Wang et al., 2023a) proposes the C-RLFT strategy, which treats different data sources as coarse-grained reward labels; Xu et al. (2023b), Sun et al. (2024a) and Kim and Lee (2024) propose to make the model progressively learn from easy to hard, respectively regarding data quality, instruction complexity and task hardness.

3.5 Relations Among Task Composition, Data Quality and Data Quantity

As in the pretraining stage, different aspects of supervised fine-tuning data management can affect model performance jointly. Lu et al. (2023b) analyze popular open-set SFT datasets using #InsTag and show that larger datasets tend to be more diverse and induce higher performance. Current research on data selection tends to consider instruction quality and diversity jointly (Bukharin and Zhao, 2023; Xu et al., 2023c). Since different model abilities have different scaling patterns, as discussed in Section 3.3, more efficient task composition strategies are required to build stronger multi-task LLMs.

In summary, we provide a list of takeaways in Appendix A. Some other aspects of data management are discussed in Appendix B.

4 Challenges and Future Directions

The exploration of data management and its impact on LLM pretraining and SFT is still an ongoing task. In this section, we point out several challenges and corresponding future directions in the research of training data management for LLMs.

General data management framework. Although data management systems have been proposed to compose various data recipes in either the pretraining or the SFT stage of LLMs (Chen et al., 2023a; Zhou et al., 2023c; Sun et al., 2024b), practitioners still need to spend effort on organizing suitable datasets. A well-established general data management framework suitable for a broad range of applications is an urgent and worthy future direction for developing and promoting LLMs. Beyond that, a more autonomous data management system is also needed to greatly save human effort. To build such systems, LLMs might be leveraged to serve in different roles, such as quality examiner, data augmentor, and so on.

Data debiasing and detoxifying. Current pretraining corpora and instruction datasets might contain harmful information and social biases, which lead to negative social impacts and undesirable model behavior. As the application of LLMs keeps extending to more demanding fields, the fairness and harmlessness of LLMs will become more and more non-negligible. Hence, as one way to build ideal LLMs without biases and harmful output, debiasing and detoxifying pretraining and instruction data is an important research direction.

Multimodal data management. Current research in data management mostly focuses on natural language processing. With the application of LLMs extending to modalities like vision and audio, it is necessary to study the impact of multimodal data management on the performance of fine-tuned multimodal LLMs.

Data management for LLM self-exploration. The ability to actively explore unknown environments and tasks is one of the future perspectives in LLM development. Learning from large-scale interaction data requires an efficient data management system to construct suitable datasets.

Efficient filtering for synthesized data. As data annotation requires intensive human labor and existing data will be exhausted, automatically synthesizing new data with LLMs has been proposed as a promising solution (Maini et al., 2024; Li et al., 2024a). In this process, efficient filtering of synthesized data is required to ensure its quality.

Fine-grained data ordering. Some works have started to pay attention to the ordering of data in both the pretraining (Gan et al., 2023; Guo et al., 2024) and SFT stages (Xu et al., 2023b; Yin et al., 2023a). It has been shown that more fine-grained data ordering can benefit model performance.

Conflicted data separation. In multi-task fine-tuning, the negative impact of mixing data is observed and attributed to conflicts among data from different tasks (Dong et al., 2023). Thus, separating and effectively learning from conflicting data samples is a challenging problem in multi-task learning.
5 Conclusions

This paper overviews the training data management of LLMs. We discuss the pretraining and supervised fine-tuning stages of LLMs successively and summarize up-to-date research efforts according to the data management process of each stage. Finally, we highlight several challenges and future directions for LLM training data management. We hope this survey can provide insightful guidance for practitioners and inspire further research in efficient training data management for the development of LLMs.

References

Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. 2023. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540.

Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, and Melisa Russak. 2023. Becoming self-instruct: Introducing early stopping criteria for minimal instruct tuning. arXiv preprint arXiv:2307.03692.

Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Jean-michel Attendu and Jean-philippe Corbeil. 2023. NLU on data diets: Dynamic data subset selection for NLP classification tasks. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 129–146, Toronto, Canada (Hybrid). Association for Computational Linguistics.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Gantavya Bhatt, Yifang Chen, Arnav M. Das, Jifan Zhang, Sang T. Truong, Stephen Mussmann, Yinglun Zhu, Jeff Bilmes, Simon Shaolei Du, Kevin Jamieson, Jordan T. Ash, and Robert Nowak. 2024. An experimental design framework for label-efficient supervised finetuning of large language models. ArXiv, abs/2401.06692.

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Alexander W. Bukharin and Tuo Zhao. 2023. Data diversity matters for robust instruction tuning. ArXiv, abs/2311.14736.

Yihan Cao, Yanbin Kang, and Lichao Sun. 2023. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290.

Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2023a. Data-Juicer: A one-stop data processing system for large language models. arXiv preprint arXiv:2309.02033.

Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023b. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246.

Jiuhai Chen and Jonas Mueller. 2024. Automated data curation for robust language model fine-tuning. arXiv preprint arXiv:2403.12776.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023c. AlpaGasus: Training a better Alpaca with fewer data. arXiv preprint arXiv:2307.08701.

Yew Ken Chia, Pengfei Hong, Lidong Bing, and Soujanya Poria. 2023. InstructEval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Paul Friedl. 2023. Dis/similarities in the design and
Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, development of legal and algorithmic normative sys-
Matei Zaharia, and Reynold Xin. 2023. Free dolly: tems: the case of perspective api. Law, Innovation
Introducing the world’s first truly open instruction- and Technology, 15(1):25–59.
tuned llm.
Ruyi Gan, Ziwei Wu, Renliang Sun, Junyu Lu, Xiao-
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi jun Wu, Dixiang Zhang, Kunhao Pan, Ping Yang,
Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, Qi Yang, Jiaxing Zhang, et al. 2023. Ziya2: Data-
and Bowen Zhou. 2023. Enhancing chat language centric learning is all llms need. arXiv preprint
models by scaling high-quality instructional conver- arXiv:2311.03301.
sations. arXiv preprint arXiv:2305.14233.
Leo Gao. 2021. An empirical exploration in quality fil-
Jesse Dodge, Maarten Sap, Ana Marasović, William
tering of text data. arXiv preprint arXiv:2109.00698.
Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret
Mitchell, and Matt Gardner. 2021. Documenting Leo Gao, Stella Biderman, Sid Black, Laurence Gold-
large webtext corpora: A case study on the colos- ing, Travis Hoppe, Charles Foster, Jason Phang, Ho-
sal clean crawled corpus. In Proceedings of the race He, Anish Thite, Noa Nabeshima, et al. 2020.
2021 Conference on Empirical Methods in Natural The pile: An 800gb dataset of diverse text for lan-
Language Processing, EMNLP 2021, Virtual Event guage modeling. arXiv preprint arXiv:2101.00027.
/ Punta Cana, Dominican Republic, 7-11 November,
2021, pages 1286–1305. Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie
Guanting Dong, Hongyi Yuan, Keming Lu, Cheng- Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui
peng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, He, Xiangyu Yue, et al. 2023. Llama-adapter v2:
Zheng Yuan, Chang Zhou, and Jingren Zhou. 2023. Parameter-efficient visual instruction model. arXiv
How abilities in large language models are affected preprint arXiv:2304.15010.
by supervised fine-tuning data composition. arXiv
preprint arXiv:2310.05492. SK Gargee, Pranav Bhargav Gopinath, Shridhar
Reddy SR Kancharla, CR Anand, and Anoop S Babu.
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, 2022. Analyzing and addressing the difference in
Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, toxicity prediction between different comments with
Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. same semantic meaning in google’s perspective api.
Glam: Efficient scaling of language models with In ICT Systems and Sustainability: Proceedings of
mixture-of-experts. In International Conference on ICT4SD 2022, pages 455–464. Springer.
Machine Learning, pages 5547–5569. PMLR.
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and
Qianlong Du, Chengqing Zong, and Jiajun Zhang. 2023. Bolin Ding. 2024a. Data mixing made efficient: A
Mods: Model-oriented data selection for instruction bivariate scaling law for language model pretraining.
tuning. ArXiv, abs/2311.15653. arXiv preprint arXiv:2405.14908.
Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin
Siva Reddy. 2022. On the origin of hallucinations Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao
in conversational models: Is it the datasets or the Yang, and Tong Xiao. 2024b. Clustering and ranking:
models? In Proceedings of the 2022 Conference of Diversity-preserved instruction selection through
the North American Chapter of the Association for expert-aligned quality estimation. arXiv preprint
Computational Linguistics: Human Language Tech- arXiv:2402.18191.
nologies, pages 5271–5285, Seattle, United States.
Association for Computational Linguistics. Samuel Gehman, Suchin Gururangan, Maarten Sap,
Simin Fan, Matteo Pagliardini, and Martin Jaggi. 2023. Yejin Choi, and Noah A Smith. 2020. Realtoxici-
Doge: Domain reweighting with generalization esti- typrompts: Evaluating neural toxic degeneration in
mation. arXiv preprint arXiv:2310.15393. language models. arXiv preprint arXiv:2009.11462.

Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Mo- Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith,
hammad Taher Pilehvar, Yadollah Yaghoobzadeh, and Luke Zettlemoyer. 2022. Demystifying prompts
and Samira Ebrahimi Kahou. 2022. Bert on a data in language models via perplexity estimation. arXiv
diet: Finding important examples by gradient-based preprint arXiv:2212.04037.
pruning. arXiv preprint arXiv:2211.05610.
Sachin Goyal, Pratyush Maini, Zachary C Lipton, Aditi
Shangbin Feng, Chan Young Park, Yuhan Liu, and Yu- Raghunathan, and J Zico Kolter. 2024. Scaling laws
lia Tsvetkov. 2023. From pretraining data to lan- for data filtering–data curation cannot be compute
guage models to downstream tasks: Tracking the agnostic. arXiv preprint arXiv:2404.07177.
trails of political biases leading to unfair nlp models.
In Proceedings of the 61st Annual Meeting of the Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang
Association for Computational Linguistics (Volume Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and
1: Long Papers), ACL 2023, Toronto, Canada, July Dawn Song. 2023. The false promise of imitating
9-14, 2023, pages 11737–11762. proprietary llms. arXiv preprint arXiv:2305.15717.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Hamish Ivison, Noah A. Smith, Hannaneh Hajishirzi,
César Teodoro Mendes, Allie Del Giorno, Sivakanth and Pradeep Dasigi. 2023. Data-efficient finetuning
Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo using cross-task nearest neighbors. In Findings of
de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all the Association for Computational Linguistics: ACL
you need. arXiv preprint arXiv:2306.11644. 2023, pages 9036–9061, Toronto, Canada. Associa-
tion for Computational Linguistics.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai
Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru,
Y Wu, YK Li, et al. 2024. Deepseek-coder: When the Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shus-
large language model meets programming–the rise of ter, Tianlu Wang, Qing Liu, Punit Singh Koura, et al.
code intelligence. arXiv preprint arXiv:2401.14196. 2022. Opt-iml: Scaling language model instruc-
tion meta learning through the lens of generalization.
Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi
arXiv preprint arXiv:2212.12017.
Masuda, Naveen Panwar, Sambaran Bandyopadhyay,
Sameep Mehta, Shanmukha Guttula, Shazia Afzal, Abhinav Jain, Hima Patel, Lokesh Nagalapatti,
Ruhi Sharma Mittal, et al. 2021. Data quality for Nitin Gupta, Sameep Mehta, Shanmukha Guttula,
machine learning tasks. In Proceedings of the 27th Shashank Mujumdar, Shazia Afzal, Ruhi Sharma Mit-
ACM SIGKDD conference on knowledge discovery tal, and Vitobha Munigala. 2020. Overview and im-
& data mining, pages 4040–4041. portance of data quality for machine learning tasks.
Suchin Gururangan, Dallas Card, Sarah Dreier, Emily In Proceedings of the 26th ACM SIGKDD interna-
Gade, Leroy Wang, Zeyu Wang, Luke Zettlemoyer, tional conference on knowledge discovery & data
and Noah A. Smith. 2022. Whose language counts mining, pages 3561–3562.
as high quality? measuring language ideologies in Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung
text data selection. In Proceedings of the 2022 Con- Kim, Lajanugen Logeswaran, Moontae Lee, Kyung-
ference on Empirical Methods in Natural Language jae Lee, and Minjoon Seo. 2023. Exploring the bene-
Processing, pages 2562–2580, Abu Dhabi, United fits of training expert language models over instruc-
Arab Emirates. Association for Computational Lin- tion tuning. In International Conference on Machine
guistics. Learning, ICML 2023, 23-29 July 2023, Honolulu,
Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, and Hawaii, USA, volume 202 of Proceedings of Machine
Yanghua Xiao. 2024a. From complex to simple: En- Learning Research, pages 14702–14729. PMLR.
hancing multi-constraint complex instruction follow-
Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-
ing ability of large language models. arXiv preprint
2: The surprising power of small language models.
arXiv:2404.15846.
Blog post.
Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin
Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xin- Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang
tao Wang, Yuncheng Huang, et al. 2023. Can large Niu, Lei Zhang, Baochang Ma, and Xiangang Li.
language models understand real-world complex in- 2023. Exploring the impact of instruction data
structions? arXiv preprint arXiv:2309.09150. scaling on large language models: An empirical
study on real-world use cases. arXiv preprint
Yexiao He, Ziyao Wang, Zheyu Shen, Guoheng Sun, arXiv:2303.14742.
Yucong Dai, Yongkai Wu, Hongyi Wang, and Ang
Li. 2024b. Shed: Shapley-based automated dataset AQ Jiang, A Sablayrolles, A Mensch, C Bamford,
refinement for instruction fine-tuning. arXiv preprint DS Chaplot, D de las Casas, F Bressand, G Lengyel,
arXiv:2405.00705. G Lample, L Saulnier, et al. 2023a. Mistral 7b (2023).
arXiv preprint arXiv:2310.06825.
Danny Hernandez, Tom Brown, Tom Conerly, Nova
DasSarma, Dawn Drain, Sheer El-Showk, Nelson Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun
Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin
Hume, et al. 2022. Scaling laws and interpretabil- Jiang, Qun Liu, and Wei Wang. 2023b. Follow-
ity of learning from repeated data. arXiv preprint bench: A multi-level fine-grained constraints follow-
arXiv:2205.10487. ing benchmark for large language models. arXiv
preprint arXiv:2310.20410.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,
Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Jean Kaddour. 2023. The minipile challenge for
Diego de Las Casas, Lisa Anne Hendricks, Johannes data-efficient language models. arXiv preprint
Welbl, Aidan Clark, et al. 2022. An empirical analy- arXiv:2304.08442.
sis of compute-optimal large language model training.
Advances in Neural Information Processing Systems, Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022.
35:30016–30030. Deduplicating training data mitigates privacy risks
in language models. In International Conference on
Hui Huang, Bing Xu, Xinnian Liang, Kehai Chen, Machine Learning, pages 10697–10707. PMLR.
Muyun Yang, Tiejun Zhao, and Conghui Zhu. 2024.
Multi-view fusion for instruction mining of large lan- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B
guage model. Information Fusion, page 102480. Brown, Benjamin Chess, Rewon Child, Scott Gray,
Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black,
Scaling laws for neural language models. arXiv and Yulia Tsvetkov. 2019. Measuring bias in con-
preprint arXiv:2001.08361. textualized word representations. arXiv preprint
arXiv:1906.07337.
Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui
Qin, Kyle Richardson, Sean Welleck, Hannaneh Ha- Teven Le Scao, Angela Fan, Christopher Akiki, El-
jishirzi, Tushar Khot, Ashish Sabharwal, Sameer lie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Singh, and Yejin Choi. 2022. Prompt wayward- Castagné, Alexandra Sasha Luccioni, François Yvon,
ness: The curious case of discretized interpretation Matthias Gallé, et al. 2023. Bloom: A 176b-
of continuous prompts. In Proceedings of the 2022 parameter open-access multilingual language model.
Conference of the North American Chapter of the hal-03850124.
Association for Computational Linguistics: Human
Language Technologies, pages 3631–3643, Seattle, Alycia Lee, Brando Miranda, and Sanmi Koyejo. 2023a.
United States. Association for Computational Lin- Beyond scale: the diversity coefficient as a data qual-
guistics. ity metric demonstrates llms are pre-trained on for-
mally diverse data. arXiv preprint arXiv:2306.13840.
Jisu Kim and Juhwan Lee. 2024. Strategic data ordering: Enhancing large language model performance through curriculum learning. arXiv preprint arXiv:2405.07490.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. 2023. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327.

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.

Po-Nien Kung and Nanyun Peng. 2023. Do models really learn to follow instructions? An empirical study of instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1317–1328, Toronto, Canada. Association for Computational Linguistics.

Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, and Nanyun Peng. 2023. Active instruction tuning: Improving cross-task generalization by training on prompt sensitive tasks. ArXiv, abs/2311.00288.

Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. 2023b. Platypus: Quick, cheap, and powerful refinement of llms. arXiv preprint arXiv:2308.07317.

Changho Lee, Janghoon Han, Seonghyeon Ye, Stanley Jungkyu Choi, Honglak Lee, and Kyunghoon Bae. 2024. Instruction matters, a simple yet effective task selection approach in instruction tuning for specific tasks. arXiv preprint arXiv:2404.16418.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8424–8445.

Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. 2022. A new generation of perspective api: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3197–3207.

Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, et al. 2024a. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064.

Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, and Furu Wei. 2023a. Tuna: Instruction tuning using feedback from large language models. In Conference on Empirical Methods in Natural Language Processing.

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, and Tianyi Zhou. 2024b. Selective reflection-tuning: Student-selected data recycling for llm instruction-tuning. ArXiv, abs/2402.10110.

Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. 2024c. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. ArXiv, abs/2402.00530.

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023b. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. ArXiv, abs/2308.12032.

Shaobo Li, Xiaoguang Li, Lifeng Shang, Zhenhua Dong, Cheng-Jie Sun, Bingquan Liu, Zhenzhou Ji, Xin Jiang, and Qun Liu. 2022a. How pre-trained language models capture factual knowledge? A causal-inspired analysis. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1720–1732.

Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023c. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023d. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022b. Competition-level code generation with alphacode. Science, 378(6624):1092–1097.

Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu, Tongliang Liu, Fei Huang, and Yongbin Li. 2023e. One shot learning as instruction data prospector for large language models. ArXiv, abs/2312.10302.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Shihao Liang, Kunlun Zhu, Runchu Tian, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, and Maosong Sun. 2023. Exploring format consistency for instruction tuning. arXiv preprint arXiv:2307.15504.

Liangxin Liu, Xuebo Liu, Derek F Wong, Dongfang Li, Ziyi Wang, Baotian Hu, and Min Zhang. 2024. Selectit: Selective instruction tuning for large language models via uncertainty-aware self-reflection. arXiv preprint arXiv:2402.16705.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023a. The flan collection: Designing data and methods for effective instruction tuning. In ICML.

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, et al. 2023b. A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity. arXiv preprint arXiv:2305.13169.

Jianqiao Lu, Wanjun Zhong, Wenyong Huang, Yufei Wang, Fei Mi, Baojun Wang, Weichao Wang, Lifeng Shang, and Qun Liu. 2023a. Self: Language-driven self-evolution for large language model. arXiv preprint arXiv:2310.00533.

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023b. #instag: Instruction tagging for analyzing supervised fine-tuning of large language models.

Alexandra Sasha Luccioni and Joseph D Viviano. 2021. What's in the box? A preliminary analysis of undesirable content in the common crawl corpus. arXiv preprint arXiv:2105.02732.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023a. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023b. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Adyasha Maharana, Prateek Yadav, and Mohit Bansal. 2023. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931.

Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. 2024. Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380.

Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. 2023. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564.

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. arXiv preprint arXiv:2305.14552.

Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. 2022. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1878–1898.
Dheeraj Mekala, Alex Nguyen, and Jingbo Shang. 2024. Smaller language models are capable of selecting instruction-tuning training data for larger language models. arXiv preprint arXiv:2402.10430.

Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date.

Brando Miranda, Patrick Yu, Yu-Xiong Wang, and Sanmi Koyejo. 2022. The curse of low task diversity: On the failure of transfer learning to outperform maml and their empirical equivalence. arXiv preprint arXiv:2208.01545.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022. Reframing instructional prompts to GPTk's language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612, Dublin, Ireland. Association for Computational Linguistics.

Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. Crows-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400.

Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations.

OpenAI. 2023. Gpt-4 technical report.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. 2023. Slimpajama-dc: Understanding data combinations for llm training. arXiv preprint arXiv:2309.10818.

Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, and Melissa Dell. 2022. Noise-robust de-duplication at scale. In The Eleventh International Conference on Learning Representations.

Chiyu Song, Zhanchao Zhou, Jianhao Yan, Yuejiao Fei, Zhenzhong Lan, and Yue Zhang. 2023. Dynamics of instruction tuning: Each ability of large language models has its own growth pace. arXiv preprint arXiv:2310.19651.

Hui Su, Zhi Tian, Xiaoyu Shen, and Xunliang Cai. 2024. Unraveling the mystery of scaling laws: Part i. arXiv preprint arXiv:2403.06563.

Haoran Sun, Lixin Liu, Junjie Li, Fengyu Wang, Baohua Dong, Ran Lin, and Ruohui Huang. 2024a. Conifer: Improving complex constrained instruction-following ability of large language models. arXiv preprint arXiv:2404.02823.
Yiding Sun, Feng Wang, Yutao Zhu, Wayne Xin Zhao, and Jiaxin Mao. 2024b. An integrated data processing framework for pretraining foundation models. arXiv preprint arXiv:2402.16358.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S Morcos. 2023. D4: Improving llm pretraining via document de-duplication and diversification. arXiv preprint arXiv:2308.12284.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. 2022. Will we run out of data? An analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325.

Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, and Shuming Shi. 2023. Explore-instruct: Enhancing domain-specific instruction coverage through active exploration. arXiv preprint arXiv:2310.09168.

Chi Wang, Qingyun Wu, Silu Huang, and Amin Saied. 2020. Economic hyperparameter optimization with blended search strategy. In International Conference on Learning Representations.

Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2023a. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.

Yifan Wang, Yafei Liu, Chufan Shi, Haoling Li, Chen Chen, Haonan Lu, and Yujiu Yang. 2024. Inscl: A data-efficient continual learning paradigm for fine-tuning large language models with instructions. arXiv preprint arXiv:2403.11435.

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. 2023b. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023c. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. 2022. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109.

Yue Wang, Xinrui Wang, Juntao Li, Jinxiong Chang, Qishen Zhang, Zhongyi Liu, Guannan Zhang, and Min Zhang. 2023d. Harnessing the power of david against goliath: Exploring instruction data generation without using closed-source models. arXiv preprint arXiv:2308.12711.

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023e. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.

Lucas Weber, Elia Bruni, and Dieuwke Hupkes. 2023. Mind the instructions: A holistic evaluation of consistency and interactions in prompt-based learning. arXiv preprint arXiv:2310.13486.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research.

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pages 2447–2469.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Édouard Grave. 2020. Ccnet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012.

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. Qurating: Selecting high-quality data for training language models. arXiv preprint arXiv:2402.09739.

Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, and Chang Zhou. 2023. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182.

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2023. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694.

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333.

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. 2023a. Doremi: Optimizing data mixtures speeds up language model pretraining. arXiv preprint arXiv:2305.10429.

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. 2023b. Data selection for language models via importance resampling. arXiv preprint arXiv:2302.03169.

Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein. 2021. Detoxifying language models risks marginalizing minority voices. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2390–2397.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023a. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

Canwen Xu, Corby Rosset, Luciano Del Corro, Shweti Mahajan, Julian McAuley, Jennifer Neville, Ahmed Hassan Awadallah, and Nikhil Rao. 2023b. Contrastive post-training large language models on data curriculum. arXiv preprint arXiv:2310.02263.

Yang Xu, Yongqiang Yao, Yufan Huang, Mengnan Qi, Maoquan Wang, Bin Gu, and Neel Sundaresan. 2023c. Rethinking the instruction quality: Lift is what you need. ArXiv, abs/2312.11508.

Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, and Yang You. 2023. To repeat or not to repeat: Insights from scaling llm under token-crisis. arXiv preprint arXiv:2305.13230.

Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023a. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023b. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752.

Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, and Baharan Mirzasoleiman. 2024. Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models. arXiv preprint arXiv:2403.07384.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Neural Information Processing Systems.

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu. 2024. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952.

Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo. 2023. Selfee: Iterative self-revising llm empowered by self-feedback generation. Blog post.

Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, and Kai-Wei Chang. 2023a. Dynosaur: A dynamic growth paradigm for instruction-tuning data curation. arXiv preprint arXiv:2305.14327.

Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. 2023b. Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3063–3079, Toronto, Canada. Association for Computational Linguistics.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 2023. Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193.

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2023a. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023b. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792.

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023c. Siren's song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023a. A survey of large language models. arXiv preprint arXiv:2303.18223.

Yingxiu Zhao, Bowen Yu, Binyuan Hui, Haiyang Yu, Fei Huang, Yongbin Li, and Nevin L Zhang. 2023b. A preliminary study of the intrinsic relationship between complexity and alignment. arXiv preprint arXiv:2308.05696.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2023a. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.

Haotian Zhou, Tingkai Liu, Qianli Ma, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. 2023b. Lobass: Gauging learnability in supervised fine-tuning data. arXiv preprint arXiv:2310.13008.

Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, and Shengping Liu. 2023c. Oasis: Data curation and assessment system for pretraining of large language models. arXiv preprint arXiv:2311.12537.

A Takeaways

In the pretraining stage of LLMs:

• The coverage of more domains and a proper domain mixture ratio are important. Recently, researchers have tried to find proper domain mixture weights automatically, but these methods still leave room for improvement.

• A large amount of data is widely considered critical, and proper data repetition may also bring positive impacts on model performance.

• Data quality control is necessary and is usually performed in a fixed order, namely quality filtering, deduplication, and toxicity filtering (a minimal sketch of such a pipeline is given at the end of this appendix). However, overly aggressive quality and toxicity filtering may lead to performance degradation and social biases, which is still under-explored.

• Data diversity and temporal misalignment also have impacts on model performance, both of which call for future study.

In the supervised fine-tuning stage of LLMs:

• Multitask fine-tuning is widely adopted nowadays. However, conflicts may exist among tasks and hinder the model's abilities, so dealing with negative task conflict still calls for better answers. Ensembling multiple single-task experts instead of training one multitask model is also emerging as a new trend.

• Quality control is usually achieved through heuristics, human evaluation, or LLMs as quality judges. Instruction diversity and complexity are also beneficial and are enhanced by several works. The exploration of more diverse and complex instructions is still an open question.

• Studies have shown that the SFT of LLMs relies more on data quality than on data quantity. However, digging deeper into the influence of data quantity, some researchers find that learning different tasks may require different amounts of data.

• Instead of keeping instruction datasets unchanged during fine-tuning, some works propose to adjust the datasets dynamically throughout fine-tuning. Specialized fine-tuning strategies that utilize instruction data more efficiently also continue to emerge.
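As a concrete illustration of the pretraining-stage quality-control order discussed above (quality filtering, then deduplication, then toxicity filtering), the following is a minimal sketch of such a pipeline. The heuristic thresholds, the blocklist, and the use of the `datasketch` MinHash library are illustrative assumptions, not the exact setup of any cited work.

```python
# Minimal sketch of a pretraining-data cleaning pipeline:
# heuristic quality filtering -> near-deduplication -> toxicity filtering.
from datasketch import MinHash, MinHashLSH

BLOCKLIST = {"badword1", "badword2"}  # hypothetical toxicity word list


def passes_quality_heuristics(text: str) -> bool:
    """Gopher-style checks: word count, mean word length, symbol-to-word ratio."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):
        return False
    if (text.count("#") + text.count("...")) / len(words) > 0.1:
        return False
    return True


def is_toxic(text: str) -> bool:
    """Crude word-list check; real systems use classifiers such as the Perspective API."""
    return bool(set(text.lower().split()) & BLOCKLIST)


def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # character 5-grams
        m.update(shingle.encode("utf8"))
    return m


def clean_corpus(docs):
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # index of kept documents
    kept = []
    for i, doc in enumerate(docs):
        if not passes_quality_heuristics(doc):
            continue
        m = minhash(doc)
        if lsh.query(m):          # a near-duplicate was already kept
            continue
        lsh.insert(f"doc-{i}", m)
        if is_toxic(doc):
            continue
        kept.append(doc)
    return kept
```

Production pipelines such as CCNet, RefinedWeb, or the Gopher rules implement far richer heuristics, learned quality classifiers, and toxicity classifiers, but they follow the same general ordering.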
B Other Aspects of Data Management For LLMs

B.1 Social Bias

Besides the marginalization of minority groups caused by data detoxification mentioned in Section 2.3.3, several works (Kurita et al., 2019; Nangia et al., 2020; Meade et al., 2022; Feng et al., 2023) find that pre-trained LLMs can capture the social biases contained in the large amounts of training text. Evaluating on the C4.EN (Raffel et al., 2020) dataset, Dodge et al. (2021) recommend documenting the social biases and representational harms, as well as the excluded voices and identities, in large web text corpora. Using a dataset of U.S. high school newspaper articles, Gururangan et al. (2022) also argue that the quality filters used for GPT-3 (Brown et al., 2020) prefer newspapers published by larger schools located in wealthier, educated, and urban ZIP codes, leading to a language ideology. Feng et al. (2023) conduct a comprehensive case study on the effects of media political biases in the pretraining corpus on the fairness of hate speech detection and misinformation detection w.r.t. partisan leanings, and on how such bias is propagated to language models and further to downstream tasks.

As addressed in previous research, there is still a large gap between current prominent LLMs and ideal LLMs free of social biases. Many questions are worth exploring, such as how to mitigate the potential biases in pretraining datasets, the existence of bias in SFT datasets, and whether it is feasible to reduce social bias through SFT.

B.2 Prompt Design

Current instructions are either heuristically designed by humans (Wang et al., 2022; Köpf et al., 2023) or synthetically generated by prominent models (Peng et al., 2023; Ding et al., 2023). The choice of prompts might cause significant variation in model performance (Gonen et al., 2022; Weber et al., 2023). Early attempts include manually reformulating prompts into ones that are easier for language models to follow (Mishra et al., 2022) and choosing the prompts with the lowest perplexity to obtain the largest gains in model performance (Gonen et al., 2022); a minimal sketch of the latter is given at the end of this subsection. Recently, Liang et al. (2023) develop a format transfer framework, UIT, to transfer instructions from different datasets into unified formats automatically.

Some works focus on studying the impact of prompt phrasing. Khashabi et al. (2022) surprisingly find that the discretized interpretation of continuous prompts is not always consistent with the discrete prompts describing the same task, as heuristically expected. Yin et al. (2023b) find that removing the descriptions of the task output, especially the label information, might be the only cause of performance degradation. They also propose an automatic task-definition compression algorithm that removes almost half or more of the tokens while improving model performance. Kung and Peng (2023) likewise remove all semantic components in task definitions except the output space information. They achieve comparable model performance using the modified task definitions and delusive examples containing incorrect input-output mappings. Based on these experimental results, they cast doubt on the performance gains of fine-tuned models and state that the models may only learn superficial patterns during instruction tuning.

Besides the choice of phrasing, the generation source of prompts is another factor in prompt design. Gudibande et al. (2023) raise questions about fine-tuning a weaker language model on the outputs of a stronger model and find that the imitation model may adapt to mimic the stronger model's style but not its functionality. Similarly, Song et al. (2023) observe that human-designed data can outperform synthetically generated data from GPT-4 (OpenAI, 2023) to a relatively large extent.
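To make the perplexity-based selection of Gonen et al. (2022) concrete, the sketch below scores candidate phrasings of the same task with an off-the-shelf causal LM and keeps the lowest-perplexity one. The model name and candidate prompts are placeholders for illustration, not those used in the original study.

```python
# Minimal sketch: pick the candidate prompt with the lowest LM perplexity.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM can serve as the scorer
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()


def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())


candidates = [
    "Classify the sentiment of the following review as positive or negative.",
    "Given a review, decide whether the sentiment it expresses is positive or negative.",
]
best_prompt = min(candidates, key=perplexity)
print(best_prompt)
```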
B.3 Hallucinations

Despite their strong capabilities, LLMs are notorious for their hallucinations, i.e., the generation of input-, context-, or fact-conflicting content (Zhang et al., 2023c). Several works on hallucination trace its occurrence back to the lack of pertinent knowledge and the internalization of false knowledge from the pretraining corpora (Li et al., 2022a; McKenna et al., 2023; Dziri et al., 2022). To mitigate hallucination, the curation of pretraining corpora is adopted by many LLMs, mainly focusing on extracting high-quality data, e.g., GPT-3 (Brown et al., 2020), Llama 2 (Touvron et al., 2023b), and Falcon (Penedo et al., 2023). Manually curated (Zhou et al., 2023a) and automatically selected (Chen et al., 2023c; Cao et al., 2023; Lee et al., 2023b) high-quality instruction data, such as selection based on LLM quality grading (see the sketch below), are also experimentally shown to be effective in reducing hallucination during the SFT stage. Previous research thus suggests that data management in both the pretraining and SFT stages can be a promising solution to hallucination.
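As an illustration of automatic selection of high-quality instruction data (e.g., AlpaGasus-style grading by a strong LLM), the following is a minimal sketch. The client library, model name, grading prompt, and the 4.5 score threshold are assumptions for illustration, not the exact setup of Chen et al. (2023c); it assumes an OpenAI-compatible endpoint and the `openai` package.

```python
# Minimal sketch: grade instruction-response pairs with an LLM judge and keep the best.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADER_PROMPT = (
    "Rate the quality of the following instruction-response pair on a scale "
    "from 1 to 5 (accuracy, helpfulness, clarity). Reply with the number only.\n\n"
    "Instruction: {instruction}\nResponse: {response}"
)


def grade(instruction: str, response: str, model: str = "gpt-4o-mini") -> float:
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(instruction=instruction, response=response),
        }],
        temperature=0,
    )
    try:
        return float(completion.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparsable grades are treated as low quality


def filter_sft_data(pairs, threshold: float = 4.5):
    """Keep only pairs whose judged quality meets the (hypothetical) threshold."""
    return [p for p in pairs if grade(p["instruction"], p["output"]) >= threshold]
```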
C Related Surveys
As LLMs draw more and more attention, a hand-
ful of surveys have been published or preprinted
addressing different aspects of their development.
Related to our work, several of them also include
parts of the data preparation process in the pretrain-
ing or SFT of LLM. Zhao et al. (2023a) review the
development of LLMs and the latest advancements
covering a wide range of topics. Yang et al. (2023a)
also provide an overview of the LLM evolution and
discuss the related techniques from model, data,
and downstream tasks. Also concentrating on data,
Zha et al. (2023) introduce data-centric AI and
its related tasks and methods for general machine
learning models instead of LLMs. Zhang et al.
(2023b) survey the instruction tuning of LLMs and
its related methodologies, data construction, appli-
cations, and so on. Wang et al. (2023e) review the
technologies aligning LLMs with human expecta-
tions including data collection, training methodolo-
gies, and model evaluation.
Unlike previous surveys, this survey provides
a systematic and detailed overview of data man-
agement at both the pretraining and SFT stages
of LLMs. We focus on the proper organization
of training datasets and discuss recent research
addressing the effects of different data manage-
ment strategies, the evaluation of curated train-
ing datasets, and the latest advances in training
data management strategies, providing a guiding
resource for practitioners aiming to build powerful
LLMs through efficient data management.

D Comparison of Data Management Strategies Used by Representative LLMs
We provide two summary tables, Table 1 for pretrained LLMs and Table 2 for SFT LLMs, which give a clearer comparison of the data management strategies used by current representative LLMs.

E Taxonomy
The full taxonomy of research discussed in this survey is illustrated in Figure 3.
Pretrained LLMs | Open-sourced | Quantity | Quality Filters | Deduplication | Toxicity Filters | Domain Composition
T5 (Raffel et al., 2020) | √ | 750GB | Heuristic | N-gram | Heuristic | 99% Web, <1% Wiki
GPT-3 (Brown et al., 2020) | | 499B tokens | Classifier | MinHash, LSH | | 82% Web, 16% Books, 3% Wiki
GLaM (Du et al., 2022) | | 1.6T tokens | Classifier | | | 46% Web, 28% Dialog, 20% Books, 6% Wiki
LaMDA (Thoppilan et al., 2022) | | 1.56T words | | | | 50% Dialog, 25% Web, 12.5% Wiki, 12.5% Code
Chinchilla (Hoffmann et al., 2022) | | 1.4T tokens | Heuristic | N-gram, Doc-level | Heuristic | 65% Web, 30% Books, 4% Code, 1% Wiki
AlphaCode (Li et al., 2022b) | | 715.1GB | Heuristic | Doc-level | | 100% Code
GLM (Zeng et al., 2022) | √ | 400B tokens | | | | 50% Pile, 50% Chinese Web data
BLOOM (Le Scao et al., 2023) | √ | 1.61TB text | Heuristic | SimHash, Substring clustering | Heuristic | 60% Web, 10% Books, 10% Code, 10% Academic, 5% Dialog, 5% Wiki
PaLM (Anil et al., 2023) | | 780B tokens | Classifier | Levenshtein distance | Heuristic, Classifier | 50% Dialog, 28% Web, 13% Books, 5% Code, 4% Wiki
LLaMA (Touvron et al., 2023a) | √ | 1.4T tokens | Heuristic, Classifier | Line-level, Book-level | Classifier | 82% Web, 4.5% Code, 4.5% Wiki, 4.5% Books, 2.5% Academic, 2% Dialog
Mistral (Jiang et al., 2023a) | √ | - | - | - | - | -
phi-1/1.5 (Gunasekar et al., 2023; Li et al., 2023d) | √ | 7B tokens | Classifier | | | 99% Academic, <1% Code
phi-2 (Javaheripi and Bubeck, 2023) | √ | 1.4B tokens | Classifier | | |
GPT-4 (OpenAI, 2023) | | - | - | - | - | -
LLaMA 2 (Touvron et al., 2023b) | √ | 2.0T tokens | Heuristic | | |
QWen (Bai et al., 2023) | √ | 3T tokens | Heuristic, Classifier | Exact Match, MinHash, LSH | Classifier | Web, Books, Codes, Academic
Deepseek LLM (Bi et al., 2024) | √ | - | - | - | - | -

Table 1: The data management strategies used by representative pretrained models. The blank units mean no specific design of corresponding strategies according to the original papers. The "-" means the data management process is not released. Part of the data is adopted from Longpre et al. (2023b).
SFT LLMs | Dataset | Quantity | Quality Control | Diversity Control | Complexity Enhancing | No. of Tasks | Task Balancing
Tk-Instruct (Wang et al., 2022) | NIv2 | 5M | Heuristic, Human | | | 1616 | Limited instances per task
Flan-T5 (Longpre et al., 2023a) | Flan 2022 | 15M | | Input Inversion | | 1836 | Experiments, intuitions
OPT-IML (Iyer et al., 2022) | OPT-IML Bench | 18M | | | | 2000 | Experiments
Alpaca (Taori et al., 2023) | Alpaca | 52K | Heuristic | ROUGE-L similarity | | 80 |
Vicuna (Chiang et al., 2023) | ShareGPT | 70K | Heuristic | | | |
LIMA (Zhou et al., 2023a) | LIMA | 1K | Heuristic, Human | Heuristic, Human | | |
Dolly (Conover et al., 2023) | dolly-15k | 15K | Human | | | |
Orca (Mukherjee et al., 2023) | sampled Flan 2022 | 5M | | | Chat-GPT/GPT-4 augmentation | |
WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023b), WizardMath (Luo et al., 2023a) | WizardLM, WizardCoder, WizardMath | 250K | | Evol-Instruct | Evol-Instruct | |
AlpaGasus (Chen et al., 2023c) | AlpaGasus | 9K | Chat-GPT grading | | | |
Platypus (Lee et al., 2023b) | Open-Platypus | 25K | Dedup, Heuristic | | | |
OpenChat (Wang et al., 2023a) | ShareGPT | 6K | C-RLFT | | | |
MAmmoTH (Yue et al., 2023) | MathInstruct | 260K | | 7 math fields | Combining CoT and PoT | |

Table 2: The data management strategies used by representative supervised fine-tuned models. The blank units mean no specific design of corresponding strategies according to the original papers. "NIv2" is the abbreviation for "Super-NaturalInstructions". "Dedup" is the abbreviation for "Deduplication".
Figure 3 (rendered here as an outline): Taxonomy of research in data management for pretraining and supervised fine-tuning of Large Language Models (LLMs).

Data Management
- Pretraining (§2)
  - Domain Composition (§2.1): Longpre et al. (2023b), Nijkamp et al. (2023), Shen et al. (2023), Xie et al. (2023b), Xie et al. (2023a), Fan et al. (2023), Ye et al. (2024), Xia et al. (2023)
  - Data Quantity (§2.2)
    - Scaling Laws: Kaplan et al. (2020), Hoffmann et al. (2022), Su et al. (2024)
    - Data Repetition: Villalobos et al. (2022), Muennighoff et al. (2023), Hernandez et al. (2022), Xue et al. (2023), Tirumala et al. (2023)
  - Data Quality (§2.3)
    - Quality Filtering: Gao (2021), Kreutzer et al. (2022), Gunasekar et al. (2023), Li et al. (2023d), Penedo et al. (2023), Marion et al. (2023), Longpre et al. (2023b), Kaddour (2023), Javaheripi and Bubeck (2023), Gan et al. (2023), Wettig et al. (2024)
    - Deduplication: Lee et al. (2021), Kandpal et al. (2022), Silcock et al. (2022), Abbas et al. (2023)
    - Toxicity Filtering: Luccioni and Viviano (2021), Xu et al. (2021), Welbl et al. (2021), Longpre et al. (2023b)
    - Diversity & Age: Lee et al. (2023a), Maharana et al. (2023), Longpre et al. (2023b)
    - Social Bias*: Dodge et al. (2021), Meade et al. (2022), Gururangan et al. (2022), Feng et al. (2023)
    - Hallucinations*: Li et al. (2022a), McKenna et al. (2023), Dziri et al. (2022)
  - Relations Among Different Aspects (§2.4): Ge et al. (2024a), Goyal et al. (2024), Bi et al. (2024), Shen et al. (2023), Longpre et al. (2023b)
- Supervised Fine-Tuning (§3)
  - Task Composition (§3.1): Wei et al. (2021), Wang et al. (2022), Sanh et al. (2022), Chung et al. (2022), Longpre et al. (2023a), Jang et al. (2023), Chen et al. (2023b), Xia et al. (2024), Dong et al. (2023), Iyer et al. (2022), Wang et al. (2023b), Ivison et al. (2023), Lee et al. (2024)
  - Data Quality (§3.2)
    - Instruction Quality: Chia et al. (2023), Zhou et al. (2023a), Li et al. (2023b), Li et al. (2024b), Ding et al. (2023), Wang et al. (2023d), Li et al. (2023c), Zhou et al. (2023b), Cao et al. (2023), Madaan et al. (2023), Du et al. (2023), Li et al. (2024c), Lu et al. (2023a), Ye et al. (2023), Chen et al. (2023c), Li et al. (2023a), Li et al. (2023e), Bhatt et al. (2024), Chen and Mueller (2024), Yang et al. (2024), Mekala et al. (2024), He et al. (2024b), Liu et al. (2024)
    - Instruction Diversity: Ding et al. (2023), Zhou et al. (2023a), Bukharin and Zhao (2023), Taori et al. (2023), Lu et al. (2023b), Wang et al. (2023c), Wan et al. (2023), Wu et al. (2023), Ge et al. (2024b), Huang et al. (2024)
    - Instruction Complexity: Lu et al. (2023b), Xu et al. (2023a), Luo et al. (2023b), Mukherjee et al. (2023), Zhao et al. (2023b), He et al. (2023), Jiang et al. (2023b), Sun et al. (2024a), He et al. (2024a)
    - Prompt Design*: Mishra et al. (2022), Khashabi et al. (2022), Gonen et al. (2022), Yin et al. (2023b), Kung and Peng (2023), Liang et al. (2023), Weber et al. (2023), Gudibande et al. (2023), Song et al. (2023)
    - Hallucinations*: Zhou et al. (2023a), Chen et al. (2023c), Cao et al. (2023), Lee et al. (2023b)
  - Data Quantity (§3.3): Ji et al. (2023), Zhou et al. (2023a), Yuan et al. (2023), Zhang et al. (2024), Chen et al. (2023b), Dong et al. (2023), Song et al. (2023)
  - Dynamic Data-Efficient Learning (§3.4)
    - Training Affects Data: Attendu and Corbeil (2023), AlShikh et al. (2023), Kung et al. (2023)
    - Data Affects Training: Yin et al. (2023a), Wang et al. (2023a), Dong et al. (2023), Xu et al. (2023b), Wang et al. (2024), Sun et al. (2024a), Kim and Lee (2024)
  - Relations Among Different Aspects (§2.4): Lu et al. (2023b), Bukharin and Zhao (2023), Xu et al. (2023c)

Figure 3: Taxonomy of research in data management for pretraining and supervised fine-tuning of Large Language Models (LLM).
