
Enhancing AI Model Performance Through Data Optimization
Understanding the Challenge of Data Selection in LLM Pretraining
Creating large language models (LLMs) requires significant computational resources, particularly when testing various pretraining datasets. Conducting full-scale comparisons—using billions of parameters and tokens—can exhaust hundreds of thousands of GPU hours for each experiment. As a result, many practitioners opt for smaller-scale tests as substitutes for larger models. Unfortunately, these preliminary studies often go unpublished, leading to a fragmented research landscape where similar small-scale tests are repeated without standardized benchmarks or methodologies. This lack of transparency hampers reproducibility, limits collective insights, and clouds the understanding of the trade-offs between computational investment and model performance.
Introducing DataDecide
To tackle these issues, the Allen Institute for AI (AI2), in collaboration with the University of Washington and the University of Pennsylvania, has launched DataDecide. This comprehensive suite of controlled pretraining experiments encompasses 25 distinct datasets and 14 model sizes, ranging from 4 million to 1 billion parameters. The datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, along with variations created through domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed ratio of 100 tokens per parameter, a deliberately overtrained regime that trades extra training compute for cheaper inference. In total, DataDecide offers 1,050 models and more than 30,000 checkpoints, each evaluated across ten downstream tasks.
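For a concrete sense of what that fixed ratio implies, the sketch below computes training-token budgets at a few parameter counts spanning the stated 4M–1B range. The specific sizes listed are illustrative placeholders, not the exact DataDecide model grid.

```python
# Illustrative token-budget arithmetic for the 100-tokens-per-parameter regime.
# The parameter counts below are examples within the stated 4M-1B range,
# not necessarily the exact sizes used in DataDecide.

TOKENS_PER_PARAM = 100  # fixed ratio described above

example_sizes = {
    "4M": 4_000_000,
    "150M": 150_000_000,
    "1B": 1_000_000_000,
}

for name, n_params in example_sizes.items():
    n_tokens = n_params * TOKENS_PER_PARAM
    print(f"{name:>4}: {n_params:>13,} params -> {n_tokens:>15,} training tokens")
```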
Technical Structure and Practical Benefits
DataDecide organizes its experiments along three key axes:
- Data Recipes: Twenty-five well-documented pretraining corpora, each representing different curation strategies.
- Model Scale: Fourteen parameter configurations (4M–1B), derived programmatically to ensure consistent training hyperparameters across scales.
- Evaluation Suite: The OLMES benchmark comprises ten multiple-choice tasks that assess language understanding, commonsense reasoning, and general knowledge.
This framework allows researchers to reuse checkpoints for new evaluations, explore innovative prediction methods, and examine how benchmarks respond to variations in training data and model scale.
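As a minimal sketch of how a grid over these three axes might be enumerated, the snippet below builds a recipe × scale × task result cube. The identifiers are hypothetical placeholders rather than the actual DataDecide corpus, scale, or task names.

```python
from itertools import product

# Hypothetical identifiers; the real suite has 25 recipes, 14 scales, and 10 OLMES tasks.
data_recipes = ["dolma_baseline", "dclm_filtered", "c4_dedup"]
model_scales = ["4M", "150M", "1B"]
eval_tasks = ["mmlu", "hellaswag", "arc_easy"]

# Each (recipe, scale) pair identifies one pretraining run; every run is then
# evaluated on every task, giving the full recipe x scale x task grid.
experiment_grid = [
    {"recipe": recipe, "scale": scale, "task": task}
    for recipe, scale, task in product(data_recipes, model_scales, eval_tasks)
]

print(len(experiment_grid), "evaluation cells in this toy grid")  # 3 * 3 * 3 = 27
```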
Key Findings and Quantitative Insights
DataDecide’s systematic analysis has led to four actionable guidelines:
- Single-Scale Baseline Robustness: Ranking datasets by downstream accuracy at a single small scale (e.g., 150M parameters) identifies the better dataset at the 1B-parameter target scale in roughly 80% of comparisons, outperforming more complex scaling-law extrapolations (see the decision-accuracy sketch after this list).
- Task-Dependent Compute Sensitivity: The compute budget needed for a reliable dataset decision varies by task; some benchmarks become predictable with a small fraction of the target-scale compute, while others demand substantially larger pilot runs.
- Proxy Metric Selection: Continuous likelihood-based metrics are better predictors than discrete accuracy at small scales, particularly for code-related tasks where accuracy provides little signal early on (a sketch of one such metric also follows the list).
- Variance and Spread Considerations: Decisions are most reliable when run-to-run variance is low and the performance spread between datasets is large, which is why proxy metrics with a favorable signal-to-noise ratio transfer best across scales.
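To make the first finding concrete, here is a minimal sketch of pairwise decision accuracy: the fraction of dataset pairs whose ordering at a small scale matches their ordering at the target scale. The toy numbers and the tie-handling choice are illustrative assumptions, not DataDecide's actual evaluation code.

```python
from itertools import combinations

def decision_accuracy(small_scale: dict, target_scale: dict) -> float:
    """Fraction of dataset pairs ranked the same way at both scales.

    Keys are dataset names, values are downstream accuracies; ties count
    as disagreements here, which is a simplifying assumption.
    """
    pairs = list(combinations(small_scale, 2))
    agree = sum(
        1
        for a, b in pairs
        if (small_scale[a] - small_scale[b]) * (target_scale[a] - target_scale[b]) > 0
    )
    return agree / len(pairs)

# Toy accuracies purely for illustration.
acc_150m = {"recipe_a": 0.41, "recipe_b": 0.38, "recipe_c": 0.44}
acc_1b = {"recipe_a": 0.55, "recipe_b": 0.57, "recipe_c": 0.61}

print(f"decision accuracy: {decision_accuracy(acc_150m, acc_1b):.2f}")  # 0.67
```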
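Similarly, here is a minimal sketch of the kind of continuous proxy metric the third finding refers to: the average per-character log-likelihood a model assigns to the correct answer, which stays informative even when exact-match accuracy is near zero. The model interface is a hypothetical stand-in, not an actual DataDecide or OLMES API.

```python
import math
from typing import Callable, List, Tuple

# `log_prob(prompt, continuation)` is a hypothetical callable returning the total
# log-probability a model assigns to `continuation` given `prompt`.
LogProbFn = Callable[[str, str], float]

def avg_logprob_per_char(examples: List[Tuple[str, str]], log_prob: LogProbFn) -> float:
    """Mean per-character log-likelihood of the correct continuations.

    Unlike discrete accuracy, this remains smoothly informative when small
    models almost never produce the exact correct answer.
    """
    scores = [log_prob(prompt, answer) / max(len(answer), 1)
              for prompt, answer in examples]
    return sum(scores) / len(scores)

# Toy stand-in model: pretend every character gets probability 0.5.
toy_log_prob = lambda prompt, answer: len(answer) * math.log(0.5)

examples = [("def add(a, b):", " return a + b"), ("2 + 2 =", " 4")]
print(f"{avg_logprob_per_char(examples, toy_log_prob):.3f}")  # -0.693
```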
Conclusion
DataDecide transforms the process of pretraining data selection from a subjective practice into a transparent, data-driven methodology. By making all 25 corpora, 1,050 models, and over 30,000 checkpoints publicly available, AI2 encourages the research community to reproduce findings, extend evaluations, and innovate in decision-making strategies. As the demand for computational resources in LLM development grows, DataDecide provides a structured framework that minimizes wasted experiments and maximizes insights, paving the way for more efficient, reproducible, and collaborative AI research.