
NVIDIA Introduces CLIMB: A Framework for Optimizing Language Model Pretraining Data
Understanding the Challenges in Pretraining Data Selection
As large language models (LLMs) continue to grow in complexity and capability, selecting the right pretraining data becomes crucial for achieving optimal performance. Many LLMs rely on extensive datasets like Common Crawl, which, while comprehensive, often lack specific domain labels. This makes it challenging to create data mixtures that effectively balance general knowledge with specialized expertise.
Traditional approaches to dataset curation, exemplified by manually assembled corpora such as The Pile, are labor-intensive and do not scale well. Furthermore, the relationship between data composition and model performance is complex, which makes it difficult to identify the right proportion of each domain in the training mix. These challenges highlight the need for automated, scalable, and adaptive data selection methods.
Introducing CLIMB: A Solution for Data Mixture Optimization
To tackle these issues, NVIDIA researchers have developed CLIMB—CLustering-based Iterative Data Mixture Bootstrapping. This innovative framework automates the discovery and refinement of data mixtures tailored for language model pretraining, combining unsupervised clustering with iterative optimization.
How CLIMB Works
The CLIMB process begins by embedding large volumes of text data into a semantic space using pretrained encoders. K-means clustering organizes this data into coherent groups, which are then pruned and merged based on quality and redundancy. This step lays the groundwork for constructing candidate data mixtures.
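As a rough illustration of this stage, the sketch below takes precomputed document embeddings, clusters them with k-means, and then prunes and merges clusters. The cluster count, pruning threshold, and merge distance are placeholder values for illustration, not CLIMB's actual settings, and the sketch prunes only by cluster size; the quality-based filtering mentioned above would require an additional per-cluster scoring step.

```python
# Minimal sketch of the clustering stage, assuming document embeddings from a
# pretrained encoder are already available. Cluster count, pruning threshold,
# and merge distance are illustrative values, not CLIMB's hyperparameters.
import numpy as np
from sklearn.cluster import KMeans

def cluster_corpus(embeddings: np.ndarray, n_clusters: int = 20,
                   min_cluster_size: int = 1000, merge_dist: float = 0.1):
    """Group document embeddings into semantic clusters, then prune and merge them."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)
    centroids = kmeans.cluster_centers_

    # Prune clusters that are too small to serve as a mixture component.
    kept = [c for c in range(n_clusters) if np.sum(labels == c) >= min_cluster_size]

    # Merge redundant clusters whose centroids are nearly identical.
    groups: dict[int, list[int]] = {}
    for c in kept:
        match = next((g for g in groups
                      if np.linalg.norm(centroids[c] - centroids[g]) < merge_dist), None)
        groups.setdefault(c if match is None else match, []).append(c)

    return labels, groups  # per-document cluster assignments and surviving cluster groups
```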
After forming candidate mixtures, CLIMB employs proxy models to evaluate their effectiveness. A regression-based predictor, such as LightGBM, estimates the performance of these mixtures. Through an iterative bootstrapping process, CLIMB refines the sampling space, focusing on configurations that yield the best results, all while adhering to a fixed compute budget.
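The loop below is a hedged sketch of that search process under simplifying assumptions: candidate mixtures are sampled from a Dirichlet distribution, each is scored by a stubbed-out proxy run (`train_proxy_and_eval` is a placeholder, not CLIMB's code), a LightGBM regressor is fit as the performance predictor, and sampling is then re-centered on the best-predicted mixtures.

```python
# Hedged sketch of the iterative bootstrapping loop. The proxy evaluation is a
# toy stub; in CLIMB it would be a small-scale pretraining run on data sampled
# according to the candidate mixture weights.
import numpy as np
import lightgbm as lgb

def train_proxy_and_eval(weights: np.ndarray) -> float:
    # Placeholder score standing in for the downstream accuracy of a proxy model.
    return float(-np.var(weights))

def search_mixtures(n_clusters: int, n_iters: int = 3, n_candidates: int = 64, top_k: int = 8):
    rng = np.random.default_rng(0)
    history_w, history_s = [], []
    center = np.ones(n_clusters) / n_clusters          # start from a uniform mixture

    for _ in range(n_iters):
        # Sample candidate mixtures on the simplex, concentrated around the current center.
        candidates = rng.dirichlet(center * 50 + 1e-3, size=n_candidates)

        # Score every candidate with the (stubbed) proxy evaluation.
        scores = [train_proxy_and_eval(w) for w in candidates]
        history_w.extend(candidates)
        history_s.extend(scores)

        # Fit the regression-based predictor on all observations so far.
        predictor = lgb.LGBMRegressor(n_estimators=200)
        predictor.fit(np.array(history_w), np.array(history_s))

        # Narrow the sampling space toward the best-predicted configurations.
        best = candidates[np.argsort(predictor.predict(candidates))[-top_k:]]
        center = best.mean(axis=0)

    return history_w[int(np.argmax(history_s))]        # best mixture found under the budget
```

Unlike this toy loop, which runs the proxy on every candidate, the actual framework would reserve proxy training for the most promising configurations, which is how the fixed compute budget is respected.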
Technical Insights and Design Considerations
The optimization approach in CLIMB is structured as a bi-level problem. At the lower level, proxy models are trained on candidate mixtures, while the upper level involves learning a predictor to approximate performance outcomes. This dual approach enhances the efficiency of exploring the mixture space.
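Schematically, and with notation assumed here for illustration rather than taken from the paper, the two levels can be written as follows, where α denotes the mixture weights over K clusters, θ the proxy-model parameters, and g the learned predictor:

```latex
\[
\begin{aligned}
\alpha^{*} \;&=\; \arg\max_{\alpha \in \Delta^{K-1}} \; g(\alpha)
  && \text{(upper level: predictor-guided search over mixtures)} \\
\text{with}\quad g(\alpha) \;&\approx\; \operatorname{Perf}\!\bigl(\theta^{*}(\alpha)\bigr),
\qquad
\theta^{*}(\alpha) \;=\; \arg\min_{\theta} \; \mathcal{L}_{\mathrm{train}}\bigl(\theta;\, D(\alpha)\bigr)
  && \text{(lower level: proxy-model training)}
\end{aligned}
\]
```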
CLIMB promotes sparsity in mixture weights, facilitating the identification of compact, domain-relevant data subsets. By utilizing clustering based on embeddings rather than token-level features, CLIMB ensures semantic coherence within clusters. The iterative refinement process is designed to balance exploration breadth with predictive accuracy, and studies confirm that optimized compute allocation improves convergence and final model performance.
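The sparsity point can be illustrated with a small helper that zeroes out negligible mixture weights and renormalizes the rest; the threshold value is an assumption for illustration, not a CLIMB hyperparameter.

```python
# Illustrative helper for sparsifying mixture weights: components below a
# threshold are dropped and the remainder renormalized, so the final mixture
# concentrates on a compact set of domain-relevant clusters.
import numpy as np

def sparsify_mixture(weights: np.ndarray, threshold: float = 0.02) -> np.ndarray:
    """Zero out weights below `threshold` and renormalize the rest to sum to 1."""
    pruned = np.where(weights >= threshold, weights, 0.0)
    if pruned.sum() == 0.0:            # guard: keep the original mixture if everything was pruned
        return weights
    return pruned / pruned.sum()

# Example: a 10-cluster mixture typically collapses to a handful of dominant clusters.
w = np.random.default_rng(1).dirichlet(np.ones(10) * 0.3)
print(sparsify_mixture(w))
```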
Empirical Evaluation and Results
CLIMB was evaluated on several general reasoning benchmarks, including PIQA, ARC, HellaSwag, and WinoGrande. A 1-billion-parameter model trained on CLIMB-discovered mixtures achieved an average accuracy of 60.41%, surpassing baseline methods such as DoReMi and RegMix.
When pretraining was extended to 400 billion tokens, the resulting model outperformed Llama-3.2-1B by 2% across a wide range of benchmarks. In the sub-500-million-parameter category, CLIMB-based pretraining likewise consistently improved performance over models such as SmolLM and TinyLlama.
The benefits of CLIMB are particularly evident in domain-specific settings. On targeted MMLU benchmarks spanning STEM, humanities, and social sciences, CLIMB-trained models outperformed both random-selection and exhaustive-search baselines, demonstrating the framework’s ability to steer pretraining toward target domains.
Supporting Resources for Further Research
To promote reproducibility and facilitate further research, NVIDIA has released two valuable resources:
- ClimbLab: A 1.2 trillion-token corpus organized into 20 semantic clusters.
- ClimbMix: A 400 billion-token optimized mixture for efficient pretraining.
Models trained on ClimbMix have shown superior performance compared to those trained on datasets like Nemotron-CC and SmolLM, even under equivalent token budgets, highlighting improved scalability.
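For experimentation, the corpora can presumably be pulled from the Hugging Face Hub; the dataset IDs, split name, and streaming usage below are assumptions rather than details confirmed in the announcement.

```python
# Sketch of loading the released corpora for inspection. The dataset IDs and
# the "train" split are assumptions; adjust them to the actual repository names.
from datasets import load_dataset

climblab = load_dataset("nvidia/ClimbLab", split="train", streaming=True)   # 1.2T-token clustered corpus
climbmix = load_dataset("nvidia/ClimbMix", split="train", streaming=True)   # 400B-token optimized mixture

for example in climblab.take(3):   # streaming datasets expose .take() in recent `datasets` releases
    print(example)
```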
Conclusion
CLIMB represents a significant advancement in the optimization of data mixtures for LLM pretraining. By integrating semantic clustering with an iterative, proxy-based search, CLIMB eliminates the need for manual annotations or static heuristics. This adaptable framework caters to both generalist and specialist training objectives while accommodating varying compute and data constraints.
The empirical results underscore the critical role of data mixture optimization in maximizing model utility, particularly when operating within fixed resource budgets. As organizations increasingly recognize the importance of effective data management in AI, CLIMB offers a scalable and principled alternative to traditional data curation methods.