OpenThoughts: Revolutionizing SFT Data Curation for Advanced Reasoning Models

Understanding the Target Audience

The primary audience for OpenThoughts consists of researchers, data scientists, and AI practitioners focused on improving reasoning models. They face challenges such as the lack of openly documented methodologies for building these models, the high cost of teacher inference and model training, and the limitations of current data curation methods. Their goals are to develop stronger reasoning capabilities, optimize data sourcing strategies, and boost model performance. They typically prefer concise, data-driven content that presents empirical results and case studies, along with technical specifications and practical business applications of AI.

The Growing Complexity of Reasoning Data Curation

Recent reasoning models, such as DeepSeek-R1 and o3, have proven remarkably effective across domains including mathematics, coding, and scientific inquiry. These gains come from techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). However, the methodologies behind these models remain largely undisclosed, which hinders further research and development. Existing open initiatives typically commit to a single design choice, usually human-written questions or a single teacher model, because exploring the broader design space of question-answer pairs incurs considerable costs in teacher inference and model training.

OpenThoughts: A Scalable Framework for SFT Dataset Development

OpenThoughts is a collaborative initiative involving researchers from Stanford University, the University of Washington, BespokeLabs.ai, the Toyota Research Institute, UC Berkeley, and 12 other organizations. The framework follows a progressive approach divided into three key iterations (a minimal sketch of the resulting curation loop appears after the list):

  • OpenThoughts-114K: This phase scales the Sky-T1 pipeline with automated verification.
  • OpenThoughts2-1M: This iteration enhances data scale by diversifying question types and employing synthetic generation strategies.
  • OpenThoughts3-1.2M: This final stage incorporates insights from over 1,000 ablation experiments to create a streamlined, scalable, and high-performing data curation pipeline.
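
To make the pipeline concrete, here is a minimal Python sketch of the generate-verify-keep loop that the iterations above build on. This is a sketch under stated assumptions: sample_teacher_answer, verify, and curate are hypothetical stubs, not the OpenThoughts codebase, and real verification would use per-domain checks such as exact-answer matching, unit-test execution, or LLM judging.

```python
# Hypothetical sketch of the generate-verify-keep curation loop.
# Names and stubs are illustrative assumptions, not the OpenThoughts code.
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    domain: str  # e.g. "math", "code", "science"


def sample_teacher_answer(question: str) -> str:
    """Stub for querying a teacher reasoning model (e.g. QwQ-32B)."""
    return "<chain of thought>...</chain of thought> final answer"


def verify(pair: QAPair) -> bool:
    """Stub for automated verification: exact-answer matching for math,
    unit-test execution for code, LLM judging for open-ended science."""
    return bool(pair.answer.strip())  # placeholder criterion


def curate(seed_questions: list[tuple[str, str]], k: int = 1) -> list[QAPair]:
    """Sample k teacher answers per question; keep only verified pairs."""
    dataset: list[QAPair] = []
    for question, domain in seed_questions:
        for _ in range(k):
            pair = QAPair(question, sample_teacher_answer(question), domain)
            if verify(pair):
                dataset.append(pair)
    return dataset


if __name__ == "__main__":
    seed = [("Compute 17 * 24.", "math"), ("Reverse a linked list.", "code")]
    print(f"kept {len(curate(seed, k=2))} verified pairs")
```

The key design choice mirrored here is that filtering happens after teacher generation, so every retained pair is a verified question-answer trace rather than a raw model sample.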

The resulting model, OpenThinker3-7B, achieves state-of-the-art performance among open-data models at the 7B scale.

Evaluation Insights and Benchmark Performance

The evaluation of the OpenThoughts pipeline offers crucial insights concerning question sourcing, mixing, filtering, and teacher models. Noteworthy findings include:

  • CodeGolf and competitive coding questions show the best performance in coding tasks, averaging scores between 25.3 and 27.5.
  • Questions generated by large language models (LLMs) and questions written by humans both perform well on mathematical tasks, scoring 58.8 and 58.5, respectively.
  • For scientific topics, questions sourced from Physics StackExchange paired with chemistry textbook extracts achieve the highest scores, ranging from 43.2 to 45.3.

Interestingly, mixing many question sources can degrade performance: selecting a small set of high-performing sources outperforms broader mixing strategies by roughly 5% accuracy. Among teacher models, QwQ-32B outperforms DeepSeek-R1 as a distillation teacher, improving accuracy by 1.9 to 2.6%.
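
The source-selection finding can be illustrated with a small, hypothetical sketch: score each question source in a cheap small-scale ablation, then keep only the top performers instead of mixing everything. All source names and scores below are invented placeholders, not the paper's measurements.

```python
# Hypothetical illustration of source selection: rank question sources by
# small-scale ablation accuracy and keep only the top performers, rather
# than mixing every source. All scores are invented placeholders.

def select_top_sources(ablation_scores: dict[str, float], top_k: int = 2) -> list[str]:
    """Return the top_k question sources ranked by ablation accuracy."""
    ranked = sorted(ablation_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


if __name__ == "__main__":
    scores = {  # placeholder per-source accuracies from a mock ablation sweep
        "codegolf": 27.5,
        "competitive_coding": 25.3,
        "forum_scrape": 21.0,
        "generic_web_qa": 19.4,
    }
    print(select_top_sources(scores, top_k=2))
```

The design intuition is that ablation-driven pruning of weak sources concentrates the training budget on questions that demonstrably transfer to downstream benchmarks.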

Conclusion

The OpenThoughts project shows how systematic experimentation can advance SFT data curation for reasoning models. OpenThoughts3-1.2M is a state-of-the-art open-data reasoning dataset spanning science, mathematics, and coding, and OpenThinker3-7B delivers leading performance among open-data reasoning models at its scale. Several directions remain unexplored, including reinforcement learning strategies, staged fine-tuning, and curriculum learning. Future research should examine cross-domain transfer effects, weighing individual-domain gains against overall performance, and study how scaling behaves as student models approach teacher capability.

Further Reading and Resources

For more in-depth information, see the Paper, Project Page, and GitHub Page.


