Understanding the Target Audience
The primary audience for OpenThoughts consists of researchers, data scientists, and AI practitioners focused on reasoning models. These readers face recurring obstacles: the full methodologies behind strong reasoning models are rarely published, teacher inference and model training are expensive, and current data curation methods have clear limits. Their goals are to develop more effective reasoning capabilities, optimize data sourcing strategies, and boost model performance, and they tend to prefer concise, data-driven content with empirical results, case studies, technical specifications, and practical applications.
The Growing Complexity of Reasoning Data Curation
Recent reasoning models such as DeepSeek-R1 and o3 have proven remarkably effective across domains like mathematics, coding, and scientific question answering, with their gains driven by techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). However, the methodologies behind these models remain largely undisclosed, hindering further research and development. Existing open efforts also tend to commit to a single design choice, typically human-written questions or a single teacher model, because systematically exploring the extensive design space of question-answer pairs incurs considerable teacher-inference and training costs.
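To make the SFT half of that recipe concrete, the sketch below shows the core training step of distillation-style SFT: the student model is trained with cross-entropy to reproduce a teacher's answer given the question, with the loss masked on prompt tokens. This is a minimal illustration assuming a Hugging Face-style causal LM and tokenizer, not the OpenThoughts training code; `model`, `tokenizer`, and `teacher_answer` are placeholders.

```python
# Minimal sketch of distillation-style SFT (illustrative, not the
# OpenThoughts code): train the student to reproduce a teacher's answer.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, question: str, teacher_answer: str) -> torch.Tensor:
    """Cross-entropy on the answer tokens only, given the question as context."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    answer_ids = tokenizer(teacher_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

    # Mask the prompt so the loss covers only the teacher's answer tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 is ignored by cross_entropy

    logits = model(input_ids).logits
    # Standard next-token shift: position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```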
OpenThoughts: A Scalable Framework for SFT Dataset Development
OpenThoughts is a collaborative initiative involving researchers from Stanford University, the University of Washington, BespokeLabs.ai, the Toyota Research Institute, UC Berkeley, and 12 other organizations. The framework was developed across three key iterations:
- OpenThoughts-114K: This phase scales the Sky-T1 pipeline with automated verification.
- OpenThoughts2-1M: This iteration enhances data scale by diversifying question types and employing synthetic generation strategies.
- OpenThoughts3-1.2M: This final stage incorporates insights from over 1,000 ablation experiments to create a streamlined, scalable, and high-performing data curation pipeline.
The resulting model, OpenThinker3-7B, stands out with state-of-the-art performance among open-data models at the 7B scale.
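All three iterations refine the same underlying loop: source questions, generate teacher reasoning traces, verify them, and filter down to a target scale. The sketch below outlines that loop under loose assumptions; the object interfaces (`sample_questions`, `generate`, `check`) are illustrative placeholders rather than the actual OpenThoughts pipeline API.

```python
# Schematic of the curation loop the three iterations refine; the object
# interfaces here are assumptions for illustration, not the real pipeline.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    teacher_trace: str
    verified: bool

def curate(sources, teacher, verifier, target_size: int) -> list:
    dataset, seen = [], set()
    for source in sources:                        # e.g. StackExchange dumps, LLM-generated sets
        for question in source.sample_questions():
            if question in seen:                  # exact-duplicate filtering; real pipelines
                continue                          # also deduplicate near-matches
            seen.add(question)
            trace = teacher.generate(question)    # reasoning trace from the teacher model
            ok = verifier.check(question, trace)  # automated verification, as in OpenThoughts-114K
            dataset.append(Example(question, trace, ok))
    # Keep only verified question-trace pairs, then cut to the target scale.
    return [ex for ex in dataset if ex.verified][:target_size]
```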
Evaluation Insights and Benchmark Performance
The evaluation of the OpenThoughts pipeline offers crucial insights concerning question sourcing, mixing, filtering, and teacher models. Noteworthy findings include:
- CodeGolf and competitive coding questions show the best performance in coding tasks, averaging scores between 25.3 and 27.5.
- Questions generated by large language models (LLMs) and those written by humans perform best on mathematics benchmarks, scoring 58.8 and 58.5 respectively.
- For scientific topics, questions sourced from Physics StackExchange paired with chemistry textbook extracts achieve the highest scores, ranging from 43.2 to 45.3.
Interestingly, combining many diverse question sources can degrade performance: selecting only the strongest sources yields roughly a 5% accuracy improvement over broader mixing strategies. Among teacher models, QwQ-32B outperforms DeepSeek-R1 for knowledge distillation, improving student accuracy by 1.9 to 2.6%.
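That mixing result suggests a simple selection strategy: evaluate each question source independently and keep only the top performers rather than pooling everything. The sketch below illustrates the idea; `train_and_eval` stands in for a full SFT run plus benchmark evaluation and is not a real API.

```python
# Hedged sketch of "fewer, better sources": rank question sources by
# downstream accuracy and keep the top-k instead of mixing all of them.

def select_sources(sources, train_and_eval, k: int = 2):
    """Rank question sources by benchmark accuracy; return the top-k."""
    scored = sorted(
        ((train_and_eval(src), src) for src in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Per the ablations, mixing only the best sources beat broader
    # mixing strategies by roughly 5% accuracy.
    return [src for _, src in scored[:k]]
```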
Conclusion
The OpenThoughts project exemplifies how systematic experimentation can advance SFT data curation for reasoning models. OpenThoughts3-1.2M is a state-of-the-art open-data reasoning dataset spanning science, mathematics, and coding, and the OpenThinker3-7B model achieves the best performance among open-data reasoning models at its scale. Several directions remain unexplored, including reinforcement learning strategies, staged fine-tuning, and curriculum learning. Future research should focus on understanding cross-domain transfer effects, how gains in individual domains trade off against overall performance, and scaling dynamics as student models approach teacher capabilities.
Further Reading and Resources
For more in-depth information, see the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers involved in the project.