
Large-scale reinforcement learning (RL) training of language models is proving effective for solving complex problems. Recent models, such as OpenAI's o1 and DeepSeek's R1-Zero, have shown that reasoning performance scales well with training compute. Building on these advances, this paper introduces an approach called Reasoner-Zero training.
Researchers from StepFun and Tsinghua University have developed Open-Reasoner-Zero (ORZ), an open-source framework for large-scale reasoning-oriented RL training. This initiative aims to make advanced RL techniques accessible to the research community. ORZ improves various reasoning skills, including arithmetic, logic, coding, and common-sense reasoning, while addressing challenges like training stability and performance optimization.
The ORZ framework uses the Qwen2.5-7B and Qwen2.5-32B models as its base and applies large-scale RL training directly, without a prior supervised fine-tuning stage. Rather than a heavily modified algorithm, it relies on standard PPO with rule-based rewards, adapted for reasoning tasks. The training dataset consists of curated question-answer pairs focused on STEM and reasoning problems, paired with a dedicated prompt template designed to elicit step-by-step reasoning. The implementation builds on OpenRLHF, with a flexible trainer and additional support for efficient large-scale training.
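To make the training recipe more concrete, the sketch below illustrates two ingredients of this kind of setup: a rule-based reward that simply checks the final answer, and generalized advantage estimation (GAE) as used in PPO, which with discount and lambda both set to 1 collapses to the outcome reward minus the value baseline. This is a minimal illustration under stated assumptions, not the authors' code; the <answer> tag convention, function names, and hyperparameter choices shown here are assumptions made for the example.

```python
# Minimal sketch (not the ORZ implementation) of a rule-based reward and
# GAE-style advantage estimation for outcome-rewarded RL on reasoning tasks.
# The <answer>...</answer> tag format is an illustrative assumption.

import re
from typing import List


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the tagged final answer matches the reference, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation.

    With gamma = lam = 1 (an assumption here), this reduces to
    A_t = (sum of rewards from step t to the end) - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = 0.0  # value after the terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Example: a sparse reward of 1.0 given only at the final token of a correct response.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.3, 0.5, 0.8]
print(gae_advantages(rewards, values))  # [0.8, 0.7, 0.5, 0.2]
```

Running the example prints [0.8, 0.7, 0.5, 0.2]: with a single terminal reward, each step's advantage is simply the final correctness signal minus the critic's value estimate at that step.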
Results indicate significant performance gains for both the 7B and 32B models of Open-Reasoner-Zero. Training shows consistent improvements in reward and response length, including a notable "step moment" phenomenon in which reasoning ability improves abruptly. The Open-Reasoner-Zero-32B model reaches response lengths comparable to DeepSeek-R1-Zero while requiring only a fraction of the training steps, demonstrating the effectiveness of the simplified training approach.
Across evaluation benchmarks, Open-Reasoner-Zero performs strongly, particularly in the 32B configuration, which outperforms DeepSeek-R1-Zero despite far fewer training steps. The 7B variant shows interesting learning dynamics of its own, with steady accuracy gains and notable growth in response length, and the "step moment" appears as sudden jumps in evaluation performance.
This research marks a significant step toward democratizing large-scale reasoning-oriented RL training for language models. It shows that a straightforward recipe of basic PPO and rule-based rewards can yield results competitive with more complex systems, and its success without elaborate algorithmic modifications suggests that simpler training setups can still produce strong reasoning capabilities. By open-sourcing the training pipeline and sharing practical insights, this work lays the groundwork for future progress in scaling the reasoning abilities of language models.
Explore how artificial intelligence can enhance your business processes. Identify areas for automation and customer interactions where AI can add value. Establish key performance indicators (KPIs) to measure the impact of your AI investments. Choose tools that align with your needs and allow for customization. Start with a small project, evaluate its effectiveness, and gradually expand your AI applications.
If you need assistance in managing AI for your business, contact us at hello@itinai.ru. Connect with us on Telegram, X, and LinkedIn.