
Reinforcement Learning vs. Supervised Fine-Tuning: Minimizing Catastrophic Forgetting in AI

What is Catastrophic Forgetting in Foundation Models?

Foundation models, like large language models, have shown remarkable capabilities across various tasks. However, once deployed, they often become static. When these models are fine-tuned for new tasks, they can suffer from catastrophic forgetting, which refers to the loss of previously acquired knowledge. This issue hinders the development of AI systems that can learn continuously and adapt over time.

Why Does Online Reinforcement Learning Forget Less Than Supervised Fine-Tuning?

A recent MIT study highlights the differences between reinforcement learning (RL) and supervised fine-tuning (SFT). While both methods can achieve high performance on new tasks, SFT often overwrites prior knowledge, leading to forgetting. In contrast, RL tends to preserve earlier capabilities. The difference lies in how each approach modifies the model’s output distribution relative to its original policy.

How Can Forgetting Be Measured?

The research team introduced an empirical forgetting law that quantifies forgetting:

  • Forgetting ∝ KL(π₀ ∥ π)

In this equation, π₀ represents the base model and π the fine-tuned model. The forward Kullback-Leibler (KL) divergence, measured on the new task, serves as a strong predictor of the extent of forgetting. This allows researchers to quantify forgetting without needing data from previous tasks.
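As an illustration, this predictor can be estimated directly from token-level log-probabilities. The sketch below assumes two Hugging Face-style causal language models (the base model π₀ and the fine-tuned model π) and a batch of new-task inputs; the function and variable names are illustrative, not taken from the study's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_kl_on_new_task(base_model, finetuned_model, input_ids, attention_mask):
    """Estimate KL(pi_0 || pi) per token on new-task inputs.

    Under the empirical forgetting law, a larger value should predict
    a larger drop on previously learned tasks.
    """
    base_logits = base_model(input_ids=input_ids, attention_mask=attention_mask).logits
    ft_logits = finetuned_model(input_ids=input_ids, attention_mask=attention_mask).logits

    base_logp = F.log_softmax(base_logits, dim=-1)  # log pi_0(token | context)
    ft_logp = F.log_softmax(ft_logits, dim=-1)      # log pi(token | context)

    # Forward KL: sum over the vocabulary of pi_0 * (log pi_0 - log pi).
    kl_per_token = (base_logp.exp() * (base_logp - ft_logp)).sum(dim=-1)

    # Average over real (non-padding) tokens.
    mask = attention_mask.float()
    return (kl_per_token * mask).sum() / mask.sum()
```

Because the estimate uses only new-task inputs, it reflects the study's point that forgetting can be predicted without access to earlier training data.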

What Do Experiments on Large Language Models Reveal?

In experiments using the Qwen 2.5 3B-Instruct model, fine-tuning was conducted on various tasks, including:

  • Math reasoning (Open-Reasoner-Zero)
  • Science Q&A (SciKnowEval subset)
  • Tool use (ToolAlpaca)

Results indicated that RL not only improved accuracy on new tasks but also maintained performance on previous benchmarks. In contrast, SFT often compromised prior knowledge for new-task performance.

How Does RL Compare to SFT in Robotics Tasks?

In robotic control experiments using the OpenVLA-7B model in pick-and-place scenarios, RL adaptation preserved general manipulation skills across tasks. While SFT succeeded in the new task, it degraded previous manipulation abilities, further demonstrating RL’s advantage in retaining knowledge.

What Insights Come from the ParityMNIST Study?

The research team designed a simplified problem, ParityMNIST, to isolate the mechanisms of forgetting. Both RL and SFT reached high accuracy on the new task, but SFT caused much larger drops on the prior FashionMNIST benchmark. Plotting forgetting against KL divergence revealed a consistent predictive curve, confirming distributional shift as the key factor behind forgetting.
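For intuition, here is a minimal sketch of how a parity-style relabeling of MNIST might be set up, assuming torchvision is available; the exact construction in the study may differ, and all names are illustrative.

```python
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# New task: relabel MNIST digits by parity (0 = even, 1 = odd), so the
# objective differs from ordinary digit classification while reusing
# the same images.
mnist = datasets.MNIST(root="data", train=True, download=True, transform=to_tensor)
parity_labels = mnist.targets % 2  # tensor of 0/1 parity labels

# Prior-knowledge benchmark: FashionMNIST accuracy is measured before and
# after fine-tuning to quantify how much earlier capability is lost.
fashion = datasets.FashionMNIST(root="data", train=False, download=True, transform=to_tensor)
```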

Why Do On-Policy Updates Matter?

On-policy RL updates sample from the model’s own outputs and incrementally reweight them based on rewards. This process keeps learning close to the base model’s distribution. In contrast, SFT optimizes against fixed labels, which may be far from the model’s original state. Theoretical analysis shows that policy gradients in RL converge to KL-minimal optimal solutions, reinforcing RL’s advantage in minimizing forgetting.
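The contrast can be made concrete with two simplified update rules. The sketch below assumes a Hugging Face-style causal language model, a task-specific reward_fn, and a PyTorch optimizer; it is a stripped-down REINFORCE-style illustration rather than the exact algorithm used in the study, and prompt-token masking is omitted for brevity.

```python
import torch.nn.functional as F

def sft_step(model, input_ids, target_ids, optimizer):
    """SFT: push the model toward fixed labels, even if those labels are
    far from the model's current output distribution.
    (target_ids assumed already aligned with the logits positions.)"""
    logits = model(input_ids=input_ids).logits
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def on_policy_rl_step(model, prompt_ids, reward_fn, optimizer):
    """On-policy RL: sample from the model's own distribution and reweight
    those samples by reward, which keeps updates close to the current policy."""
    samples = model.generate(prompt_ids, do_sample=True, max_new_tokens=64)
    rewards = reward_fn(samples)                               # task-specific scores (tensor)

    logits = model(input_ids=samples).logits[:, :-1, :]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, samples[:, 1:].unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)                          # log pi(sample)

    loss = -(rewards * seq_logp).mean()                        # reward-weighted likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key difference is visible in the data each step consumes: SFT trains on fixed targets, while the RL step only ever increases or decreases the probability of sequences the model itself already generates.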

Are Other Explanations Sufficient?

The research team explored various alternative explanations, such as weight-space changes and hidden representation drift. However, none of these factors matched the predictive power of forward KL divergence, underscoring the importance of distributional closeness in understanding forgetting.

What Are the Broader Implications?

Model evaluation should consider KL-conservatism alongside task accuracy. Hybrid methods that combine the efficiency of SFT with an explicit KL penalty toward the base model could offer the best of both worlds, as sketched below. More broadly, RL's on-policy principle can guide the design of adaptive agents that learn new skills without losing previously acquired ones.
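One way such a hybrid objective might look is ordinary SFT cross-entropy plus an explicit forward-KL penalty toward the frozen base model. This is a hypothetical sketch, not a method proposed in the study, and beta is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def sft_with_kl_penalty(model, base_model, input_ids, target_ids, beta=0.1):
    """Hybrid loss: SFT cross-entropy plus beta * KL(pi_0 || pi),
    which penalizes drifting away from the base model's distribution."""
    logits = model(input_ids=input_ids).logits
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

    with torch.no_grad():  # the base model stays frozen
        base_logits = base_model(input_ids=input_ids).logits

    base_logp = F.log_softmax(base_logits, dim=-1)
    logp = F.log_softmax(logits, dim=-1)
    kl = (base_logp.exp() * (base_logp - logp)).sum(dim=-1).mean()

    return ce + beta * kl
```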

Conclusion

The MIT study reframes catastrophic forgetting as a distributional challenge governed by forward KL divergence. Reinforcement learning demonstrates a lower tendency to forget due to its on-policy updates, which favor KL-minimal solutions. This insight not only explains RL’s robustness but also provides a framework for developing post-training methods that support lifelong learning in foundation models.

Key Takeaways

  • Reinforcement learning preserves prior knowledge better than supervised fine-tuning.
  • Forgetting can be predicted using KL divergence, highlighting its importance in model evaluation.
  • RL’s on-policy updates ensure that learning remains close to the base model, reducing forgetting.
  • Experiments confirm RL’s robustness against forgetting across various domains.
  • Future designs for post-training should prioritize KL-conservatism to enhance learning capabilities.

FAQs

  • What is catastrophic forgetting? It is the phenomenon where a model loses previously learned knowledge when trained on new tasks.
  • How does reinforcement learning differ from supervised fine-tuning? RL tends to preserve prior knowledge better than SFT, which often overwrites it.
  • What is KL divergence? It is a measure of how one probability distribution diverges from a second, expected probability distribution.
  • Why is on-policy learning important? On-policy learning helps maintain the model’s output distribution close to its original state, reducing forgetting.
  • What implications does this research have for AI development? It suggests that future AI models should be designed with a focus on minimizing forgetting while learning new tasks.