Rethinking the Role of PPO in RLHF

Researchers propose Pairwise Proximal Policy Optimization (P3O), a new approach to Reinforcement Learning with Human Feedback (RLHF) that addresses the inconsistency between the reward learning and RL fine-tuning stages. By using a comparative training process, P3O improves alignment with human values and outperforms existing methods in terms of the KL-Reward frontier and GPT-4 win-rate. The paper provides a detailed explanation of the P3O algorithm and evaluates its performance on text generation tasks.

TL;DR:

In Reinforcement Learning with Human Feedback (RLHF), there is a discrepancy between the reward learning phase and the RL fine-tuning phase. We propose Pairwise Proximal Policy Optimization (P3O), which harmonizes these two stages by training on reward differences rather than absolute rewards.

Background

Large Language Models (LLMs) such as GPT-4 and Claude-2 power virtual assistants that can respond to complex queries and generate code or poetry. RLHF aims to align these models with human values and to suppress unintended behaviors that can stem from low-quality data seen during pretraining.

The RLHF pipeline consists of three stages:
1. Supervised Fine-Tuning Stage: The model learns to respond to human queries by mimicking curated demonstrations.
2. Reward Modeling Stage: The model generates response pairs, and human labellers' comparisons between them are used to train a reward model (a minimal loss sketch follows this list).
3. RL Fine-Tuning Stage: The model is fine-tuned using an RL algorithm to maximize the reward while limiting deviation from the initial policy.
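To make the reward-modeling stage concrete, here is a minimal sketch of a standard Bradley-Terry-style pairwise loss for training a reward model on human comparisons. The function name and toy tensors are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred response above that of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with hypothetical reward-model outputs for a batch of pairs
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.5, 1.0])
print(reward_model_loss(r_chosen, r_rejected))
```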

However, the reward learned from comparisons is not unique: it is only identified up to a prompt-dependent shift, so optimizing it directly with a value-based algorithm can be misled by this arbitrary choice. To address this, we introduce P3O, an RL algorithm that learns in a comparative manner.
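To see why this matters, note that adding a prompt-dependent constant to the reward leaves every pairwise comparison, and hence the reward-model training signal, unchanged, yet it changes the absolute values that a value-based method like PPO optimizes against. A minimal numeric sketch (toy numbers, not from the paper):

```python
import torch

# Scalar rewards for two candidate responses to the same prompt
r = torch.tensor([1.5, 0.2])

# A prompt-dependent shift: equivalent under pairwise comparisons
shift = 10.0
r_shifted = r + shift

# Preference probability P(y1 preferred over y2) = sigmoid(r1 - r2) is unchanged
print(torch.sigmoid(r[0] - r[1]))                   # same value
print(torch.sigmoid(r_shifted[0] - r_shifted[1]))   # same value

# ...but the absolute rewards a value-based method sees are very different
print(r, r_shifted)
```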

Derivation of P3O

P3O is derived from the vanilla policy gradient (VPG) algorithm. It operates on the reward difference between two responses to the same prompt, so any prompt-dependent translation of the reward cancels out. We further incorporate importance sampling and clipping, in the spirit of PPO, to improve stability and performance; a simplified sketch follows.
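The snippet below is a simplified, sequence-level sketch in the spirit of the pairwise clipped objective described above. The names are hypothetical, per-token details and the paper's specific clipping variants are omitted, so treat it as an illustration under those assumptions rather than the authors' implementation.

```python
import torch

def p3o_pairwise_loss(logp1, logp2, old_logp1, old_logp2,
                      reward1, reward2, clip_eps=0.2):
    """Pairwise clipped surrogate (simplified sketch).

    logp1/logp2: sequence-level log-probs of the two responses under the
    current policy; old_logp1/old_logp2: under the policy that sampled them;
    reward1/reward2: scalar rewards for each response.
    """
    # Only the reward *difference* between the two responses matters,
    # so any prompt-dependent reward shift cancels out here.
    adv = reward1 - reward2

    # Importance ratios for each response, as in PPO
    ratio1 = torch.exp(logp1 - old_logp1)
    ratio2 = torch.exp(logp2 - old_logp2)

    # Clipped pessimistic surrogate applied to each side of the pair:
    # push up the higher-reward response, push down the lower-reward one.
    surr1 = torch.minimum(ratio1 * adv,
                          torch.clamp(ratio1, 1 - clip_eps, 1 + clip_eps) * adv)
    surr2 = torch.minimum(ratio2 * (-adv),
                          torch.clamp(ratio2, 1 - clip_eps, 1 + clip_eps) * (-adv))
    return -(surr1 + surr2).mean() / 2
```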

Evaluation

We evaluate P3O on text generation tasks such as summarization and question answering. P3O outperforms strong baselines such as PPO and DPO on the KL-Reward frontier, achieving higher reward at a comparable KL-divergence from the reference policy, and it attains higher win rates against these baselines under GPT-4 evaluation.
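As an illustration of how one point on a KL-Reward frontier might be computed, here is a small hypothetical helper that pairs the mean reward of sampled responses with a Monte-Carlo estimate of the KL-divergence from the reference policy. It is assumed scaffolding, not code from the paper.

```python
import torch

def kl_reward_point(rewards, logp_policy, logp_ref):
    """One point for a KL-Reward frontier plot (illustrative helper).

    rewards: scalar reward per sampled response.
    logp_policy / logp_ref: sequence-level log-probabilities of those
    responses under the fine-tuned policy and the reference policy.
    """
    mean_reward = rewards.mean()
    # Monte-Carlo estimate of KL(pi || pi_ref) using the policy's own samples
    kl_estimate = (logp_policy - logp_ref).mean()
    return mean_reward.item(), kl_estimate.item()
```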

Conclusion

P3O provides a practical solution for aligning large language models with human preferences through RL. It improves the KL-Reward trade-off and aligns better with human preferences as judged by GPT-4. If you want to evolve your company with AI, consider rethinking the role of PPO in RLHF and exploring AI solutions that can redefine the way you work.

List of Useful Links:

AI Products for Business or Try Custom Development

AI Sales Bot

Meet the AI Sales Bot, your 24/7 teammate! It engages customers in natural language across all channels and learns from your materials, a step toward efficient, enriched customer interactions and sales.

AI Document Assistant

Unlock insights and drive decisions with our AI Insights Suite. Indexing your documents and data, it provides smart, AI-driven decision support, enhancing your productivity and decision-making.

AI Customer Support

Upgrade your support with our AI Assistant, which reduces response times and personalizes interactions by analyzing documents and past engagements. Give your team a boost and raise customer satisfaction.

AI Scrum Bot

Enhance agile management with our AI Scrum Bot: it helps organize retrospectives, answers queries, and boosts collaboration and efficiency in your scrum processes.