Researchers propose Pairwise Proximal Policy Optimization (P3O), a new approach to Reinforcement Learning with Human Feedback (RLHF) that addresses the inconsistency between the reward learning and RL fine-tuning stages. By using a comparative training process, P3O improves alignment with human values and outperforms existing methods in terms of the KL-Reward frontier and GPT-4 win-rate. The paper provides a detailed explanation of the P3O algorithm and evaluates its performance on text generation tasks.
Rethinking the Role of PPO in RLHF
TL;DR:
In Reinforcement Learning with Human Feedback (RLHF), there is a discrepancy between the reward learning phase and the RL fine-tuning phase. We propose Pairwise Proximal Policy Optimization (P3O), which harmonizes the two phases and resolves this discrepancy.
Background
Large Language Models (LLMs) like GPT-4 and Claude-2 now power virtual assistants that can respond to complex queries and generate code or poetry. RLHF aims to align these models with human values and to eliminate unintended behaviors, which often stem from exposure to low-quality data during pretraining.
The RLHF pipeline consists of three stages:
1. Supervised Fine-Tuning Stage: The model learns to respond to human queries by imitating high-quality demonstrations.
2. Reward Modeling Stage: The model generates pairs of responses to the same prompt; human labellers indicate which response they prefer, and these comparisons are used to train a reward model.
3. RL Fine-Tuning Stage: The model is fine-tuned with an RL algorithm to maximize the learned reward while limiting deviation from the initial policy (both objectives are sketched below).
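To make stages 2 and 3 concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) reward-model loss and of the KL-regularized reward that the RL stage maximizes. The function and variable names are illustrative, not taken from the paper, and the sketch assumes PyTorch.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss used in the reward modeling stage.

    score_chosen / score_rejected: scalar reward-model outputs for the preferred
    and dispreferred response to the same prompt, shape (batch,).
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

def kl_regularized_reward(reward: torch.Tensor,
                          logprob_policy: torch.Tensor,
                          logprob_ref: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Per-sample objective maximized in the RL fine-tuning stage:
    r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)),
    i.e. the learned reward minus a KL penalty toward the initial (SFT) policy.
    """
    return reward - beta * (logprob_policy - logprob_ref)
```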
However, the reward learned from pairwise comparisons is only identified up to a prompt-dependent shift: adding any function of the prompt alone leaves the preferences unchanged, yet it can affect the updates of standard RL algorithms and mislead optimization. To address this, we introduce P3O, an RL algorithm that learns in a comparative manner and is therefore unaffected by such shifts.
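Concretely, under the Bradley-Terry preference model commonly used for the reward modeling stage (and sketched above), any prompt-dependent shift $\delta(x)$ cancels out of the preference probability, so the comparison data cannot pin the reward down beyond its within-prompt differences:

$$
P(y_1 \succ y_2 \mid x) \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big) \;=\; \sigma\big([r(x, y_1) + \delta(x)] - [r(x, y_2) + \delta(x)]\big).
$$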
Derivation of P3O
P3O is derived from the vanilla policy gradient (VPG) algorithm. Instead of an absolute reward, it uses the reward difference between two responses to the same prompt as its learning signal, which sidesteps the reward-translation issue described above. We also incorporate importance sampling and clipping techniques, in the spirit of PPO, to improve performance; a sketch of the resulting pairwise objective follows.
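Below is a minimal sketch of what such a pairwise clipped surrogate can look like: two responses sampled for the same prompt, PPO-style importance ratios against the sampling policy, and the reward difference playing the role of the advantage. The names and the exact clipping scheme are illustrative assumptions; the paper's precise objective may differ.

```python
import torch

def pairwise_clipped_loss(logp_new_1, logp_old_1, logp_new_2, logp_old_2,
                          reward_1, reward_2, clip_eps: float = 0.2):
    """Illustrative pairwise clipped surrogate (not the paper's exact form).

    logp_new_i / logp_old_i: summed log-probabilities of response i under the
    current and the sampling (old) policy, shape (batch,).
    reward_i: scalar reward of response i, shape (batch,).
    """
    # The reward difference acts as the advantage; it is antisymmetric in the
    # two responses, so any prompt-dependent reward shift cancels out.
    adv = reward_1 - reward_2

    ratio_1 = torch.exp(logp_new_1 - logp_old_1)
    ratio_2 = torch.exp(logp_new_2 - logp_old_2)

    # PPO-style clipping applied separately to each response's ratio.
    surr_1 = torch.minimum(ratio_1 * adv,
                           torch.clamp(ratio_1, 1 - clip_eps, 1 + clip_eps) * adv)
    surr_2 = torch.minimum(ratio_2 * (-adv),
                           torch.clamp(ratio_2, 1 - clip_eps, 1 + clip_eps) * (-adv))

    # The surrogate is maximized, so return its negation as a loss.
    return -0.5 * (surr_1 + surr_2).mean()
```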
Evaluation
We evaluate P3O on text generation tasks such as summarization and question answering. P3O outperforms baselines such as PPO and DPO on the KL-Reward frontier, achieving higher reward at a comparable KL-divergence from the reference policy, and it also obtains higher win rates against these baselines under GPT-4 evaluation.
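For reference, the KL term on this frontier is typically estimated from sampled responses using their log-probabilities under the fine-tuned and reference policies; a minimal sketch with illustrative names, assuming summed per-response log-probabilities are available:

```python
import torch

def estimate_kl_to_reference(logprob_policy: torch.Tensor,
                             logprob_ref: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of KL(pi || pi_ref) over responses sampled from pi.

    logprob_policy / logprob_ref: summed log-probabilities of each sampled
    response under the fine-tuned and reference policies, shape (batch,).
    """
    return (logprob_policy - logprob_ref).mean()
```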
Conclusion
P3O provides a practical solution for aligning large language models with human preferences through RL. It improves the KL-Reward trade-off and aligns better with human preferences, as reflected in its GPT-4 win rates.