NVIDIA Polar Fixes Token Issues in GRPO for Codex, Claude, Qwen

Reinforcement learning for language agents is becoming more complex as agents handle multi‑turn tool use, long contexts and multi‑agent orchestration. The biggest engineering hurdle is hooking existing agent harnesses into RL pipelines without changing how those harnesses work. Traditional approaches require rewriting the harness to fit a framework‑owned environment API (env.init, env.step, env.reset). Every new harness needs new integration code, and that process can lose execution details that are crucial at evaluation time.

Polar solves this by placing a proxy at the model API boundary instead of inside the harness. The proxy does four things for each incoming model request: detects the provider (Anthropic, OpenAI Chat, OpenAI Responses or Google), normalizes the request to the shape used by the local inference server, captures token‑level data (messages, token IDs, log probabilities, finish reason) and returns the response in the provider‑specific shape the harness expects. For streaming, it builds a synthetic provider‑shaped stream while still capturing the full token trace. The only change required to an existing harness is pointing its model base URL at the Polar gateway.

Polar’s architecture splits work between a rollout server that creates sessions and gateway nodes that run the harness, build trajectories, evaluate output and tear down. Workers handle initialization, execution and post‑run stages independently, allowing CPU‑heavy prep to happen off the critical path. If a harness times out after model calls are captured, the gateway still enters post‑run so partial traces can be recovered.

Trajectory reconstruction can be done per request or with prefix merging. Prefix merging joins completions that share a strict token‑prefix relation, preserving multi‑turn context while marking only sampled assistant tokens as trainable. Ablation shows prefix merging reduces trainer updates from 1,185 to 218 and cuts wall‑clock time by a factor of 5.39×, raising average rollout GPU utilization from 20% to 88%.

Experiments on Qwen3.5‑4B with GRPO across Codex, Claude Code, Qwen Code and Pi harnesses give SWE‑Bench verified gains of up to 22.6 points (Codex) and consistent improvements on the others. Polar also works as an offline SFT data generator: on an 8×H100 server it produced 504 accepted trajectories from 1,638 attempts (~30.8% acceptance) in roughly 64 GPU‑hours.

No harness code changes, provider‑agnostic support, faster training, recoverable partial traces and dual use for online RL and offline SFT make Polar a practical solution for teams wanting to RL‑train language agents without disrupting their existing tooling.

#AI #Product #MachineLearning #ReinforcementLearning #LLM #DevOps