EAGLE 3.1 Stops Attention Drift in LLMs with Speculative Decoding

Speculative decoding speeds up large language model inference by using a small, fast draft model to propose several tokens that a large target model verifies in parallel. When the proposals are accepted, several tokens are produced in a single target‑model pass; when a proposal is rejected, the target model’s own token is used instead, so output quality is preserved. In practice, the EAGLE family of algorithms (EAGLE 1, EAGLE 2 and EAGLE 3) has been widely adopted for this purpose. However, users observed that performance drops when the input changes: different chat templates, very long contexts, or unfamiliar system prompts cause the acceptance length to shrink and the output to become unstable.
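The propose-and-verify loop described above can be sketched in a few lines. This is a minimal greedy illustration, not EAGLE's implementation: the function names (`speculative_step`, `toy_draft`, `toy_target`) and the toy models are hypothetical, and a real system verifies all proposals in one batched forward pass instead of a Python loop.

```python
from typing import Callable, List

NextTokenFn = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[int]:
    """Return the tokens committed by one greedy speculative-decoding step."""
    # 1. The draft model proposes k tokens autoregressively.
    proposals: List[int] = []
    for _ in range(k):
        proposals.append(draft_next(context + proposals))

    # 2. The target model checks each proposal (looped here for clarity).
    committed: List[int] = []
    for token in proposals:
        expected = target_next(context + committed)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            committed.append(expected)
            return committed
        committed.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    committed.append(target_next(context + committed))
    return committed


# Toy models (hypothetical): the target counts upward modulo 10;
# the draft agrees except right after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = speculative_step([1], toy_draft, toy_target, k=4)
print(result)
```

In this toy run three draft tokens are accepted and the fourth is replaced by the target's choice, so the committed tokens are exactly what the target model alone would have produced. The number of tokens committed per step is what the article calls the acceptance length.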

Analysis traced the problem to attention drift. As the draft model speculates deeper, its attention shifts away from the original context tokens (sink tokens) toward the tokens it has just generated. This happens for two reasons: the fused input representation becomes dominated by higher‑layer hidden states from the target model, and the hidden‑state magnitude grows unchecked across speculation steps because of the residual path. The draft model therefore receives increasingly unstable inputs, which makes its predictions less reliable.
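The magnitude growth along a residual path can be illustrated with a toy recursion. This sketch is an assumption-laden illustration, not the EAGLE draft model: `toy_residual_step` is a hypothetical stand-in for a draft layer, and the 0.4 mixing factor and the starting vector are arbitrary.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def toy_residual_step(h: List[float]) -> List[float]:
    """Hypothetical draft layer: h plus a transform of h, with no normalization."""
    n = len(h)
    return [h[i] + 0.4 * h[(i + 1) % n] for i in range(n)]


# Feed the output back in as the next input, as a draft model does
# when it speculates several tokens deep.
h = [1.0, 2.0, 3.0, 4.0]
norms = [l2_norm(h)]
for _ in range(6):
    h = toy_residual_step(h)
    norms.append(l2_norm(h))

for depth, value in enumerate(norms):
    print(f"speculation depth {depth}: |h| = {value:.2f}")
```

Because nothing rescales the residual sum, the magnitude rises at every depth in this toy setting, so each step sees an input on a different scale from the previous one.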

EAGLE 3.1 introduces two targeted architectural fixes to stop this drift. First, FC normalization: each target hidden state is normalized before it enters the fully connected (FC) layer. This keeps the magnitude of the hidden states bounded at every step and prevents the runaway growth that confused the draft model. Second, the normalized hidden states are fed back as the input for the next decoding step. This post‑norm design makes the draft model behave like a recursively called module rather than a simple stack of extra layers. Together, these changes stabilize the draft model’s attention across all speculation depths.
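A minimal sketch of the two fixes, under stated assumptions: RMS normalization stands in for whatever normalization EAGLE 3.1 actually uses, and `fuse`, `post_norm_step`, the weights, and the toy hidden states are all hypothetical.

```python
import math
from typing import List


def rms(v: List[float]) -> float:
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_norm(v: List[float], eps: float = 1e-6) -> List[float]:
    """Rescale v so its RMS is approximately 1 (assumed normalization)."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def fuse(hidden_states: List[List[float]],
         weights: List[List[float]]) -> List[float]:
    """Fix 1: normalize each target hidden state, then apply the FC layer."""
    concat = [x for h in hidden_states for x in rms_norm(h)]
    return [sum(w * x for w, x in zip(row, concat)) for row in weights]


def post_norm_step(h: List[float]) -> List[float]:
    """Fix 2: residual update, then normalize; the result is the next input."""
    n = len(h)
    updated = [h[i] + 0.4 * h[(i + 1) % n] for i in range(n)]
    return rms_norm(updated)


# Toy hidden states from three target layers with very different scales.
low, mid, high = [0.1, 0.2], [1.0, 2.0], [50.0, 100.0]
fc_weights = [[0.5, 0.0, 0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5, 0.0, 0.5]]

h = fuse([low, mid, high], fc_weights)
magnitudes = []
for _ in range(8):
    h = post_norm_step(h)
    magnitudes.append(rms(h))

print([round(m, 3) for m in magnitudes])
```

After per-state normalization the high-layer vector no longer outweighs the other two in the fused input, and the post-norm recursion keeps the magnitude near 1 at every depth in this toy setting.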

In long‑context workloads, EAGLE 3.1 achieves up to twice the acceptance length of EAGLE 3. Benchmarks on the Kimi‑K2.6‑NVFP4 model with vLLM, using the SPEED‑Bench coding dataset, show a 2.03× increase in per‑user output throughput at concurrency 1, 1.71× at concurrency 4, and 1.66× at concurrency 16. The update is fully backward compatible: existing EAGLE 3 checkpoints work without change. The feature is already merged into vLLM main and ships in version v0.22.0. Teams can deploy it with a simple speculative‑config entry pointing to the new draft model.
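A speculative-config entry might look like the sketch below, which builds the JSON passed to vLLM's `--speculative-config` option. The values are assumptions, not taken from the article: `"eagle3"` is the method name vLLM uses for EAGLE 3 and is assumed to carry over to EAGLE 3.1, the draft-model path and `<target-model>` are placeholders, and the token count is arbitrary. Check the vLLM documentation for your version before relying on any of them.

```python
import json

# All values below are illustrative assumptions, not settings from the article.
speculative_config = {
    "method": "eagle3",                       # assumed to also cover EAGLE 3.1
    "model": "path/to/eagle-3.1-draft",       # placeholder draft-model path
    "num_speculative_tokens": 3,              # arbitrary speculation depth
}

config_json = json.dumps(speculative_config)

# <target-model> is a placeholder for the target model's name or path.
command = f"vllm serve <target-model> --speculative-config '{config_json}'"
print(command)
```

The same dictionary can be passed as the `speculative_config` argument when constructing a vLLM `LLM` object in Python instead of using the command line.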

For engineers and researchers needing reliable, faster LLM serving under varied prompts and long contexts, EAGLE 3.1 offers a practical, drop‑in solution that restores speculative decoding’s promised speed gains.

#AI #Product #LLM #SpeculativeDecoding #vLLM #Eagle31
