Boost AI Speed: NVIDIA Gated DeltaNet‑2 Solves Attention Bottleneck

Linear attention models compress the unbounded key‑value cache into a fixed‑size recurrent state, which gives constant‑memory decoding but makes that compressed memory hard to edit. In earlier delta‑rule approaches, a single scalar step size βₜ controlled both how much old content to erase and how much new content to write. Tying these two decisions together limits the model’s ability to forget irrelevant information selectively while still committing useful updates, especially when the key and value spaces have different structures.
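To make the coupling concrete, here is a minimal NumPy sketch of a scalar-gated delta-rule step. It assumes the common form Sₜ = (I − βₜ kₜkₜᵀ) Sₜ₋₁ + βₜ kₜvₜᵀ with a state of shape (d_k, d_v); the function name and the toy sizes are illustrative and not taken from any released code.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One scalar-gated delta-rule update on a state S of shape (d_k, d_v)."""
    # Erase: subtract what the state currently returns for key k, scaled by beta.
    erased = S - beta * np.outer(k, k @ S)
    # Write: add the new value along the same key, scaled by the same beta.
    return erased + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                      # toy sizes, chosen only for illustration
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)               # unit-norm key keeps the arithmetic easy to read
v = rng.standard_normal(d_v)

old_read = k @ S                     # what the memory returned for k before the update
new_read = k @ delta_rule_step(S, k, v, beta=0.3)
# With a unit-norm key the read-out is a blend: (1 - beta) * old + beta * new.
print(np.allclose(new_read, 0.7 * old_read + 0.3 * v))
```

With one scalar there is a single knob: raising beta to write more of the new value necessarily erases more of the old one, on every channel at once.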

Gated DeltaNet‑2 solves this by splitting the scalar gate into two independent, channel‑wise gates. An erase gate bₜ operates on the key axis, deciding which dimensions of the decayed state are read and removed. A write gate wₜ operates on the value axis, deciding which dimensions of the incoming value are committed. Both gates are produced by sigmoid projections of the token representation, and the update applies a channel‑wise decay Dₜ before the active edit. The recurrence can be written as

Sₜ = (I − kₜ (bₜ ⊙ kₜ)ᵀ) Dₜ Sₜ₋₁ + kₜ (wₜ ⊙ vₜ)ᵀ
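The recurrence above can be sketched as a single NumPy step. This is a reading of the formula under stated assumptions, not the released implementation: the state has shape (d_k, d_v), Dₜ is a diagonal matrix passed as a vector d, and the projection weights W_b and W_w are hypothetical and random. How the decay is actually parameterised is not covered here, so d is simply sampled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gdn2_step(S, k, v, b, w, d):
    """Apply one step of the recurrence.

    S: state, shape (d_k, d_v)
    k, b, d: key, erase gate, decay diagonal, each of shape (d_k,)
    v, w: value and write gate, each of shape (d_v,)
    """
    S_dec = d[:, None] * S                  # D_t S_{t-1}: scale each key row
    read = (b * k) @ S_dec                  # (b_t * k_t)^T D_t S_{t-1}, shape (d_v,)
    return S_dec - np.outer(k, read) + np.outer(k, w * v)

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 3                 # toy sizes for illustration
x = rng.standard_normal(d_model)            # token representation
# Hypothetical projection weights, random here purely for illustration.
W_b = rng.standard_normal((d_k, d_model))
W_w = rng.standard_normal((d_v, d_model))
b = sigmoid(W_b @ x)                        # erase gate on the key axis
w = sigmoid(W_w @ x)                        # write gate on the value axis

S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
d = rng.uniform(0.9, 1.0, size=d_k)         # assumed decay values in [0.9, 1)

S_next = gdn2_step(S, k, v, b, w, d)
print(S_next.shape)                         # (4, 3)
```

The erase term is computed as a rank-1 correction rather than by building the d_k × d_k matrix, and the state keeps its shape no matter how many tokens have been processed.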

When both gates collapse to the same scalar, the update recovers KDA; when the decay also collapses to a scalar, it recovers Gated DeltaNet. The new formulation therefore strictly generalises both.
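These two reductions can be checked numerically. The sketch below writes the recurrence in literal matrix form and compares it with the two special cases as this post describes them: channel-wise decay with a scalar step size, and scalar decay with a scalar step size. The names kda_form and gdn_form are labels for those reduced formulas; the exact parameterisation in the original KDA and Gated DeltaNet papers may differ in detail.

```python
import numpy as np

def gdn2_step(S, k, v, b, w, d):
    """Literal matrix form: (I - k (b*k)^T) diag(d) S + k (w*v)^T."""
    I = np.eye(len(k))
    return (I - np.outer(k, b * k)) @ (d[:, None] * S) + np.outer(k, w * v)

rng = np.random.default_rng(1)
d_k, d_v = 4, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
d = rng.uniform(0.5, 1.0, size=d_k)    # channel-wise decay
beta, alpha = 0.6, 0.9                 # scalar step size and scalar decay
I = np.eye(d_k)

# Both gates tied to one scalar: channel-wise decay plus a scalar delta rule.
tied = gdn2_step(S, k, v, np.full(d_k, beta), np.full(d_v, beta), d)
kda_form = (I - beta * np.outer(k, k)) @ np.diag(d) @ S + beta * np.outer(k, v)

# Decay tied to one scalar as well: scalar decay plus a scalar delta rule.
tied_decay = gdn2_step(S, k, v, np.full(d_k, beta), np.full(d_v, beta),
                       np.full(d_k, alpha))
gdn_form = alpha * (I - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

print(np.allclose(tied, kda_form), np.allclose(tied_decay, gdn_form))
```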

Training remains efficient because the recurrence admits a chunkwise WY form; channel‑wise decay is absorbed into asymmetric erase factors and a gate‑aware backward pass is fused in Triton kernels. The model uses the same recurrent state size as baselines, so any performance gain comes from the richer update rule, not from extra memory.
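The sketch below is not the WY form and not the Triton kernel. It only illustrates the property that chunkwise training relies on: each step is affine in the state, so a whole chunk of tokens can be folded into a pair (P, H) with S_end = P · S_start + H. The function names and toy sizes are illustrative assumptions, and a real chunkwise kernel would avoid materialising the dense matrix P.

```python
import numpy as np

def step(S, k, v, b, w, d):
    """One recurrent update: (I - k (b*k)^T) diag(d) S + k (w*v)^T."""
    S_dec = d[:, None] * S
    return S_dec - np.outer(k, (b * k) @ S_dec) + np.outer(k, w * v)

def chunk_summary(ks, vs, bs, ws, ds):
    """Fold a chunk of tokens into (P, H) so that S_end = P @ S_start + H."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    P, H = np.eye(d_k), np.zeros((d_k, d_v))
    for k, v, b, w, d in zip(ks, vs, bs, ws, ds):
        # Transition matrix (I - k (b*k)^T) diag(d): diag on the right scales columns.
        A = (np.eye(d_k) - np.outer(k, b * k)) * d[None, :]
        P = A @ P
        H = A @ H + np.outer(k, w * v)
    return P, H

rng = np.random.default_rng(2)
T, d_k, d_v = 6, 4, 3                       # chunk length and toy state sizes
ks = rng.standard_normal((T, d_k))
vs = rng.standard_normal((T, d_v))
bs = rng.uniform(size=(T, d_k))             # erase gates in [0, 1)
ws = rng.uniform(size=(T, d_v))             # write gates in [0, 1)
ds = rng.uniform(0.8, 1.0, size=(T, d_k))   # assumed decay values
S0 = rng.standard_normal((d_k, d_v))

# Sequential reference: one token at a time.
S_seq = S0
for t in range(T):
    S_seq = step(S_seq, ks[t], vs[t], bs[t], ws[t], ds[t])

# Chunk form: summarise the chunk once, then apply it to the incoming state.
P, H = chunk_summary(ks, vs, bs, ws, ds)
print(np.allclose(S_seq, P @ S0 + H))
```

Because (P, H) does not depend on the incoming state, the summaries of different chunks can be computed independently and only the small state hand-off between chunks is sequential.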

At 1.3 B parameters trained on 100 B FineWeb‑Edu tokens, Gated DeltaNet‑2 outperforms Mamba‑2, Gated DeltaNet, KDA, and Mamba‑3 on language modeling, commonsense reasoning, and real‑world retrieval benchmarks. The largest gains over KDA appear on long‑context retrieval tasks, which suggests that decoupling erase and write yields more faithful memory editing without scrambling existing associations.

#AI #Product #MachineLearning #DeepLearning #NLP #LinearAttention