Boost AI Speed: NVIDIA Gated DeltaNet‑2 Solves Attention Bottleneck

Linear attention models compress the unbounded key‑value cache into a fixed‑size recurrent state, which gives constant‑memory decoding but makes that compressed memory hard to edit. In earlier delta‑rule approaches, a single scalar step size βₜ controlled both how much old content to erase and how much new content to write. Tying these two decisions together limits the model’s ability to forget irrelevant information selectively while still committing useful updates, especially when the key and value spaces have different structures.
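To make the coupling concrete, here is a minimal NumPy sketch of a scalar‑gated delta‑rule step. It assumes the common textbook form S ← (I − β k kᵀ) S + β k vᵀ with a (d_k, d_v) state and a unit‑norm key; the function name and the toy dimensions are illustrative, not taken from any particular codebase.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One scalar-gated delta-rule update of a (d_k, d_v) state S."""
    # Erase: subtract beta times whatever the state currently returns for key k.
    erased = S - beta * np.outer(k, k @ S)
    # Write: add beta times the new key-value association.
    return erased + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                      # toy sizes, chosen for illustration
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)               # unit-norm key (an assumption of this sketch)
v_old, v_new = rng.normal(size=d_v), rng.normal(size=d_v)

# Store v_old under k, then overwrite it with v_new at two different step sizes.
S_old = delta_rule_step(np.zeros((d_k, d_v)), k, v_old, beta=1.0)
full = k @ delta_rule_step(S_old, k, v_new, beta=1.0)   # reads back v_new
half = k @ delta_rule_step(S_old, k, v_new, beta=0.5)   # reads back an even blend
```

Because the same β scales both terms, lowering it to keep more of the old value necessarily writes less of the new one; there is no setting that erases fully while writing only some value channels.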

Gated DeltaNet‑2 addresses this by replacing the single scalar gate with two independent, channel‑wise gates. An erase gate bₜ operates on the key axis, deciding which dimensions of the decayed state are read and removed. A write gate wₜ operates on the value axis, deciding which dimensions of the incoming value are committed. Both gates are produced by sigmoid projections of the token representation, and the update applies a channel‑wise decay Dₜ before the active edit. With ⊙ denoting element‑wise multiplication, the recurrence can be written as:

Sₜ = (I − kₜ (bₜ ⊙ kₜ)ᵀ) Dₜ Sₜ₋₁ + kₜ (wₜ ⊙ vₜ)ᵀ
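The recurrence can be sketched as a sequential decoding loop in NumPy. This is a reference‑style illustration only, not the fused training kernel: the projection matrices (`W_k`, `W_v`, `W_b`, `W_w`, `W_d`), the sigmoid parameterisation of the decay, the key normalisation, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_delta2_step(S, k, v, b, w, d):
    """S: (d_k, d_v) state; k, b, d: (d_k,) vectors; v, w: (d_v,) vectors."""
    decayed = d[:, None] * S            # D_t S_{t-1}, with D_t = diag(d)
    read = (b * k) @ decayed            # (b ⊙ k)^T D_t S_{t-1}, shape (d_v,)
    erased = decayed - np.outer(k, read)
    return erased + np.outer(k, w * v)  # commit the write-gated value

rng = np.random.default_rng(0)
d_model, d_k, d_v, T = 8, 4, 3, 5       # toy sizes, chosen for illustration
# Hypothetical projection weights; a trained model would learn these.
W_k, W_b, W_d = (rng.normal(size=(d_model, d_k)) for _ in range(3))
W_v, W_w = (rng.normal(size=(d_model, d_v)) for _ in range(2))

S = np.zeros((d_k, d_v))                # state size is fixed, independent of T
for x in rng.normal(size=(T, d_model)):
    k = x @ W_k
    k /= np.linalg.norm(k)              # assumed key normalisation
    v = x @ W_v
    b = sigmoid(x @ W_b)                # erase gate on the key axis
    w = sigmoid(x @ W_w)                # write gate on the value axis
    d = sigmoid(x @ W_d)                # assumed form of the channel-wise decay
    S = gated_delta2_step(S, k, v, b, w, d)
```

The step never materialises the d_k × d_k erase matrix: it reads the gated key against the decayed state once and subtracts a rank‑one correction, which gives the same result as the matrix form.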

When both gates collapse to the same scalar, the update recovers KDA (Kimi Delta Attention); when the channel‑wise decay also collapses to a scalar, it recovers Gated DeltaNet. Both earlier methods are therefore special cases of the new formulation.
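These reductions can be checked numerically. The sketch below writes the recurrence in its literal matrix form and compares it with a KDA‑style step (channel‑wise decay, scalar β) and a Gated DeltaNet‑style step (scalar decay α, scalar β). The two reference functions are written here from the description above, with illustrative names and toy dimensions, and are not the official implementations.

```python
import numpy as np

def gdn2_step(S, k, v, b, w, d):
    """Literal form: (I - k (b ⊙ k)^T) diag(d) S + k (w ⊙ v)^T."""
    erase = np.eye(len(k)) - np.outer(k, b * k)
    return erase @ (np.diag(d) @ S) + np.outer(k, w * v)

def kda_style_step(S, k, v, beta, d):
    """Assumed KDA-style update: channel-wise decay d, one scalar beta."""
    erase = np.eye(len(k)) - beta * np.outer(k, k)
    return erase @ (d[:, None] * S) + beta * np.outer(k, v)

def gated_deltanet_style_step(S, k, v, beta, alpha):
    """Assumed Gated DeltaNet-style update: scalar decay alpha, scalar beta."""
    erase = np.eye(len(k)) - beta * np.outer(k, k)
    return alpha * (erase @ S) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                          # toy sizes, chosen for illustration
S = rng.normal(size=(d_k, d_v))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
d = rng.uniform(0.5, 1.0, size=d_k)      # arbitrary channel-wise decay
beta, alpha = 0.7, 0.9                   # arbitrary scalars

# Both gates equal to the same scalar: matches the KDA-style step.
matches_kda = np.allclose(
    gdn2_step(S, k, v, np.full(d_k, beta), np.full(d_v, beta), d),
    kda_style_step(S, k, v, beta, d),
)
# Decay also collapsed to a scalar: matches the Gated DeltaNet-style step.
matches_gdn = np.allclose(
    gdn2_step(S, k, v, np.full(d_k, beta), np.full(d_v, beta), np.full(d_k, alpha)),
    gated_deltanet_style_step(S, k, v, beta, alpha),
)
```

Both flags come out true because a constant gate vector β·1 factors out of each outer product, turning k (bₜ ⊙ kₜ)ᵀ into β k kᵀ and k (wₜ ⊙ vₜ)ᵀ into β k vᵀ.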

Training remains efficient because the recurrence admits a chunkwise WY form; channel‑wise decay is absorbed into asymmetric erase factors and a gate‑aware backward pass is fused in Triton kernels. The model uses the same recurrent state size as baselines, so any performance gain comes from the richer update rule, not from extra memory.

At the 1.3B‑parameter scale, trained on 100B FineWeb‑Edu tokens, Gated DeltaNet‑2 outperforms Mamba‑2, Gated DeltaNet, KDA, and Mamba‑3 on language modeling, commonsense reasoning, and real‑world retrieval benchmarks. The largest gains appear on long‑context retrieval tasks, where it improves substantially over KDA. This suggests that decoupling erase and write lets the model edit its memory more faithfully, without scrambling existing associations.

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.