
Solve LLM Long Context Memory Overload with OSCAR 2‑Bit KV Cache

Long-context LLM serving is limited by the GPU memory that the KV cache occupies. During autoregressive decoding the cache grows with context length, batch size, and model depth; at long contexts and large batches it consumes a large fraction of GPU memory, forcing operators to lower the batch size or accept higher latency. Quantizing the KV cache to low precision is the natural fix, but naive 2‑bit quantization fails: outlier channels dominate the scale, most values fall onto one or two of the four available levels, and attention quality degrades sharply. Simple rotations such as the Hadamard transform help at 4‑bit but not at 2‑bit, because they are data‑oblivious and spread error uniformly instead of pushing it into low‑importance directions.
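The outlier problem can be reproduced in a few lines. The sketch below is a toy illustration, not OSCAR's quantizer: it assumes plain min-max quantization of a single group onto four levels, and the channel values (including the outlier magnitude of 50) are synthetic.

```python
def quantize_2bit(group):
    """Min-max (asymmetric) quantization of one group onto 4 levels (codes 0..3)."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((x - lo) / scale) for x in group]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


def mean_abs_error(original, dequant):
    """Average absolute reconstruction error over the given channels."""
    return sum(abs(a - b) for a, b in zip(original, dequant)) / len(original)


# 31 well-behaved channels spread evenly over [-1, 1] (synthetic data).
clean = [-1 + 2 * i / 30 for i in range(31)]
# The same channels plus one outlier channel (hypothetical magnitude).
with_outlier = clean + [50.0]

clean_codes, clean_deq = quantize_2bit(clean)
outlier_codes, outlier_deq = quantize_2bit(with_outlier)

# How many of the 4 levels the ordinary channels actually use.
levels_clean = len(set(clean_codes))
levels_outlier = len(set(outlier_codes[:-1]))

# Reconstruction error on the ordinary channels only.
err_clean = mean_abs_error(clean, clean_deq)
err_outlier = mean_abs_error(clean, outlier_deq[:-1])

print("levels used without outlier:", levels_clean)
print("levels used with outlier:   ", levels_outlier)
print("mean abs error without / with outlier:", err_clean, err_outlier)
```

With the outlier present, the quantization step is set almost entirely by that one channel, so every ordinary channel lands on the same code and its information is lost.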

Together AI’s OSCAR addresses this by deriving rotation matrices from attention statistics rather than from the raw KV distribution. For keys it uses the query covariance; for values it uses the score‑weighted value covariance. The rotation is composed of three factors: an attention‑aware eigenbasis that diagonalizes the error‑weighting matrix, a Walsh‑Hadamard transform that equalizes channel importance, and a permuted bit‑reversal that distributes importance evenly across quantization groups. The result is an INT2 KV cache that retains near‑BF16 accuracy while cutting KV‑cache memory by roughly 8× and speeding up decoding by up to 3× at 100K tokens. Job‑level throughput gains can reach 7× to 8× at large batch sizes.
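Two of the three rotation factors are easy to sketch in isolation. The code below is an illustration under stated assumptions, not OSCAR's implementation: it omits the attention‑aware eigenbasis entirely, uses a textbook orthonormal Walsh‑Hadamard transform, and substitutes a plain bit‑reversal permutation for the permuted bit‑reversal, whose exact form is not given here. The vectors and importance scores are made up.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def bit_reverse_indices(n):
    """Bit-reversal permutation of range(n); n must be a power of two."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def group_sums(values, group_size):
    """Sum of values in each contiguous quantization group."""
    return [sum(values[i:i + group_size]) for i in range(0, len(values), group_size)]


# A synthetic query and a key with one outlier channel.
q = [0.2, -0.5, 0.1, 0.7, -0.3, 0.4, -0.1, 0.6]
k = [50.0, 0.5, -0.3, 0.8, -0.6, 0.1, 0.4, -0.2]

q_rot, k_rot = fwht(q), fwht(k)

# An orthonormal rotation applied to both q and k leaves the attention logit unchanged.
score_before = dot(q, k)
score_after = dot(q_rot, k_rot)

# The outlier's energy is spread across all channels after the transform.
peak_before = max(abs(x) for x in k)
peak_after = max(abs(x) for x in k_rot)

# Channels sorted by (hypothetical) importance, split into groups of 4.
importance = [8, 7, 6, 5, 4, 3, 2, 1]
permuted = [importance[j] for j in bit_reverse_indices(len(importance))]

print("logit before/after rotation:", score_before, score_after)
print("peak |k| before/after:", peak_before, peak_after)
print("group importance before:", group_sums(importance, 4))
print("group importance after: ", group_sums(permuted, 4))
```

Because the transform is orthonormal, quantizing in the rotated basis changes only where the rounding error lands, not the exact attention logits. The bit‑reversal step interleaves high‑ and low‑importance channels so that no single quantization group holds all the important ones.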

OSCAR integrates into SGLang as an INT2 KV‑cache mode that works with paged attention and prefix caching. Users launch the server with --kv-cache-dtype int2 and point it to pre‑computed rotation files (or run a one‑time offline calibration). Only the first 64 and last 256 tokens of each request are kept in BF16, as the sink and recent windows; the remaining tokens are stored as INT2 after OSCAR rotation and clipping. No client‑side changes are required.
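The effect of the BF16 windows on the headline compression figure can be checked with simple arithmetic. This is a back‑of‑the‑envelope sketch: it assumes 16 bits per BF16 element and 2 bits per INT2 element, takes 100K to mean 100,000 tokens, and ignores per‑group scale and zero‑point metadata, which would lower the real ratio somewhat. The function names are illustrative only.

```python
def effective_bits(context_len, sink=64, recent=256, hi_bits=16, lo_bits=2):
    """Average bits per cached element when sink/recent tokens stay in high precision."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    hi_tokens = min(context_len, sink + recent)  # tokens kept in BF16
    lo_tokens = context_len - hi_tokens          # tokens stored as INT2
    return (hi_tokens * hi_bits + lo_tokens * lo_bits) / context_len


def compression_ratio(context_len, hi_bits=16):
    """Memory reduction relative to an all-BF16 cache (metadata ignored)."""
    return hi_bits / effective_bits(context_len, hi_bits=hi_bits)


for n in (1_000, 10_000, 100_000):
    print(n, "tokens:", round(effective_bits(n), 4), "bits/element,",
          round(compression_ratio(n), 2), "x smaller")
```

Under these assumptions the ratio approaches 8× only at long contexts, where the 320 high‑precision tokens are a negligible share of the cache; short requests benefit much less.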

Key takeaways: attention‑aware 2‑bit KV quantization, minimal accuracy loss, large memory and speed gains, compatible paged cache, ready‑to‑use rotations for popular models.

#AI #Product #LLM #MachineLearning #Inference #Optimization


Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.