Solve LLM Long Context Memory Overload with OSCAR 2‑Bit KV Cache

Long-context LLM serving is limited by the GPU memory that the KV cache consumes. During autoregressive decoding the cache grows with context length, batch size, and model depth; at long contexts and large batches it takes up a large fraction of memory, forcing users to lower the batch size or accept higher latency. Quantizing the KV cache to low precision seems like the natural fix, but naive 2‑bit quantization fails: outlier channels dominate the scale, most values fall into one or two levels, and attention quality collapses. Simple rotations such as Hadamard help at 4‑bit but not at 2‑bit, because they are data‑oblivious and spread error uniformly instead of pushing it into low‑importance directions.
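The outlier problem described above can be reproduced in a few lines. The sketch below is an illustration only, not OSCAR code: it assumes plain per-group min-max quantization to four levels, and the channel values and the helper names (quantize_2bit, max_err) are made up for the example.

```python
def quantize_2bit(values):
    """Min-max quantize one group to 4 levels; return (levels, dequantized)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 4 levels -> 3 steps; guard against zero range
    levels = [round((v - lo) / scale) for v in values]
    dequant = [lo + q * scale for q in levels]
    return levels, dequant


def max_err(original, dequantized):
    """Largest absolute reconstruction error over paired values."""
    return max(abs(a - b) for a, b in zip(original, dequantized))


# Hypothetical channel values: seven ordinary channels, then one outlier.
small = [0.9, -1.0, 0.4, -0.3, 1.0, -0.8, 0.2]
with_outlier = small + [50.0]

levels_clean, deq_clean = quantize_2bit(small)
levels_out, deq_out = quantize_2bit(with_outlier)

# Error on the seven ordinary channels, without and with the outlier in the group.
err_clean = max_err(small, deq_clean)
err_out = max_err(small, deq_out[:len(small)])

print("no outlier  :", levels_clean, err_clean)
print("with outlier:", levels_out, err_out)
```

Without the outlier the group uses all four levels. With it, the scale stretches to cover the outlier and every ordinary channel lands on the same level, so their differences are lost entirely.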

Together AI’s OSCAR solves this by deriving rotation matrices from attention statistics rather than from the raw KV distribution. For keys it uses the query covariance; for values it uses the score‑weighted value covariance. The rotation is composed of three factors: an attention‑aware eigenbasis that diagonalizes the error‑weighting matrix, a Walsh‑Hadamard transform that equalizes channel importance, and a permuted‑bit‑reversal that distributes importance evenly across quantization groups. The result is an INT2 KV cache that retains near‑BF16 accuracy while cutting KV‑cache memory by roughly 8× and boosting decode speed by up to 3× at 100K tokens. Job‑level throughput gains can reach 7‑8× at large batch sizes.
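Why query statistics matter for keys can be shown with a small sketch. It assumes the damage from a key error e is measured by the attention-logit error q·e, whose expected square is eᵀ Σ_q e with Σ_q = E[q qᵀ]. The query distribution, dimensions, and function names below are hypothetical and chosen only for illustration; this is not OSCAR's calibration code.

```python
import random

random.seed(0)
DIM = 4
# Hypothetical per-channel query standard deviations (channel 0 large, channel 3 tiny).
STD = [3.0, 1.0, 0.5, 0.1]
queries = [[random.gauss(0.0, s) for s in STD] for _ in range(5000)]


def second_moment(rows):
    """Empirical E[q q^T] as a DIM x DIM nested list."""
    n = len(rows)
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(DIM)]
            for i in range(DIM)]


def weighted_error(err, cov):
    """Expected squared logit error e^T Sigma_q e for a key error vector e."""
    return sum(err[i] * cov[i][j] * err[j]
               for i in range(DIM) for j in range(DIM))


sigma_q = second_moment(queries)

# Two key errors with the same norm, pointing in different directions.
err_high = [0.1, 0.0, 0.0, 0.0]  # along the high-variance query channel
err_low = [0.0, 0.0, 0.0, 0.1]   # along the low-variance query channel

cost_high = weighted_error(err_high, sigma_q)
cost_low = weighted_error(err_low, sigma_q)
print("cost along high-variance channel:", cost_high)
print("cost along low-variance channel :", cost_low)
```

Both errors have the same size, yet the one aligned with the high-variance query channel costs far more in the logits. A data-oblivious rotation cannot tell these directions apart, while a rotation built from Σ_q can steer quantization error toward the cheap ones.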

OSCAR integrates into SGLang as an INT2 KV‑cache mode that works with paged attention and prefix caching. Users launch the server with --kv-cache-dtype int2 and point it to pre‑computed rotation files (or run a one‑time offline calibration). Only the first 64 tokens of each request (the sink window) and the last 256 tokens (the recent window) stay in BF16; everything in between is stored as INT2 after OSCAR rotation and clipping. No client‑side changes are required.
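The effect of the BF16 windows on memory can be estimated with simple arithmetic. The sketch below uses the 64-token and 256-token window sizes from the text; everything else is an assumption made for illustration: it counts payload bits only (no scales, zero points, or rotation matrices), it treats requests shorter than both windows as fully BF16, and the function names are hypothetical.

```python
SINK_TOKENS = 64     # first tokens kept in BF16 (from the text)
RECENT_TOKENS = 256  # last tokens kept in BF16 (from the text)


def split_tokens(seq_len):
    """Return (bf16_count, int2_count) for one request of seq_len tokens."""
    # Assumption: when the two windows overlap, every token stays in BF16.
    bf16 = min(seq_len, SINK_TOKENS + RECENT_TOKENS)
    return bf16, seq_len - bf16


def compression_ratio(seq_len):
    """All-BF16 size divided by mixed BF16/INT2 size, payload bits only."""
    bf16, int2 = split_tokens(seq_len)
    return (16 * seq_len) / (16 * bf16 + 2 * int2)


for n in (320, 4_096, 100_000):
    print(n, split_tokens(n), round(compression_ratio(n), 2))
```

Under these assumptions the ratio starts at 1× for requests that fit inside the windows and approaches 8× as context grows, since 320 BF16 tokens become negligible next to tens of thousands of INT2 tokens.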

Key takeaways: attention‑aware 2‑bit KV quantization, near‑BF16 accuracy, roughly 8× less KV‑cache memory and up to 3× faster decoding, compatibility with paged attention and prefix caching, ready‑to‑use rotations for popular models.

#AI #Product #LLM #MachineLearning #Inference #Optimization