April 25, 2026 AI News Digest: Breakthroughs in Long-Context Models and Resilient AI Training
DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts
DeepSeek AI has released preview versions of the DeepSeek-V4 series, a pair of Mixture-of-Experts (MoE) language models designed to make one-million-token context windows practical and affordable. DeepSeek-V4-Pro has 1.6T total parameters with 49B activated per token, while DeepSeek-V4-Flash has 284B total parameters with 13B activated per token. Both models natively support context lengths of one million tokens.
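To put those parameter counts in perspective, only a small fraction of each model's weights fires per token. The sketch below simply divides the figures quoted above; nothing in it comes from DeepSeek's code.

```python
# Back-of-the-envelope MoE sparsity check using the reported figures.

models = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),    # (total params, activated per token)
    "DeepSeek-V4-Flash": (284e9, 13e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")

# DeepSeek-V4-Pro:   3.1% of parameters active per token
# DeepSeek-V4-Flash: 4.6% of parameters active per token
```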
The key innovation is a hybrid attention architecture that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), cutting KV cache requirements at 1M tokens to just 10% of DeepSeek-V3.2 levels. The series also introduces Manifold-Constrained Hyper-Connections (mHC), which replace standard residual connections for improved training stability, adopts the Muon optimizer for faster convergence, and applies On-Policy Distillation from multiple domain experts during post-training.
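To make the 10% figure concrete, here is a rough KV-cache sizing sketch at a one-million-token context. The layer count, head count, and head dimension are illustrative assumptions, not published V4 hyperparameters; only the reduction factor comes from the announcement.

```python
# Rough KV-cache sizing at a 1M-token context (a sketch, not DeepSeek's numbers).
# LAYERS, KV_HEADS, and HEAD_DIM are assumed values for illustration only;
# the 0.10 reduction factor is the one reported for CSA + HCA.

SEQ_LEN = 1_000_000      # context length in tokens
LAYERS = 61              # assumed transformer depth
KV_HEADS = 128           # assumed KV heads (pre-compression)
HEAD_DIM = 128           # assumed per-head dimension
BYTES = 2                # fp16/bf16

dense_kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * SEQ_LEN * BYTES  # K and V
compressed_kv_bytes = 0.10 * dense_kv_bytes  # reported CSA+HCA reduction

print(f"dense KV cache:      {dense_kv_bytes / 2**40:.1f} TiB")   # ~3.6 TiB
print(f"compressed KV cache: {compressed_kv_bytes / 2**40:.2f} TiB")  # ~0.36 TiB
```

Even under generous assumptions, a dense cache at this context length runs into terabytes per request, which is why the compression is what makes 1M-token serving economical.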
Technical Paper: DeepSeek-V4 (Hugging Face)
Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates
Google DeepMind researchers have introduced Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that addresses the fragility of conventional distributed training by decoupling compute into asynchronous, fault-isolated "islands" called learner units. This lets large language models be pre-trained across geographically distant data centers without the tight synchronization step that bottlenecks standard methods.
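The general DiLoCo recipe underneath this is: each island takes many cheap local optimizer steps, then a rare outer step averages the islands' weight deltas (pseudo-gradients) and applies them with Nesterov momentum. The toy below sketches that inner/outer loop on a quadratic objective; the asynchrony and fault isolation that distinguish Decoupled DiLoCo are deliberately left out, and every constant is illustrative.

```python
import numpy as np

# Minimal DiLoCo-style inner/outer loop (a sketch of the general DiLoCo recipe,
# not DeepMind's Decoupled implementation; objective and step counts are toys).

rng = np.random.default_rng(0)
dim, n_islands, inner_steps, outer_rounds = 16, 4, 50, 10
lr_inner, lr_outer, momentum = 0.05, 0.7, 0.9

target = rng.normal(size=dim)           # toy objective: recover `target`
global_params = np.zeros(dim)
outer_velocity = np.zeros(dim)

def local_grad(params):
    return params - target              # gradient of 0.5 * ||params - target||^2

for _ in range(outer_rounds):
    deltas = []
    for _island in range(n_islands):    # each "learner unit" trains in isolation
        p = global_params.copy()
        for _ in range(inner_steps):    # many cheap local steps, no communication
            p -= lr_inner * local_grad(p)
        deltas.append(global_params - p)  # pseudo-gradient: start minus end
    # Rare outer sync: average pseudo-gradients, apply Nesterov-style momentum.
    pseudo_grad = np.mean(deltas, axis=0)
    outer_velocity = momentum * outer_velocity + pseudo_grad
    global_params -= lr_outer * (momentum * outer_velocity + pseudo_grad)

print("distance to target:", np.linalg.norm(global_params - target))
```

The communication savings fall out of the structure: islands only exchange one weight delta per outer round instead of gradients every step, which is what allows training over ordinary internet links.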
The architecture reduces inter-datacenter bandwidth requirements from 198 Gbps to just 0.84 Gbps across eight data centers, making globally distributed training feasible over standard internet infrastructure. In simulations with 1.2 million chips, Decoupled DiLoCo sustained 88% goodput under high failure rates versus 27% for standard data-parallel training, with its self-healing behavior demonstrated through chaos-engineering experiments that deliberately inject failures. The approach was further validated by training a 12B-parameter model across four U.S. regions more than 20 times faster than conventional synchronous methods.
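As a quick sanity check, the reported figures imply roughly a 236x cut in cross-datacenter bandwidth and about 3.3x the goodput of the data-parallel baseline under failures:

```python
# Arithmetic over the figures reported in the article; no other assumptions.

baseline_gbps, decoupled_gbps = 198.0, 0.84
print(f"bandwidth reduction: {baseline_gbps / decoupled_gbps:.0f}x")  # ~236x

goodput_decoupled, goodput_dp = 0.88, 0.27
print(f"goodput vs data-parallel: {goodput_decoupled / goodput_dp:.1f}x")  # ~3.3x
```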