NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing
NVIDIA researchers have introduced Star Elastic, a post-training method that embeds multiple nested reasoning models (at 30B, 23B, and 12B parameter scales) inside a single checkpoint from a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces nested 23B (2.8B active) and 12B (2.0B active) variants trained on approximately 160B tokens. All three variants coexist in one checkpoint and can be extracted without any additional fine-tuning, eliminating the need to train or store separate model variants.
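Zero-shot slicing can be pictured as reading a smaller variant directly out of the shared checkpoint by keeping the highest-ranked leading slice of each elastic dimension, with no further training. The sketch below is only illustrative: the tensor names, component counts, and checkpoint layout are hypothetical, not the actual Nemotron Nano v3 format.

```python
import torch

# Illustrative per-budget component counts (hypothetical numbers).
# Because components are already sorted by importance, a smaller budget
# simply keeps the leading slice of each elastic dimension.
BUDGET_SPECS = {
    "30B": {"ffn_channels": 8192, "experts": 64},
    "23B": {"ffn_channels": 6144, "experts": 48},
    "12B": {"ffn_channels": 3584, "experts": 28},
}

def slice_checkpoint(full_state: dict, budget: str) -> dict:
    """Extract a nested submodel from the single elastic checkpoint.

    Assumes weights were permuted so the most important channels/experts
    occupy the lowest indices (nested weight-sharing). Attention heads,
    Mamba SSM heads, and embedding channels would be sliced analogously.
    """
    spec = BUDGET_SPECS[budget]
    sliced = {}
    for name, tensor in full_state.items():
        if "ffn.up_proj" in name:        # shape [ffn_channels, hidden]
            sliced[name] = tensor[: spec["ffn_channels"]].clone()
        elif "ffn.down_proj" in name:    # shape [hidden, ffn_channels]
            sliced[name] = tensor[:, : spec["ffn_channels"]].clone()
        elif "experts" in name:          # shape [num_experts, ...]
            sliced[name] = tensor[: spec["experts"]].clone()
        else:                            # shared tensors are kept as-is
            sliced[name] = tensor.clone()
    return sliced

# Usage sketch: the 12B variant is obtained with no retraining.
# state_12b = slice_checkpoint(torch.load("elastic_ckpt.pt"), "12B")
```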
The method uses importance estimation to score model components (embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels) by their contribution to accuracy, then sorts them by rank so that smaller-budget submodels use the highest-ranked contiguous subset of components from the larger model, a property called nested weight-sharing. Star Elastic employs an end-to-end trainable router that takes a target budget as a one-hot input and outputs differentiable masks selecting the active components; the router is trained jointly with the model via Gumbel-Softmax, which lets gradients flow through the discrete architectural decisions. The loss combines knowledge distillation (with the non-elastified parent as teacher) and a router loss penalizing deviation from the target resource budget.
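A minimal PyTorch-style sketch of how such a budget-conditioned router and loss could be wired up (module and loss names here are illustrative assumptions, not the released training code): the one-hot budget selects per-component mask logits, a straight-through Gumbel-Softmax keeps the keep/drop decisions differentiable, and a budget-deviation penalty is added to the distillation term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Maps a one-hot budget vector to differentiable keep/drop masks
    over n_components (e.g. FFN channels, experts, or heads)."""
    def __init__(self, n_budgets: int, n_components: int):
        super().__init__()
        # One (keep, drop) logit pair per component, per budget.
        self.logits = nn.Parameter(torch.zeros(n_budgets, n_components, 2))

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Select the logits of the requested budget.
        logits = torch.einsum("b,bnc->nc", budget_onehot, self.logits)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the forward
        # pass, soft gradients in the backward pass.
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]
        return mask  # shape [n_components], values in {0, 1}

def elastic_loss(student_logits, teacher_logits, mask, target_ratio, lam=1.0):
    # Knowledge distillation against the non-elastified parent (teacher).
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Router loss: penalize deviation of the kept fraction from the budget.
    router = (mask.mean() - target_ratio) ** 2
    return kd + lam * router
```

The `hard=True` straight-through estimator is what makes this trainable end to end: the forward pass sees binary masks (a concrete submodel), while gradients flow through the soft relaxation to both the router logits and the shared weights.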
Star Elastic enables elastic budget control by using different nested submodels for different reasoning phases: the optimal configuration (ℳS → ℳL) uses a cheaper model for extended reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control.

Quantization-Aware Distillation (QAD) applied directly to the elastic checkpoint preserves the nested mask hierarchy, allowing zero-shot slicing of quantized variants; for NVFP4, a short QAD phase brings 30B variant recovery to 97.79% of BF16 accuracy. Storage efficiency is significant: storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB, while the single elastic checkpoint requires 58.9 GB, and the 30B NVFP4 elastic checkpoint fits in 18.7 GB, enabling the 12B NVFP4 variant to run on an RTX 5080 where every BF16 configuration runs out of memory.
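The ℳS → ℳL schedule described above can be pictured as a two-phase decoding loop over the same weights: the sliced submodel writes the long reasoning trace, then the full-budget model is switched in (only the masks change, nothing is reloaded) to produce the final answer. A minimal sketch under that assumption, using hypothetical `set_budget` and `generate` helpers:

```python
def budgeted_reasoning(model, prompt_ids, max_think=4096, max_answer=512):
    """Two-phase decoding with nested submodels from one elastic checkpoint.

    `model.set_budget(...)` and `model.generate(...)` are hypothetical helpers:
    switching the budget only changes which components are mask-active,
    so no weights are reloaded between the two phases.
    """
    # Phase 1: the cheaper nested submodel (e.g. the 23B slice) produces
    # the long reasoning trace.
    model.set_budget("23B")
    trace_ids = model.generate(prompt_ids, max_new_tokens=max_think)

    # Phase 2: the full-capacity 30B model synthesizes the final answer,
    # conditioned on the prompt plus the cheaper model's reasoning trace.
    model.set_budget("30B")
    answer_ids = model.generate(trace_ids, max_new_tokens=max_answer)
    return answer_ids
```

The storage numbers follow from the same nesting: because the 23B and 12B variants reuse subsets of the 30B weights, the elastic checkpoint costs roughly the same as the 30B checkpoint alone (58.9 GB) rather than the 126.1 GB of three separate BF16 copies.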
Research paper: Star Elastic: One Checkpoint that Contains Multiple Reasoning Models (PDF)