NVIDIA X‑Token Beats GOLD, Boosts Llama‑3.2‑1B by 3.8 Points

Knowledge distillation (KD) lets a small student learn from a large teacher’s full output distribution, but standard KD requires the student and teacher to share a tokenizer. When a practitioner wants to distill strong models like Qwen3‑4B or Phi‑4‑mini into a Llama‑3.2‑1B student, the vocabularies do not align, so per‑position KL divergence cannot be computed directly. Prior cross‑tokenizer methods such as GOLD work around this by splitting tokens into a “common” set trained with KL and an “uncommon” set trained with rank‑based matching. This creates two serious problems. First, critical tokens, such as multi‑digit numbers that Llama treats as a single token but Qwen splits into digits, fall into the uncommon set, where they receive identity‑agnostic noise and suppressive gradients that drive their probabilities down regardless of the correct answer. Second, GOLD’s strict string‑equality matching discards useful alignments such as “Hundreds” ↔ “Hund”+“reds”, throwing away valuable signal.

X‑Token addresses these two failures with three lightweight, drop‑in components. First, a dynamic‑programming span alignment groups teacher and student tokens into chunks that decode to the same text substring, handling tokenizer offsets and special‑token mismatches without adding per‑step cost. Second, a deterministic projection matrix W maps each student token to a weighted combination of teacher tokens. W is built in two passes: exact matches get weight 1, and unmatched student tokens receive exponentially decayed weights over the teacher’s re‑tokenization of the same string (up to four sub‑tokens). Each row is then normalized so that Wᵀ maps a student distribution to a valid distribution in teacher space.

Third, two complementary loss modes are chosen automatically by a coverage audit of critical token categories. P‑KL removes the partition entirely and projects the student distribution into teacher space via W, which eliminates suppressive gradients and rank noise. It is the better fit when important tokens lie outside the common set (e.g., Qwen3‑4B). H‑KL retains the partition but relaxes matching to the top‑1 teacher token under W, preserving fine‑grained supervision when the partition is already sound (e.g., Phi‑4‑mini).

On top of these, dynamic KD/CE scaling rescales the distillation loss at each step to match the magnitude of the cross‑entropy (CE) loss, removing the need for hand‑tuned loss weights. For multiple teachers, each teacher gets its own Wₘ and loss mode, and the total loss is a weighted sum. Static weights work best, and the gains come from teacher complementarity rather than from simply adding more models.
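To make the projection step concrete, here is a minimal NumPy sketch of how a matrix like W could be built and used for a P‑KL‑style loss. It is an illustration under stated assumptions, not the authors’ implementation: the toy vocabularies, the `teacher_retokenize` lookup (a stand‑in for running the real teacher tokenizer), the function names, the decay base of 0.5, and the forward KL direction KL(teacher ‖ projected student) are all hypothetical choices that the post does not specify.

```python
import numpy as np

# Toy vocabularies (hypothetical; real vocabularies have ~100k+ entries).
student_vocab = ["The", "Hundreds", "42"]
teacher_vocab = ["The", "Hund", "reds", "4", "2"]

# Hypothetical stand-in for re-tokenizing a student token's text
# with the teacher tokenizer.
teacher_retokenize = {"Hundreds": ["Hund", "reds"], "42": ["4", "2"]}


def build_projection(student_vocab, teacher_vocab, retokenize,
                     decay=0.5, max_subtokens=4):
    """Build a row-normalized student-to-teacher projection matrix.

    Pass 1: exact string matches get weight 1.
    Pass 2: unmatched student tokens get weights decay**k over the first
    `max_subtokens` teacher sub-tokens (the decay base is an assumption).
    """
    teacher_index = {tok: j for j, tok in enumerate(teacher_vocab)}
    W = np.zeros((len(student_vocab), len(teacher_vocab)))
    for i, tok in enumerate(student_vocab):
        if tok in teacher_index:
            W[i, teacher_index[tok]] = 1.0
            continue
        pieces = retokenize.get(tok, [])[:max_subtokens]
        for k, piece in enumerate(pieces):
            if piece in teacher_index:
                W[i, teacher_index[piece]] += decay ** k
    # Normalize each non-empty row so projection preserves probability mass.
    row_sums = W.sum(axis=1, keepdims=True)
    return W / np.where(row_sums > 0, row_sums, 1.0)


def projected_kl(student_probs, teacher_probs, W, eps=1e-12):
    """KL(teacher || student projected into teacher space).

    The KL direction is an assumption made for this sketch.
    """
    projected = student_probs @ W  # student distribution in teacher space
    log_ratio = np.log(teacher_probs + eps) - np.log(projected + eps)
    return float(np.sum(teacher_probs * log_ratio))


W = build_projection(student_vocab, teacher_vocab, teacher_retokenize)
student_probs = np.array([0.2, 0.5, 0.3])            # over student_vocab
teacher_probs = np.array([0.1, 0.4, 0.2, 0.2, 0.1])  # over teacher_vocab
projected = student_probs @ W
loss = projected_kl(student_probs, teacher_probs, W)
print("projected student distribution:", projected)
print("projected KL:", loss)
```

Because every student token here maps to at least one teacher token, each row of `W` sums to 1 and the projected vector is still a probability distribution. In this toy setup no token is pushed into a separate “uncommon” bucket: “Hundreds” and “42” contribute their mass to the teacher’s sub‑tokens instead. A real training loop would apply the same projection to softmax outputs inside an autodiff framework, at positions paired up by the span alignment.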

The result: on Llama‑3.2‑1B, X‑Token with a Qwen3‑4B teacher (P‑KL) lifts average accuracy from 35.03 (GOLD) to 38.85 and recovers GSM8K from 2.56 to 15.54, surpassing same‑tokenizer KD from a stronger Llama‑3.2‑3B teacher. With a Phi‑4‑mini teacher (H‑KL) the average reaches 39.18, and a multi‑teacher setup combining Phi‑4‑mini and Llama‑3.2‑3B reaches 40.48.

#AI #Product #MachineLearning #KnowledgeDistillation #NLP #LLM