Why was a new multilingual encoder needed?
Multilingual natural language processing (NLP) has advanced significantly over the past five years, with models like XLM-RoBERTa (XLM-R) leading the charge. However, as research shifted toward decoder-based generative models, the development of efficient multilingual encoders stagnated, even though encoders remain the workhorse for embedding, retrieval, and classification. To fill this gap, researchers at Johns Hopkins University introduced mmBERT, a modern multilingual encoder that outperforms XLM-R and, on low-resource language benchmarks, even competes with large generative models such as OpenAI’s o3 and Google’s Gemini 2.5 Pro.
Understanding the architecture of mmBERT
mmBERT is offered in two configurations:
- Base model: 22 transformer layers with 1152 hidden dimensions, containing approximately 307 million parameters.
- Small model: Around 140 million parameters.
Both configurations use the Gemma 2 tokenizer, with a vocabulary of about 256,000 tokens, and incorporate rotary position embeddings (RoPE) and FlashAttention2 for efficiency. A key improvement is the extended sequence length, which grows from 1,024 to 8,192 tokens, letting mmBERT handle far longer contexts than XLM-R while still delivering faster inference.
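For readers who want to inspect these details directly, here is a minimal sketch that loads the model with Hugging Face transformers and prints the relevant configuration fields. The hub id jhu-clsp/mmBERT-base is an assumption about the public release name; adjust it to the actual repository if it differs.

```python
# Minimal sketch: load mmBERT and compare its config against the figures above.
# "jhu-clsp/mmBERT-base" is an assumed hub id, not confirmed by this article.
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

print("layers:", model.config.num_hidden_layers)            # compare with 22
print("hidden size:", model.config.hidden_size)             # compare with the dims quoted above
print("max positions:", model.config.max_position_embeddings)  # compare with 8192
print("vocab size:", tokenizer.vocab_size)                  # Gemma 2 tokenizer, ~256k
print("params (M):", sum(p.numel() for p in model.parameters()) / 1e6)
```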
What training data and phases were used?
mmBERT was trained on an extensive dataset of 3 trillion tokens spanning 1,833 languages, drawn from sources including FineWeb2, Dolma, and MegaWika v2. English accounts for only about 10% to 34% of the corpus, depending on the training phase. Training was divided into three major stages (summarized in the sketch after this list):
- Pre-training: Utilizing 2.3 trillion tokens across 60 languages and code.
- Mid-training: Consisting of 600 billion tokens across 110 languages, focusing on higher-quality data.
- Decay phase: Covering 100 billion tokens across all 1,833 languages, emphasizing the adaptation of low-resource languages.
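The token budget can be summarized in a few lines of Python. The token and language counts come straight from the text; the field names and structure are only for illustration.

```python
# Illustrative summary of the three training phases described above.
phases = [
    {"name": "pre-training", "tokens": 2.3e12, "languages": 60,
     "note": "bulk of the compute, plus code"},
    {"name": "mid-training", "tokens": 600e9, "languages": 110,
     "note": "higher-quality data"},
    {"name": "decay", "tokens": 100e9, "languages": 1833,
     "note": "all languages, low-resource adaptation"},
]

total = sum(p["tokens"] for p in phases)
print(f"total: {total / 1e12:.1f}T tokens")  # 3.0T, matching the figure above
for p in phases:
    share = 100 * p["tokens"] / total
    print(f"{p['name']:>13}: {p['tokens'] / 1e9:>6.0f}B tokens, "
          f"{p['languages']:>4} languages ({share:.1f}% of total)")
```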
What new training strategies were introduced?
mmBERT employs three innovative training strategies that significantly boost its performance (the first two are sketched after this list):
- Annealed Language Learning (ALL): This approach gradually introduces languages, starting from 60 and increasing to 1,833, allowing low-resource languages to gain influence without overfitting.
- Inverse Masking Schedule: The initial masking ratio of 30% decreases to 5%, fostering coarse-grained learning at the start and shifting to fine-grained refinements as training progresses.
- Model Merging Across Decay Variants: Multiple models from the decay phase are combined with TIES merging, pooling their strengths without retraining from scratch.
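A rough sketch of the first two ideas follows (the TIES merge itself is omitted). The 30% and 5% masking endpoints come from the text; the intermediate masking value and the sampling exponents are illustrative assumptions, not the authors' exact schedule.

```python
# Sketch of annealed language sampling and an inverse masking schedule.
# Endpoint masking ratios (30% -> 5%) are from the text; intermediate values
# and the sampling exponents (tau) are assumptions for illustration only.
import numpy as np

def sampling_weights(corpus_sizes: np.ndarray, tau: float) -> np.ndarray:
    """Exponent-scaled sampling: tau=1 follows corpus size, tau->0 is near uniform."""
    w = corpus_sizes ** tau
    return w / w.sum()

phases = [
    # (phase name, masking ratio, sampling exponent)
    ("pre-training", 0.30, 0.7),
    ("mid-training", 0.15, 0.5),  # assumed intermediate values
    ("decay",        0.05, 0.3),
]

sizes = np.array([1e9, 1e7, 1e5])  # toy corpora: high-, mid-, low-resource
for name, mask_ratio, tau in phases:
    w = sampling_weights(sizes, tau)
    print(f"{name:>13}: mask {mask_ratio:.0%}, "
          f"low-resource sampling share {w[-1]:.4%}")
```

As the exponent is annealed downward, the low-resource corpus receives a growing share of samples, while the shrinking masking ratio moves the objective from coarse-grained to fine-grained learning.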
How does mmBERT perform on benchmarks?
When tested against various benchmarks, mmBERT has delivered impressive results:
- In the English NLU (GLUE) benchmark, mmBERT base achieved a score of 86.3, outperforming XLM-R’s score of 83.3 and nearly matching ModernBERT’s 87.4.
- For multilingual NLU (XTREME), mmBERT base received a score of 72.8, surpassing XLM-R’s 70.4.
- In embedding tasks (MTEB v2), mmBERT base tied ModernBERT in English and outperformed XLM-R in multilingual tasks.
- In code retrieval (CoIR), mmBERT exceeded XLM-R by approximately 9 points, though it still fell short of EuroBERT on proprietary data.
How does mmBERT handle low-resource languages?
Thanks to its annealed language-learning schedule, mmBERT provides substantial support for low-resource languages. On benchmarks such as Faroese FoQA and Tigrinya TiQuAD, it outperformed both o3 and Gemini 2.5 Pro. These results show that, with careful training, encoder models can generalize effectively even in low-resource settings.
What efficiency gains does mmBERT achieve?
Among the notable improvements, mmBERT runs 2 to 4 times faster than XLM-R and MiniLM while accepting inputs of up to 8,192 tokens. Remarkably, it processes these long sequences faster than older encoders handled their much shorter ones. This efficiency stems from the ModernBERT-style architecture and training recipe, in particular its optimized attention mechanism and embedding layers.
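As a hedged illustration of the long-context embedding use case, the sketch below mean-pools the final hidden states over a long document. The hub id is assumed as before, and mean pooling is a common convention rather than a method prescribed by the mmBERT authors.

```python
# Sketch: embedding a long document with mean pooling over final hidden states.
# "jhu-clsp/mmBERT-base" is an assumed hub id; pooling choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

long_text = "mmBERT handles long multilingual documents. " * 500  # stand-in document
inputs = tokenizer(long_text, return_tensors="pt",
                   truncation=True, max_length=8192)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (hidden * mask).sum(1) / mask.sum(1)      # average over real tokens
print(embedding.shape)                                # (1, hidden_dim)
```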
Summary
In conclusion, mmBERT represents a significant advance in multilingual encoders, well suited to the needs of modern NLP applications. Running 2 to 4 times faster than previous models and processing much longer sequences, it not only surpasses its predecessors but also provides a strong foundation for future multilingual NLP systems. Its training methods show how deliberate design choices, rather than sheer scale, can deliver broad generalization and improved performance.
Frequently Asked Questions
- What makes mmBERT different from other multilingual models? mmBERT utilizes a unique training strategy that emphasizes low-resource languages and efficient processing of long sequences.
- Can mmBERT handle rare languages effectively? Yes, it has been specifically designed to support low-resource languages using its annealed learning approach.
- How does mmBERT compare to XLM-R? mmBERT outperforms XLM-R on multiple benchmarks, achieving higher scores in both English and multilingual tasks.
- What types of tasks is mmBERT best suited for? It excels in embedding, retrieval, and classification tasks, making it versatile for various applications in NLP.
- Where can I access mmBERT for my projects? You can find mmBERT on platforms like Hugging Face and GitHub, where tutorials and technical details are also available.