Why was a new multilingual encoder needed?
Multilingual natural language processing (NLP) has advanced significantly over the past five years, with models like XLM-RoBERTa (XLM-R) leading the charge. However, as research shifted toward decoder-based generative models, the development of efficient multilingual encoders stagnated, even though encoders remain the workhorse for embedding, retrieval, and classification. To fill this gap, researchers at Johns Hopkins University introduced mmBERT, a modern multilingual encoder that outperforms XLM-R and, on low-resource language benchmarks, even competes with large generative models such as OpenAI’s o3 and Google’s Gemini 2.5 Pro.
Understanding the architecture of mmBERT
mmBERT is offered in two configurations:
- Base model: 22 transformer layers with 1152 hidden dimensions, containing approximately 307 million parameters.
- Small model: Around 140 million parameters.
Both configurations use the Gemma 2 tokenizer, with a vocabulary of about 256,000 tokens, and incorporate rotary position embeddings (RoPE) and FlashAttention2 for efficiency. A key improvement is the extended sequence length, which grows from 1,024 to 8,192 tokens, letting mmBERT handle far longer contexts than XLM-R while still delivering faster inference.
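For readers who want to inspect these details directly, here is a minimal sketch that loads the model with Hugging Face transformers and prints the relevant configuration fields. The hub id jhu-clsp/mmBERT-base is an assumption about the public release name; adjust it to the actual repository if it differs.

```python
# Minimal sketch: load mmBERT and compare its config against the figures above.
# "jhu-clsp/mmBERT-base" is an assumed hub id, not confirmed by this article.
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

print("layers:", model.config.num_hidden_layers)            # compare with 22
print("hidden size:", model.config.hidden_size)             # compare with the dims quoted above
print("max positions:", model.config.max_position_embeddings)  # compare with 8192
print("vocab size:", tokenizer.vocab_size)                  # Gemma 2 tokenizer, ~256k
print("params (M):", sum(p.numel() for p in model.parameters()) / 1e6)
```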
What training data and phases were used?
mmBERT was trained on an extensive dataset of 3 trillion tokens spanning 1,833 languages, drawn from sources including FineWeb2, Dolma, and MegaWika v2. English accounts for only about 10% to 34% of the corpus, depending on the training phase. Training was divided into three major stages (summarized in the sketch after this list):
- Pre-training: Utilizing 2.3 trillion tokens across 60 languages and code.
- Mid-training: Consisting of 600 billion tokens across 110 languages, focusing on higher-quality data.
- Decay phase: Covering 100 billion tokens across all 1,833 languages, emphasizing the adaptation of low-resource languages.
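The token budget can be summarized in a few lines of Python. The token and language counts come straight from the text; the field names and structure are only for illustration.

```python
# Illustrative summary of the three training phases described above.
phases = [
    {"name": "pre-training", "tokens": 2.3e12, "languages": 60,
     "note": "bulk of the compute, plus code"},
    {"name": "mid-training", "tokens": 600e9, "languages": 110,
     "note": "higher-quality data"},
    {"name": "decay", "tokens": 100e9, "languages": 1833,
     "note": "all languages, low-resource adaptation"},
]

total = sum(p["tokens"] for p in phases)
print(f"total: {total / 1e12:.1f}T tokens")  # 3.0T, matching the figure above
for p in phases:
    share = 100 * p["tokens"] / total
    print(f"{p['name']:>13}: {p['tokens'] / 1e9:>6.0f}B tokens, "
          f"{p['languages']:>4} languages ({share:.1f}% of total)")
```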
What new training strategies were introduced?
mmBERT employs three innovative training strategies that significantly boost its performance (the first two are sketched after this list):
- Annealed Language Learning (ALL): This approach gradually introduces languages, starting from 60 and increasing to 1,833, allowing low-resource languages to gain influence without overfitting.
- Inverse Masking Schedule: The initial masking ratio of 30% decreases to 5%, fostering coarse-grained learning at the start and shifting to fine-grained refinements as training progresses.
- Model Merging Across Decay Variants: Multiple models from the decay phase are combined with TIES merging, pooling their strengths without retraining from scratch.
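A rough sketch of the first two ideas follows (the TIES merge itself is omitted). The 30% and 5% masking endpoints come from the text; the intermediate masking value and the sampling exponents are illustrative assumptions, not the authors' exact schedule.

```python
# Sketch of annealed language sampling and an inverse masking schedule.
# Endpoint masking ratios (30% -> 5%) are from the text; intermediate values
# and the sampling exponents (tau) are assumptions for illustration only.
import numpy as np

def sampling_weights(corpus_sizes: np.ndarray, tau: float) -> np.ndarray:
    """Exponent-scaled sampling: tau=1 follows corpus size, tau->0 is near uniform."""
    w = corpus_sizes ** tau
    return w / w.sum()

phases = [
    # (phase name, masking ratio, sampling exponent)
    ("pre-training", 0.30, 0.7),
    ("mid-training", 0.15, 0.5),  # assumed intermediate values
    ("decay",        0.05, 0.3),
]

sizes = np.array([1e9, 1e7, 1e5])  # toy corpora: high-, mid-, low-resource
for name, mask_ratio, tau in phases:
    w = sampling_weights(sizes, tau)
    print(f"{name:>13}: mask {mask_ratio:.0%}, "
          f"low-resource sampling share {w[-1]:.4%}")
```

As the exponent is annealed downward, the low-resource corpus receives a growing share of samples, while the shrinking masking ratio moves the objective from coarse-grained to fine-grained learning.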
How does mmBERT perform on benchmarks?
When tested against various benchmarks, mmBERT has delivered impressive results:
- In the English NLU (GLUE) benchmark, mmBERT base achieved a score of 86.3, outperforming XLM-R’s score of 83.3 and nearly matching ModernBERT’s 87.4.
- For multilingual NLU (XTREME), mmBERT base received a score of 72.8, surpassing XLM-R’s 70.4.
- In embedding tasks (MTEB v2), mmBERT base tied ModernBERT in English and outperformed XLM-R in multilingual tasks.
- In code retrieval (CoIR), mmBERT exceeded XLM-R by approximately 9 points, though it still fell short of EuroBERT on proprietary data.
How does mmBERT handle low-resource languages?
Thanks to its annealed language-learning schedule, mmBERT provides substantial support for low-resource languages. On benchmarks such as Faroese FoQA and Tigrinya TiQuAD, it outperformed both o3 and Gemini 2.5 Pro. These results show that, with careful training, encoder models can generalize effectively even in low-resource settings.
What efficiency gains does mmBERT achieve?
Among the notable improvements, mmBERT runs 2 to 4 times faster than XLM-R and MiniLM while accepting inputs of up to 8,192 tokens. Remarkably, it processes these long sequences faster than older encoders handled their much shorter ones. This efficiency stems from the ModernBERT-style architecture and training recipe, in particular its optimized attention mechanism and embedding layers.
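As a hedged illustration of the long-context embedding use case, the sketch below mean-pools the final hidden states over a long document. The hub id is assumed as before, and mean pooling is a common convention rather than a method prescribed by the mmBERT authors.

```python
# Sketch: embedding a long document with mean pooling over final hidden states.
# "jhu-clsp/mmBERT-base" is an assumed hub id; pooling choice is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

long_text = "mmBERT handles long multilingual documents. " * 500  # stand-in document
inputs = tokenizer(long_text, return_tensors="pt",
                   truncation=True, max_length=8192)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (hidden * mask).sum(1) / mask.sum(1)      # average over real tokens
print(embedding.shape)                                # (1, hidden_dim)
```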
Summary
In conclusion, mmBERT represents a significant advance in multilingual encoders, well suited to the needs of modern NLP applications. Running 2 to 4 times faster than previous models and processing much longer sequences, it not only surpasses its predecessors but also provides a strong foundation for future multilingual NLP systems. Its training methods show how deliberate design choices, rather than sheer scale, can deliver broad generalization and improved performance.
Frequently Asked Questions
- What makes mmBERT different from other multilingual models? mmBERT utilizes a unique training strategy that emphasizes low-resource languages and efficient processing of long sequences.
- Can mmBERT handle rare languages effectively? Yes, it has been specifically designed to support low-resource languages using its annealed learning approach.
- How does mmBERT compare to XLM-R? mmBERT outperforms XLM-R on multiple benchmarks, achieving higher scores in both English and multilingual tasks.
- What types of tasks is mmBERT best suited for? It excels in embedding, retrieval, and classification tasks, making it versatile for various applications in NLP.
- Where can I access mmBERT for my projects? You can find mmBERT on platforms like Hugging Face and GitHub, where tutorials and technical details are also available.