Understanding MLPerf Inference v5.1
MLPerf Inference v5.1 is an industry-standard benchmark suite for evaluating AI inference performance across hardware configurations, including GPUs, CPUs, and specialized AI accelerators. Its results are aimed at AI researchers, data scientists, IT decision-makers, and business leaders who need to understand how different systems perform under specific workloads and make informed procurement and deployment decisions.
What MLPerf Inference Measures
MLPerf Inference quantifies how fast a complete system executes fixed, pre-trained models while meeting strict latency and accuracy constraints. Results are split into two suites, Datacenter and Edge, and every test uses standardized request patterns generated by LoadGen so that results are comparable across architectures. The Closed division fixes the model and preprocessing to allow direct comparisons; the Open division permits model changes, so its results may not be directly comparable.
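To make LoadGen's role concrete, here is a minimal Python sketch of a harness it could drive. The model call, sample counts, and scenario choice are placeholder assumptions rather than an official harness, and exact call signatures can differ between LoadGen releases.

```python
# Minimal harness sketch assuming the mlperf_loadgen Python bindings.
# run_model and the sample counts are placeholders; signatures may vary
# slightly between LoadGen releases.
import mlperf_loadgen as lg

def run_model(sample_index):
    # Placeholder for the actual inference call on the system under test.
    return b""

def issue_queries(query_samples):
    # LoadGen hands over a batch of QuerySample objects; run inference and
    # signal completion for each. A real harness would also attach the output
    # buffer so accuracy mode can check it.
    responses = []
    for qs in query_samples:
        run_model(qs.index)
        responses.append(lg.QuerySampleResponse(qs.id, 0, 0))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass  # nothing buffered in this sketch

def load_samples(indices):
    pass  # load the referenced samples into host memory

def unload_samples(indices):
    pass  # release them again

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline  # or Server / SingleStream / MultiStream
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(24576, 1024, load_samples, unload_samples)  # total / in-memory samples
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```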
Key Changes in v5.1
The v5.1 update, released on September 9, 2025, introduces three new workloads and expands interactive serving capabilities. The new benchmarks include:
- DeepSeek-R1: A benchmark focused on reasoning tasks.
- Llama-3.1-8B: A summarization model replacing GPT-J.
- Whisper Large V3: An automatic speech recognition (ASR) model.
This round drew 27 submitters, with new submissions from AMD, Intel, and NVIDIA among others, reflecting the growing diversity of AI hardware represented in the benchmark.
Understanding the Scenarios
MLPerf defines four serving patterns that correspond to real-world workloads:
- Offline: Focuses on maximizing throughput without latency constraints.
- Server: Mimics chat or agent backends with specific latency bounds.
- Single-Stream: Emphasizes strict latency for individual streams.
- Multi-Stream: Stresses concurrency by bundling multiple samples into each query and measuring tail latency.
Each scenario reports its own metric: maximum queries per second within the latency bound for Server, raw throughput for Offline, and tail latency for Single-Stream and Multi-Stream. The request patterns behind them are sketched below.
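The differences between these request patterns are easiest to see in code. The sketch below is plain Python, not LoadGen itself: Offline makes everything available at once, Server draws Poisson arrivals at a target rate, and Single-Stream waits for each response before issuing the next query (Multi-Stream additionally bundles several samples into each query).

```python
# Illustrative arrival-time generators for three scenarios (not LoadGen).
import random

def offline_arrivals(n):
    # All samples are available at t=0; the metric is pure throughput.
    return [0.0] * n

def server_arrivals(n, target_qps, seed=0):
    # Poisson process: exponentially distributed inter-arrival gaps,
    # evaluated against a latency bound.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(target_qps)
        times.append(t)
    return times

def single_stream_arrivals(per_query_latencies):
    # The next query is issued only after the previous response returns,
    # so the metric is per-query (tail) latency.
    t, times = 0.0, []
    for lat in per_query_latencies:
        times.append(t)
        t += lat
    return times

print(server_arrivals(5, target_qps=10.0))
```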
Latencies in Large Language Models (LLMs)
In v5.1, LLM tests report two critical latency metrics: TTFT (time to first token) and TPOT (time per output token). Llama-2-70B, for example, has server and interactive latency targets chosen to reflect user-perceived responsiveness, while Llama-3.1-405B carries higher latency limits because of its size and long context length, illustrating the trade-off between model capability and serving latency.
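As a rough illustration, TTFT and TPOT can be derived from the timestamps of a streamed response as in the sketch below; the timestamps are invented values, not measured results or official targets.

```python
# Hedged sketch: deriving TTFT and TPOT from per-token timestamps of a
# streamed LLM response. All numbers are illustrative.
def ttft_and_tpot(request_time, token_times):
    """TTFT: delay until the first output token.
    TPOT: mean gap between subsequent output tokens."""
    ttft = token_times[0] - request_time
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# Request sent at t=0.0 s; four tokens streamed back.
ttft, tpot = ttft_and_tpot(0.0, [0.45, 0.49, 0.53, 0.57])
print(f"TTFT = {ttft * 1000:.0f} ms, TPOT = {tpot * 1000:.0f} ms")
```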
Power Efficiency and Energy Claims
MLPerf optionally reports measured system wall-plug power and energy for the same runs, enabling energy-efficiency comparisons. Only runs with measured power are valid for such comparisons. The v5.1 results include both datacenter and edge power submissions, and broader participation in energy reporting is encouraged.
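One common way readers derive an efficiency figure from these reports is to divide throughput by average measured system power, as in the sketch below; the numbers are illustrative and not taken from any submission.

```python
# Throughput per watt (equivalently, samples per joule) from a measured run.
# Values are illustrative only.
def efficiency(samples_per_second, avg_system_watts):
    # Only meaningful when power was actually measured for the same run
    # that produced the throughput figure.
    return samples_per_second / avg_system_watts

print(f"{efficiency(12000.0, 3500.0):.2f} samples/joule")
```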
Interpreting the Results
When analyzing the results, compare Closed division entries against each other, since Open runs may use different models. Accuracy targets also affect achievable throughput, so normalize cautiously. Filtering by availability and including the power columns gives a clearer picture of efficiency.
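A sketch of that kind of filtering, assuming a hypothetical CSV export of the results table; the file name and column names are assumptions, not the official schema.

```python
# Hypothetical slicing of a results export; "Division", "Availability",
# "Power", and "Result" are assumed column names, not the official schema.
import pandas as pd

df = pd.read_csv("mlperf_inference_v5_1_results.csv")  # hypothetical export
closed = df[(df["Division"] == "Closed") & (df["Availability"] == "Available")]
with_power = closed.dropna(subset=["Power"])  # keep measured-power runs only
print(with_power.sort_values("Result", ascending=False).head())
```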
Practical Selection Playbook
To choose hardware based on MLPerf results, match your workload to the closest benchmark and scenario (a small lookup sketch follows this list):
- For interactive chat or agents, focus on Server-Interactive benchmarks with Llama-2-70B or Llama-3.1-8B.
- For batch summarization, look at Offline benchmarks with Llama-3.1-8B.
- For ASR applications, use Whisper V3 Server with strict latency bounds.
- For long-context analytics, evaluate the Llama-3.1-405B model, keeping in mind its latency limits.
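A minimal lookup sketch of the playbook above; the labels mirror the benchmark and scenario names used in this article rather than official result-table keys.

```python
# Use-case to (benchmark, scenario) mapping; labels are informal.
PLAYBOOK = {
    "interactive chat / agents": ("Llama-2-70B or Llama-3.1-8B", "Server-Interactive"),
    "batch summarization": ("Llama-3.1-8B", "Offline"),
    "speech recognition": ("Whisper Large V3", "Server"),
    "long-context analytics": ("Llama-3.1-405B", "Server or Offline"),
}

benchmark, scenario = PLAYBOOK["batch summarization"]
print(f"Compare {scenario} results for {benchmark}")
```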
Conclusion
MLPerf Inference v5.1 offers actionable insights for comparing AI system performance. By aligning with the benchmark’s rules and focusing on the Closed division, users can make informed decisions based on scenario-specific metrics and energy efficiency. The introduction of new workloads and broader hardware participation signals a significant step forward in understanding AI performance across various applications.
FAQ
- What is MLPerf Inference? MLPerf Inference is a benchmark that measures the performance of AI systems executing pre-trained models under specific latency and accuracy constraints.
- Who benefits from MLPerf Inference results? AI researchers, data scientists, IT decision-makers, and business leaders can all benefit from understanding how different hardware configurations perform.
- What are the key changes in v5.1? The v5.1 update introduces new workloads, including DeepSeek-R1, Llama-3.1-8B, and Whisper Large V3, expanding the scope of benchmarking.
- How should I interpret the results? Focus on Closed division comparisons, match accuracy targets, and consider power efficiency when evaluating performance.
- What are the main latency metrics reported for LLMs? The main latency metrics are TTFT (time-to-first-token) and TPOT (time-per-output-token), which reflect user-perceived responsiveness.