As artificial intelligence continues to evolve, particularly in the realm of software engineering, the need for effective performance optimization is becoming increasingly critical. Researchers from TikTok and their collaborators have taken a significant step forward by introducing SWE-Perf, the first benchmark specifically designed to assess the performance optimization capabilities of large language models (LLMs) at the repository level. This innovation is essential for understanding how LLMs can enhance code performance in real-world applications.
Why SWE-Perf Matters
Traditional benchmarks have primarily focused on correctness or function-level efficiency, which often overlooks the complexities involved in optimizing large, modular codebases. Real-world software projects consist of interdependent components, where performance tuning requires a deep understanding of cross-file interactions and execution paths. SWE-Perf addresses this gap by providing a comprehensive framework to evaluate LLMs in a more realistic context.
Building the SWE-Perf Dataset
The SWE-Perf dataset is constructed from over 100,000 pull requests across notable GitHub repositories. The dataset includes (one hypothetical way to represent an instance is sketched after this list):
- 140 curated instances demonstrating measurable and stable performance improvements.
- Complete codebases before and after optimization.
- Target functions categorized as oracle (file-level) or realistic (repo-level).
- Unit tests and Docker environments to ensure reproducibility.
- Expert-authored patches serving as gold standards.
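For illustration only, the sketch below shows one way such an instance could be laid out in code. The field names are hypothetical and are not taken from the SWE-Perf release; they simply mirror the components listed above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PerfInstance:
    """Hypothetical layout of a single SWE-Perf-style instance (field names are illustrative)."""
    repo: str                    # e.g. "org/project" on GitHub
    pull_request: int            # pull request the optimization was drawn from
    base_commit: str             # codebase state before the optimization
    optimized_commit: str        # codebase state after the expert patch
    target_functions: List[str]  # fully qualified names, oracle (file-level) or realistic (repo-level)
    unit_tests: List[str]        # tests that must pass before and after the patch
    docker_image: str            # pinned environment for reproducible runs
    expert_patch: str            # gold-standard diff authored by a human expert
```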
To validate each patch, the associated unit tests must pass both before and after the optimization, and the optimized code must show statistically significant runtime gains across repeated measurements. This rigorous approach ensures that the reported improvements are genuine rather than measurement noise; a rough sketch of such a check follows.
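The paper defines the exact statistical procedure; the sketch below only illustrates the idea under the assumption that each test's runtime is sampled repeatedly. The Mann-Whitney U test and the 5% minimum-gain threshold are stand-ins, not the official criteria.

```python
from statistics import mean
from scipy.stats import mannwhitneyu  # stand-in significance test; the paper's procedure may differ

def is_stable_improvement(runtimes_before, runtimes_after, alpha=0.05, min_gain=0.05):
    """Accept an optimization only if repeated runs show a statistically
    significant and non-trivial runtime reduction."""
    # One-sided test: are post-patch runtimes significantly lower than pre-patch runtimes?
    _, p_value = mannwhitneyu(runtimes_after, runtimes_before, alternative="less")
    gain = (mean(runtimes_before) - mean(runtimes_after)) / mean(runtimes_before)
    return p_value < alpha and gain >= min_gain

# Example: 20 timed runs of the same unit test before and after a patch (synthetic data)
before = [1.92, 1.95, 1.90, 1.97, 1.93] * 4
after = [1.61, 1.63, 1.60, 1.66, 1.62] * 4
print(is_stable_improvement(before, after))  # True for this synthetic data
```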
Benchmark Settings: Oracle vs. Realistic
SWE-Perf operates under two distinct settings (a minimal sketch of the difference follows the list):
- Oracle Setting: The model is provided with only the target functions and corresponding files, focusing on localized optimization skills.
- Realistic Setting: The model receives the entire repository, requiring it to autonomously identify and optimize performance-critical paths, mirroring the work of human engineers.
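The sketch below contrasts the context a model might receive in each setting. The helper name and its behavior are assumptions for illustration, not part of the SWE-Perf harness.

```python
from pathlib import Path

def build_model_context(repo_root: Path, target_files=None):
    """Assemble the context handed to the model.

    Oracle setting: pass the files containing the target functions.
    Realistic setting: pass target_files=None and expose the whole repository,
    leaving it to the model (or agent) to locate performance-critical code.
    """
    if target_files is not None:
        # Oracle: only the files that contain the annotated target functions
        paths = [repo_root / f for f in target_files]
    else:
        # Realistic: every Python source file in the repository
        paths = sorted(repo_root.rglob("*.py"))
    return {str(p.relative_to(repo_root)): p.read_text() for p in paths}
```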
Evaluation Metrics
The evaluation framework of SWE-Perf is three-tiered, assessing:
- Apply: Can the model-generated patch be applied cleanly?
- Correctness: Does the patch maintain functional integrity?
- Performance: Does the patch lead to measurable runtime improvements?
Reporting these metrics independently allows a nuanced view of the trade-offs between producing patches that apply cleanly, preserving functional correctness, and achieving real performance gains, as sketched below.
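As a toy illustration of that independent reporting, an evaluation record could be assembled as follows. Gating the performance tier on the first two tiers is an assumption about how the tiers relate, not a statement of the official scoring code.

```python
def evaluate_patch(patch_applies: bool, tests_pass: bool, runtime_gain: float):
    """Report the three tiers independently rather than as a single score.

    patch_applies : did `git apply` (or an equivalent) succeed cleanly?
    tests_pass    : do the instance's unit tests still pass after the patch?
    runtime_gain  : fractional speedup measured in the Docker environment
                    (only counted when the first two tiers hold).
    """
    return {
        "apply": patch_applies,
        "correctness": patch_applies and tests_pass,
        "performance": runtime_gain if (patch_applies and tests_pass) else 0.0,
    }

# Example: a patch that applies, passes tests, and gives an ~8% runtime reduction
print(evaluate_patch(True, True, 0.08))
```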
Experimental Results
The benchmark has been run against several leading LLMs and setups, yielding the following performance gains:
- Claude-4-opus (Oracle): 1.28%
- GPT-4o (Oracle): 0.60%
- Gemini-2.5-Pro (Oracle): 1.48%
- Claude-3.7 (Agentless, Realistic): 0.41%
- Claude-3.7 (OpenHands, Realistic): 2.26%
- Expert (Human Patch): 10.85%
These results highlight a significant gap between LLM performance and human expertise, with even the best LLM configurations falling short of expert-level optimization.
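The headline percentages above are averages over benchmark instances; the paper defines the exact formula. As a rough, assumed illustration, a per-instance relative speedup could be averaged like this:

```python
def average_performance_gain(runtime_pairs):
    """Average relative runtime reduction over all instances, counting
    failed or unimproved patches as zero gain (an assumed convention)."""
    gains = []
    for before, after in runtime_pairs:
        gains.append(max(0.0, (before - after) / before))
    return 100.0 * sum(gains) / len(gains)

# Three instances: two modest speedups and one failed optimization
print(average_performance_gain([(2.0, 1.8), (5.0, 4.4), (1.0, 1.0)]))  # ≈ 7.3%
```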
Key Observations
Several important insights emerged from the SWE-Perf evaluations:
- Agent-based frameworks, such as OpenHands, are more effective for complex, multi-step optimizations compared to direct model prompts.
- LLMs struggle with broader optimization scopes, especially as the number of target functions increases.
- Human expert patches continue to outperform LLMs on instances with longer runtimes, indicating a limitation in how LLM optimizations scale.
- LLMs tend to focus on low-level code structures, while human experts prioritize high-level semantic abstractions for performance tuning.
Conclusion
SWE-Perf marks a significant advancement in the evaluation of LLMs for performance optimization in software engineering. By highlighting the existing capability gap between AI models and human experts, it sets a foundation for future research aimed at enhancing repository-scale performance tuning. As LLMs continue to develop, benchmarks like SWE-Perf will be crucial in guiding their evolution toward practical, production-ready software enhancements.
FAQ
- What is SWE-Perf? SWE-Perf is the first benchmark designed to evaluate the performance optimization capabilities of large language models at the repository level.
- Why is repository-level optimization important? Repository-level optimization considers the complexities of real-world codebases, which are often large and interdependent, requiring a broader understanding than isolated function-level optimizations.
- How was the SWE-Perf dataset created? The dataset was constructed from over 100,000 pull requests across high-profile GitHub repositories, including curated instances of performance improvements and expert-authored patches.
- What are the evaluation metrics used in SWE-Perf? The evaluation metrics include the ability to apply patches, correctness of the patches, and measurable performance improvements.
- What did the experimental results reveal? The results showed that even the best-performing LLMs significantly lag behind human experts in performance optimization capabilities.