Run Mixtral-8x7B on Consumer Hardware with Expert Offloading

Mixtral-8x7B is one of the best open large language models, but with 46.7B parameters it is too large to fit in the VRAM of a consumer GPU, even when quantized to 4-bit. Mixtral-offloading proposes an efficient solution that combines expert-aware quantization with expert offloading. Together, these techniques significantly reduce VRAM consumption while maintaining reasonable inference speed on consumer hardware.

Finding the right trade-off between memory usage and inference speed

Activation pattern of Mixtral-8x7B’s expert sub-networks — source (CC-BY)

While Mixtral-8x7B is one of the best open large language models (LLMs), it is also a huge model with 46.7B parameters. Even when quantized to 4-bit, it can't be fully loaded on a consumer GPU (e.g., an RTX 3090 with 24 GB of VRAM is not enough).

Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each.

Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., the CPU RAM, to free up some of the GPU VRAM. In practice, this offloading is complicated.

Which experts to activate is decided at inference time, for each input token and each layer of the model. Naively moving parts of the model to the CPU RAM, as Accelerate's device_map does, creates a communication bottleneck between the CPU and the GPU.
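For reference, a minimal sketch of this naive baseline with Hugging Face Transformers and Accelerate might look like the following; the model ID is the official Mixtral Instruct checkpoint, and the offload folder name is arbitrary:

```python
# Naive baseline: let Accelerate's device_map spread Mixtral's weights
# across the GPU, CPU RAM, and disk. It works, but every offloaded
# expert must be copied to the GPU on demand, hence the bottleneck.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",         # fill the GPU first, then spill to CPU RAM and disk
    offload_folder="offload",  # arbitrary folder for weights that don't fit in RAM
)
```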

Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.

In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.

Caching & Speculative Offloading

MoE language models often allocate distinct experts to sub-tasks, but not consistently across long token sequences. Some experts stay active for short sequences of 2–4 tokens, while others are used with intermittent gaps between activations.

To capitalize on this pattern, the authors of mixtral-offloading suggest keeping active experts in GPU memory as a "cache" for future tokens. This ensures quick availability if the same experts are needed again. Since GPU memory limits how many experts can be stored, a simple Least Recently Used (LRU) cache is employed: each layer keeps its k most recently used experts on the GPU and evicts the least recently used one when a new expert must be loaded.
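The caching policy can be sketched in a few lines of framework-agnostic Python. This is only an illustration of the idea, not mixtral-offloading's actual implementation, and the load/unload callbacks are hypothetical:

```python
from collections import OrderedDict

class LayerExpertLRUCache:
    """One cache per MoE layer: keeps at most k experts of that layer on
    the GPU and evicts the least recently used one when a new expert
    has to be brought in."""

    def __init__(self, k, load_fn, unload_fn):
        self.k = k
        self.load_fn = load_fn      # hypothetical callback: copy an expert CPU -> GPU
        self.unload_fn = unload_fn  # hypothetical callback: move an expert back to CPU
        self.cache = OrderedDict()  # expert_idx -> weights resident on the GPU

    def get(self, expert_idx):
        if expert_idx in self.cache:
            self.cache.move_to_end(expert_idx)  # hit: mark as most recently used
            return self.cache[expert_idx]
        weights = self.load_fn(expert_idx)      # miss: slow CPU -> GPU copy
        self.cache[expert_idx] = weights
        if len(self.cache) > self.k:
            old_idx, old_weights = self.cache.popitem(last=False)  # evict the LRU expert
            self.unload_fn(old_idx, old_weights)
        return weights
```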

Despite its simplicity, the LRU cache strategy significantly speeds up inference for MoE models like Mixtral-8x7B.

However, while LRU caching improves the average expert loading time, a significant portion of inference time still involves waiting for the next expert to load. MoE offloading lacks effective overlap between expert loading and computation.

In standard (non-MoE) models, an efficient offloading schedule pre-loads the next layer while the previous one runs. This isn't directly possible for MoE models, as experts are selected just in time for computation: the system can't pre-fetch the next layer's experts until it knows which ones the router will pick. Despite the inability to reliably pre-fetch, the authors found that speculative loading can be used to guess the next experts while processing the previous layer, accelerating the next layer's inference when the guess is correct.
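A rough sketch of speculative loading, reusing a per-layer cache like the one sketched above; the router call and the prefetch method are illustrative assumptions, not the project's actual API:

```python
import torch

def speculative_prefetch(hidden_states, next_layer_router, next_layer_cache, top_k=2):
    """Apply the next layer's router to the current hidden states to
    guess which experts it will select, and start loading them while
    the current layer is still computing. If the guess is correct, the
    experts are already on the GPU when the next layer needs them."""
    with torch.no_grad():
        router_logits = next_layer_router(hidden_states)  # [n_tokens, n_experts]
        guessed = torch.topk(router_logits, k=top_k, dim=-1).indices.unique()
    for expert_idx in guessed.tolist():
        next_layer_cache.prefetch(expert_idx)  # hypothetical: non-blocking CPU -> GPU copy
```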

To sum up, an LRU cache and speculative loading save VRAM while keeping inference efficient by offloading the experts that are the least likely to be used.

Expert-Aware Aggressive Quantization

In addition to expert offloading, we need to quantize the model to make it run on consumer hardware. Naive 4-bit quantization with bitsandbytes’ NF4 reduces the size of the model to 23.5 GB. This is not enough if we assume that a consumer-grade GPU has at most 24 GB of VRAM.
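For comparison, naive NF4 quantization is typically configured like this with Transformers and bitsandbytes (the 23.5 GB figure above is the source's; the snippet only shows the loading configuration):

```python
# Naive 4-bit NF4 quantization of Mixtral with bitsandbytes.
# The ~23.5 GB of quantized weights still exceed what most consumer
# GPUs can hold once activations and the KV cache are accounted for.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```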

Previous studies have shown that the experts in MoE models can be quantized to lower precision without much impact on performance, but there are limits: the authors of mixtral-offloading mention in their technical report that they tried 1-bit quantization methods such as the one proposed by QMoE, but observed a significant drop in performance.

Instead, they applied mixed-precision quantization: the experts are quantized to lower precision, while the non-expert parameters are kept at 4-bit.
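Conceptually, the quantization plan looks something like the sketch below. The exact bit widths for the experts and the config format are assumptions; the project's technical report and code give the real settings:

```python
# Hypothetical sketch of an expert-aware mixed-precision plan: the
# expert FFNs, which hold most of the parameters, are pushed to a
# lower precision, while everything else stays at 4-bit.
quantization_plan = {
    "self_attn":                {"bits": 4},  # attention weights kept at 4-bit
    "block_sparse_moe.gate":    {"bits": 4},  # router kept at 4-bit
    "block_sparse_moe.experts": {"bits": 2},  # assumption: experts quantized more aggressively
}
```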

After applying quantization and expert offloading, inference is between 2 and 3 times faster than with the offloading implemented by Accelerate (device_map).

Running Mixtral-8x7B with 16 GB of GPU VRAM

For this tutorial, I used the T4 GPU of Google Colab, which is old and has only 15 GB of VRAM available. It's a good baseline configuration to test the generation speed with offloaded experts.
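Once the model is loaded, whichever offloading and quantization path produced it, a simple way to measure generation speed on the T4 is to time model.generate; the prompt and token count below are arbitrary, and model and tokenizer are assumed to be already in scope:

```python
import time
import torch

def benchmark_generation(model, tokenizer, prompt, max_new_tokens=128):
    """Measure decoding speed in tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    n_new = output.shape[1] - inputs["input_ids"].shape[1]
    print(f"{n_new} tokens in {elapsed:.1f}s -> {n_new / elapsed:.2f} tokens/s")

benchmark_generation(model, tokenizer, "Explain mixture-of-experts models in one paragraph.")
```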

mixtral-offloading is a young project, but it already works very well. It combines two ideas to significantly reduce memory usage while preserving inference speed: mixed-precision quantization and expert offloading.

Following the success of Mixtral-8x7B, I expect MoE models to become more popular in the future. Frameworks that optimize inference for consumer hardware, like mixtral-offloading, will be essential to make MoEs more accessible.

To support my work, consider subscribing to my newsletter:

The Kaitchup – AI on a Budget | Benjamin Marie | Substack

If you want to evolve your company with AI and stay competitive, techniques like running Mixtral-8x7B on consumer hardware with expert offloading can work to your advantage.

Discover how AI can redefine your way of work.

Identify Automation Opportunities: Locate key customer interaction points that can benefit from AI.

Define KPIs: Ensure your AI endeavors have measurable impacts on business outcomes.

Select an AI Solution: Choose tools that align with your needs and provide customization.

Implement Gradually: Start with a pilot, gather data, and expand AI usage judiciously.

For AI KPI management advice, connect with us at hello@itinai.com. And for continuous insights into leveraging AI, stay tuned on our Telegram t.me/itinainews or Twitter @itinaicom.

Spotlight on a Practical AI Solution:

Consider the AI Sales Bot from itinai.com/aisalesbot designed to automate customer engagement 24/7 and manage interactions across all customer journey stages.

Discover how AI can redefine your sales processes and customer engagement. Explore solutions at itinai.com.


List of Useful Links:

AI Products for Business or Try Custom Development

AI Sales Bot

Welcome AI Sales Bot, your 24/7 teammate! Engaging customers in natural language across all channels and learning from your materials, it's a step towards efficient, enriched customer interactions and sales.

AI Document Assistant

Unlock insights and drive decisions with our AI Insights Suite. Indexing your documents and data, it provides smart, AI-driven decision support, enhancing your productivity and decision-making.

AI Customer Support

Upgrade your support with our AI Assistant, reducing response times and personalizing interactions by analyzing documents and past engagements. Boost your team and customer satisfaction.

AI Scrum Bot

Enhance agile management with our AI Scrum Bot: it helps organize retrospectives, answers queries, and boosts collaboration and efficiency in your scrum processes.