Run Mixtral-8x7B on Consumer Hardware with Expert Offloading

Mixtral-8x7B is one of the best open large language models, but with 46.7B parameters it is too large to fit in the VRAM of a consumer GPU, even when quantized to 4-bit. Mixtral-offloading proposes an efficient solution that combines expert-aware quantization with expert offloading. Together, these techniques significantly reduce VRAM consumption while maintaining reasonable inference speed on consumer hardware.

Finding the right trade-off between memory usage and inference speed

Activation pattern of Mixtral-8x7B’s expert sub-networks — source (CC-BY)

While Mixtral-8x7B is one of the best open large language models (LLMs), it is also a huge model with 46.7B parameters. Even when quantized to 4-bit, it can't be fully loaded on a consumer GPU (e.g., an RTX 3090 with 24 GB of VRAM is not enough).

Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each.

Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., the CPU RAM, to free up some of the GPU VRAM. In practice, this offloading is complicated.

Which experts to activate is decided at inference time, for each input token and each layer of the model. Naively moving parts of the model to the CPU RAM, as Accelerate's device_map does, creates a communication bottleneck between the CPU and the GPU.
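For reference, a minimal sketch of this naive baseline with Hugging Face Transformers and Accelerate might look like the following; the model ID is the official Mixtral Instruct checkpoint, and the offload folder name is arbitrary:

```python
# Naive baseline: let Accelerate's device_map spread Mixtral's weights
# across the GPU, CPU RAM, and disk. It works, but every offloaded
# expert must be copied to the GPU on demand, hence the bottleneck.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",         # fill the GPU first, then spill to CPU RAM and disk
    offload_folder="offload",  # arbitrary folder for weights that don't fit in RAM
)
```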

Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.

In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.

Caching & Speculative Offloading

MoE language models often allocate distinct experts to sub-tasks, but not consistently across long token sequences. Some experts stay active for short sequences of 2–4 tokens, while others are used with intermittent gaps between activations.

To capitalize on this pattern, the authors of mixtral-offloading suggest keeping active experts in GPU memory as a "cache" for future tokens. This ensures quick availability if the same experts are needed again. Since GPU memory limits how many experts can be stored, a simple Least Recently Used (LRU) cache is employed: each layer keeps its k most recently used experts on the GPU and evicts the least recently used one when a new expert must be loaded.
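The caching policy can be sketched in a few lines of framework-agnostic Python. This is only an illustration of the idea, not mixtral-offloading's actual implementation, and the load/unload callbacks are hypothetical:

```python
from collections import OrderedDict

class LayerExpertLRUCache:
    """One cache per MoE layer: keeps at most k experts of that layer on
    the GPU and evicts the least recently used one when a new expert
    has to be brought in."""

    def __init__(self, k, load_fn, unload_fn):
        self.k = k
        self.load_fn = load_fn      # hypothetical callback: copy an expert CPU -> GPU
        self.unload_fn = unload_fn  # hypothetical callback: move an expert back to CPU
        self.cache = OrderedDict()  # expert_idx -> weights resident on the GPU

    def get(self, expert_idx):
        if expert_idx in self.cache:
            self.cache.move_to_end(expert_idx)  # hit: mark as most recently used
            return self.cache[expert_idx]
        weights = self.load_fn(expert_idx)      # miss: slow CPU -> GPU copy
        self.cache[expert_idx] = weights
        if len(self.cache) > self.k:
            old_idx, old_weights = self.cache.popitem(last=False)  # evict the LRU expert
            self.unload_fn(old_idx, old_weights)
        return weights
```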

Despite its simplicity, the LRU cache strategy significantly speeds up inference for MoE models like Mixtral-8x7B.

However, while LRU caching improves the average expert loading time, a significant portion of inference time still involves waiting for the next expert to load. MoE offloading lacks effective overlap between expert loading and computation.

In standard (non-MoE) models, an efficient offloading schedule pre-loads the next layer while the previous one runs. This isn't directly possible for MoE models, as experts are selected just in time for computation: the system can't pre-fetch the next layer's experts until it knows which ones the router will pick. Despite the inability to reliably pre-fetch, the authors found that speculative loading can be used to guess the next experts while processing the previous layer, accelerating the next layer's inference when the guess is correct.
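A rough sketch of speculative loading, reusing a per-layer cache like the one sketched above; the router call and the prefetch method are illustrative assumptions, not the project's actual API:

```python
import torch

def speculative_prefetch(hidden_states, next_layer_router, next_layer_cache, top_k=2):
    """Apply the next layer's router to the current hidden states to
    guess which experts it will select, and start loading them while
    the current layer is still computing. If the guess is correct, the
    experts are already on the GPU when the next layer needs them."""
    with torch.no_grad():
        router_logits = next_layer_router(hidden_states)  # [n_tokens, n_experts]
        guessed = torch.topk(router_logits, k=top_k, dim=-1).indices.unique()
    for expert_idx in guessed.tolist():
        next_layer_cache.prefetch(expert_idx)  # hypothetical: non-blocking CPU -> GPU copy
```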

To sum up, an LRU cache and speculative loading save VRAM while keeping inference efficient by offloading the experts that are the least likely to be used.

Expert-Aware Aggressive Quantization

In addition to expert offloading, we need to quantize the model to make it run on consumer hardware. Naive 4-bit quantization with bitsandbytes’ NF4 reduces the size of the model to 23.5 GB. This is not enough if we assume that a consumer-grade GPU has at most 24 GB of VRAM.
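For comparison, naive NF4 quantization is typically configured like this with Transformers and bitsandbytes (the 23.5 GB figure above is the source's; the snippet only shows the loading configuration):

```python
# Naive 4-bit NF4 quantization of Mixtral with bitsandbytes.
# The ~23.5 GB of quantized weights still exceed what most consumer
# GPUs can hold once activations and the KV cache are accounted for.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```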

Previous studies have shown that the experts in MoE models can be quantized to lower precision without much impact on performance, but there are limits: the authors of mixtral-offloading mention in their technical report that they tried 1-bit quantization methods such as the one proposed by QMoE, but observed a significant drop in performance.

Instead, they applied mixed-precision quantization: the experts are quantized to lower precision, while the non-expert parameters are kept at 4-bit.
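Conceptually, the quantization plan looks something like the sketch below. The exact bit widths for the experts and the config format are assumptions; the project's technical report and code give the real settings:

```python
# Hypothetical sketch of an expert-aware mixed-precision plan: the
# expert FFNs, which hold most of the parameters, are pushed to a
# lower precision, while everything else stays at 4-bit.
quantization_plan = {
    "self_attn":                {"bits": 4},  # attention weights kept at 4-bit
    "block_sparse_moe.gate":    {"bits": 4},  # router kept at 4-bit
    "block_sparse_moe.experts": {"bits": 2},  # assumption: experts quantized more aggressively
}
```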

After applying quantization and expert offloading, inference is between 2 and 3 times faster than with the offloading implemented by Accelerate (device_map).

Running Mixtral-8x7B with 16 GB of GPU VRAM

For this tutorial, I used the T4 GPU of Google Colab, which is old and has only 15 GB of VRAM available. It's a good baseline configuration to test the generation speed with offloaded experts.
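Once the model is loaded, whichever offloading and quantization path produced it, a simple way to measure generation speed on the T4 is to time model.generate; the prompt and token count below are arbitrary, and model and tokenizer are assumed to be already in scope:

```python
import time
import torch

def benchmark_generation(model, tokenizer, prompt, max_new_tokens=128):
    """Measure decoding speed in tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    n_new = output.shape[1] - inputs["input_ids"].shape[1]
    print(f"{n_new} tokens in {elapsed:.1f}s -> {n_new / elapsed:.2f} tokens/s")

benchmark_generation(model, tokenizer, "Explain mixture-of-experts models in one paragraph.")
```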

mixtral-offloading is a young project, but it already works very well. It combines two ideas to significantly reduce memory usage while preserving inference speed: mixed-precision quantization and expert offloading.

Following the success of Mixtral-8x7B, I expect MoE models to become more popular in the future. Frameworks that optimize inference for consumer hardware, like mixtral-offloading, will be essential to make MoEs more accessible.

To support my work, consider subscribing to my newsletter:

The Kaitchup – AI on a Budget | Benjamin Marie | Substack

If you want to evolve your company with AI and stay competitive, techniques like running Mixtral-8x7B on consumer hardware with expert offloading can work to your advantage.

Discover how AI can redefine your way of work.

Identify Automation Opportunities: Locate key customer interaction points that can benefit from AI.

Define KPIs: Ensure your AI endeavors have measurable impacts on business outcomes.

Select an AI Solution: Choose tools that align with your needs and provide customization.

Implement Gradually: Start with a pilot, gather data, and expand AI usage judiciously.

For AI KPI management advice, connect with us at hello@itinai.com. And for continuous insights into leveraging AI, stay tuned on our Telegram t.me/itinainews or Twitter @itinaicom.

Spotlight on a Practical AI Solution:

Consider the AI Sales Bot from itinai.com/aisalesbot designed to automate customer engagement 24/7 and manage interactions across all customer journey stages.

Discover how AI can redefine your sales processes and customer engagement. Explore solutions at itinai.com.


List of Useful Links:

AI Products for Business or Try Custom Development

AI Sales Bot

Welcome AI Sales Bot, your 24/7 teammate! Engaging customers in natural language across all channels and learning from your materials, it's a step towards efficient, enriched customer interactions and sales.

AI Document Assistant

Unlock insights and drive decisions with our AI Insights Suite. Indexing your documents and data, it provides smart, AI-driven decision support, enhancing your productivity and decision-making.

AI Customer Support

Upgrade your support with our AI Assistant, reducing response times and personalizing interactions by analyzing documents and past engagements. Boost your team and customer satisfaction.

AI Scrum Bot

Enhance agile management with our AI Scrum Bot: it helps organize retrospectives, answers queries, and boosts collaboration and efficiency in your scrum processes.