Boost inference performance for LLMs with new Amazon SageMaker containers

Amazon SageMaker has released version 0.25.0 of its Large Model Inference (LMI) Deep Learning Containers (DLCs) with support for NVIDIA’s TensorRT-LLM library. The upgrade improves the performance and efficiency of large language models (LLMs) on SageMaker, adding continuous batching, efficient inference collective operations, and the latest quantization techniques. Benchmarks show reduced latency and increased throughput compared to the previous version, and LLMs can be deployed with no code changes. The release ships two containers: a DeepSpeed container that bundles the LMI Distributed Inference Library, and a TensorRT-LLM container for accelerated LLM inference.

Amazon SageMaker has launched a new version (0.25.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs) with added support for NVIDIA’s TensorRT-LLM Library. This update provides you with state-of-the-art tools to optimize large language models (LLMs) on SageMaker and achieve significant price-performance benefits.

The latest LMI DLCs reduce latency by 33% on average and improve throughput by 60% on average for Llama2-70B, Falcon-40B, and CodeLlama-34B models compared to the previous version.

New Features with SageMaker LMI DLCs

SageMaker LMI now supports TensorRT-LLM: SageMaker now offers NVIDIA’s TensorRT-LLM as part of the latest LMI DLC release. This enables powerful optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs. TensorRT-LLM significantly improves inference speed and supports deployments ranging from single-GPU to multi-GPU configurations.
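
With DJL Serving, which the LMI DLCs use under the hood, these optimizations are driven by a serving.properties file packaged alongside the model. The following is a minimal sketch, not a verified configuration: the property names follow documented LMI conventions, but the model ID and values are placeholders to adapt to your own deployment.

    # serving.properties -- illustrative values for the TensorRT-LLM container
    engine=MPI
    option.model_id=meta-llama/Llama-2-70b-hf
    option.tensor_parallel_degree=8
    option.max_rolling_batch_size=64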

Efficient inference collective operations: SageMaker introduces a new collective operation that speeds up communication between GPUs in LLM deployments. This reduces latency and increases throughput with the latest LMI DLCs compared to previous versions.

Quantization support: SageMaker LMI DLCs now support the latest quantization techniques, including GPTQ, AWQ, and SmoothQuant. These techniques optimize model weights, improve inference speed, and reduce memory footprint and computational cost while maintaining accuracy.
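
Assuming the option.quantize switch that LMI documents for these containers, enabling one of the techniques is a one-line addition to the same serving.properties file. The values below are a sketch under that assumption, not a verified setup:

    # serving.properties -- hypothetical SmoothQuant configuration
    engine=MPI
    option.model_id=meta-llama/Llama-2-13b-hf
    option.tensor_parallel_degree=4
    option.quantize=smoothquant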

Using SageMaker LMI DLCs

You can deploy your LLMs on SageMaker using the new LMI DLCs 0.25.0 without any changes to your code. SageMaker LMI DLCs use DJL Serving to serve your model for inference: you create a configuration file (serving.properties, as sketched above) specifying settings such as the model parallelization degree and the inference optimization library to use.
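
As a minimal sketch of the deployment flow with the SageMaker Python SDK: the image URI, S3 path, endpoint name, and instance type below are placeholders rather than verified values, so substitute the official 0.25.0 LMI image for your region from the AWS Deep Learning Containers listing.

    import sagemaker
    from sagemaker.model import Model
    from sagemaker import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

    # Placeholder URI: replace with the real 0.25.0 LMI image for your region.
    image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:0.25.0-tensorrtllm"

    model = Model(
        image_uri=image_uri,
        # Archive containing serving.properties (and optionally model weights).
        model_data="s3://my-bucket/llama2-70b/model.tar.gz",
        role=role,
    )

    model.deploy(
        initial_instance_count=1,
        instance_type="ml.p4d.24xlarge",  # multi-GPU instance for tensor parallelism
        endpoint_name="lmi-trtllm-demo",
    )

    predictor = Predictor(
        endpoint_name="lmi-trtllm-demo",
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
    print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))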

Performance Benchmarking Results

Performance benchmarks show significant improvements with the latest SageMaker LMI DLCs over previous versions. At a concurrency of 16, for example, latency dropped by 28-36% and throughput rose by 44-77%.

Recommended Configuration and Container

SageMaker provides two containers: 0.25.0-deepspeed and 0.25.0-tensorrtllm. The DeepSpeed container ships DeepSpeed together with the LMI Distributed Inference Library, while the TensorRT-LLM container includes NVIDIA’s TensorRT-LLM library. Both come with optimized deployment configurations for hosting LLMs.
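
The SageMaker Python SDK can resolve DJL-based LMI image URIs through image_uris.retrieve. The framework key "djl-deepspeed" is a registered value in the SDK; "djl-tensorrtllm" is assumed here to follow the same pattern, so verify it against your SDK version or copy the URI directly from the AWS DLC listing:

    from sagemaker import image_uris

    # "djl-deepspeed" is a known framework key in the SageMaker Python SDK.
    deepspeed_uri = image_uris.retrieve(
        framework="djl-deepspeed", version="0.25.0", region="us-east-1"
    )

    # Assumption: the TensorRT-LLM container uses an analogous key; confirm
    # before relying on it in production.
    trtllm_uri = image_uris.retrieve(
        framework="djl-tensorrtllm", version="0.25.0", region="us-east-1"
    )

    print(deepspeed_uri)
    print(trtllm_uri)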

For more details on using SageMaker LMI DLCs and to explore practical AI solutions, visit itinai.com. Discover how AI can redefine your sales processes and customer engagement with the AI Sales Bot from itinai.com/aisalesbot.
