Amazon SageMaker has released a new version (0.25.0) of its Large Model Inference (LMI) Deep Learning Containers (DLCs) with support for NVIDIA’s TensorRT-LLM library. The upgrade improves performance and efficiency for large language models (LLMs) on SageMaker, adding features such as continuous batching, efficient inference collective operations, and quantization techniques. Benchmarks show reduced latency and increased throughput compared to the previous version, and deploying LLMs with the LMI DLCs requires no code changes. The release includes two containers: one with DeepSpeed and the LMI Distributed Inference Library, and one with TensorRT-LLM for accelerated LLM inference.
Boost Inference Performance for Large Language Models with Amazon SageMaker
Amazon SageMaker has launched a new version (0.25.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs) with added support for NVIDIA’s TensorRT-LLM Library. This update provides you with state-of-the-art tools to optimize large language models (LLMs) on SageMaker and achieve significant price-performance benefits.
The latest LMI DLCs reduce latency by 33% on average and improve throughput by 60% on average for Llama2-70B, Falcon-40B, and CodeLlama-34B models compared to the previous version.
New Features with SageMaker LMI DLCs
SageMaker LMI now supports TensorRT-LLM: SageMaker now offers NVIDIA’s TensorRT-LLM as part of the latest LMI DLC release. This enables powerful optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs. TensorRT-LLM significantly improves inference speed and supports deployments ranging from single-GPU to multi-GPU configurations.
Efficient inference collective operations: SageMaker introduces a new collective operation that speeds up communication between GPUs in LLM deployments. This reduces latency and increases throughput with the latest LMI DLCs compared to previous versions.
Quantization support: SageMaker LMI DLCs now support the latest quantization techniques, including GPTQ, AWQ, and SmoothQuant. These techniques optimize model weights, improve inference speed, and reduce memory footprint and computational cost while maintaining accuracy.
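As a rough illustration, the sketch below writes a DJL Serving configuration that requests the TensorRT-LLM backend with continuous batching and SmoothQuant quantization. The engine, model ID, and option names (for example option.rolling_batch and option.quantize) are assumptions based on DJL Serving conventions and are not taken from the announcement; check the LMI documentation for your container version before relying on them.

```python
# Minimal sketch (assumptions): a DJL Serving configuration asking the LMI
# TensorRT-LLM backend for continuous batching and SmoothQuant quantization.
# Option names and values are illustrative, not confirmed by the announcement.
serving_properties = """\
engine=MPI
option.model_id=meta-llama/Llama-2-70b-hf
option.tensor_parallel_degree=8
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
option.quantize=smoothquant
"""

# Write the configuration file that the LMI container reads at startup.
with open("serving.properties", "w") as f:
    f.write(serving_properties)
```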
Using SageMaker LMI DLCs
You can deploy your LLMs on SageMaker using the new LMI DLCs 0.25.0 without any changes to your code. SageMaker LMI DLCs use DJL Serving to serve your model for inference. You simply create a configuration file (such as serving.properties) that specifies settings like the degree of model parallelization and the inference optimization library to use.
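For context, here is a minimal, hypothetical deployment sketch using the SageMaker Python SDK. The framework identifier ("djl-tensorrtllm"), the OPTION_-prefixed environment variables, the model ID, and the instance type are assumptions for illustration rather than details from the announcement; verify them against the LMI documentation.

```python
# Hypothetical deployment sketch for an LMI 0.25.0 (TensorRT-LLM) endpoint.
# Framework name, option names, model ID, and instance type are assumptions.
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# DJL Serving can also pick up serving.properties-style options from
# environment variables using an OPTION_ prefix (assumed convention).
env = {
    "OPTION_MODEL_ID": "meta-llama/Llama-2-13b-hf",   # assumed Hugging Face model ID
    "OPTION_ROLLING_BATCH": "trtllm",                 # continuous batching via TensorRT-LLM
    "OPTION_TENSOR_PARALLEL_DEGREE": "4",             # shard the model across 4 GPUs
}

# Retrieve the 0.25.0 TensorRT-LLM LMI container image (framework name assumed).
image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=session.boto_session.region_name,
    version="0.25.0",
)

model = Model(image_uri=image_uri, env=env, role=role, sagemaker_session=session)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",   # 4-GPU instance to match the parallel degree
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Invoke the endpoint; the payload shape follows the LMI default handler.
print(predictor.predict({"inputs": "What is Amazon SageMaker?",
                         "parameters": {"max_new_tokens": 128}}))
```

In practice you would typically package a serving.properties file (as sketched earlier) with your model artifact instead of, or in addition to, passing environment variables.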
Performance Benchmarking Results
Performance benchmarks show significant improvements with the latest SageMaker LMI DLCs compared to previous versions. For example, at a concurrency level of 16, latency was reduced by 28-36% and throughput increased by 44-77%.
Recommended Configuration and Container
SageMaker provides two containers: 0.25.0-deepspeed and 0.25.0-tensorrtllm. The DeepSpeed container includes DeepSpeed and the LMI Distributed Inference Library, while the TensorRT-LLM container includes NVIDIA’s TensorRT-LLM library. Both offer optimized deployment configurations for hosting LLMs.
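As an illustrative example, either container image can be looked up through the SageMaker SDK by framework name; the identifiers and region below are assumptions and should be confirmed against the SDK's image registry.

```python
# Hypothetical sketch: look up either 0.25.0 LMI container by framework name.
# Framework identifiers are assumptions, not confirmed by the announcement.
from sagemaker import image_uris

deepspeed_image = image_uris.retrieve(framework="djl-deepspeed", region="us-east-1", version="0.25.0")
trtllm_image = image_uris.retrieve(framework="djl-tensorrtllm", region="us-east-1", version="0.25.0")
print(deepspeed_image, trtllm_image, sep="\n")
```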
For more details on using SageMaker LMI DLCs and to explore practical AI solutions, visit itinai.com. Discover how AI can redefine your sales processes and customer engagement with the AI Sales Bot from itinai.com/aisalesbot.