
Boost inference performance for LLMs with new Amazon SageMaker containers

Amazon SageMaker has released version 0.25.0 of its Large Model Inference (LMI) Deep Learning Containers (DLCs) with support for NVIDIA's TensorRT-LLM library. The upgrade improves the performance and efficiency of large language models (LLMs) on SageMaker, adding continuous batching, efficient inference collective operations, and quantization techniques. Benchmarks show reduced latency and increased throughput compared to the previous version, and deploying LLMs with the new LMI DLCs requires no code changes. The release includes two containers: a DeepSpeed container bundling the LMI Distributed Inference Library and a TensorRT-LLM container for accelerated LLM inference.



Amazon SageMaker has launched version 0.25.0 of its Large Model Inference (LMI) Deep Learning Containers (DLCs), adding support for NVIDIA's TensorRT-LLM library. This update gives you state-of-the-art tools for optimizing large language models (LLMs) on SageMaker and achieving significant price-performance benefits.

The latest LMI DLCs reduce latency by 33% on average and improve throughput by 60% on average for Llama2-70B, Falcon-40B, and CodeLlama-34B models compared to the previous version.

New Features with SageMaker LMI DLCs

SageMaker LMI now supports TensorRT-LLM: SageMaker now offers NVIDIA’s TensorRT-LLM as part of the latest LMI DLC release. This enables powerful optimizations like SmoothQuant, FP8, and continuous batching for LLMs when using NVIDIA GPUs. TensorRT-LLM significantly improves inference speed and supports deployments ranging from single-GPU to multi-GPU configurations.
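
To make this concrete, the following is a minimal sketch of a serving.properties file targeting the TensorRT-LLM container. The property names follow DJL Serving conventions, but the model ID and values here are illustrative assumptions rather than prescribed defaults:

    # serving.properties -- sketch for the TensorRT-LLM LMI container
    # engine=MPI selects the MPI-based backend used by the TensorRT-LLM container
    engine=MPI
    # Hypothetical model; any supported Hugging Face model ID or S3 path works here
    option.model_id=meta-llama/Llama-2-70b-hf
    # Shard the model across 8 GPUs (assumes a multi-GPU instance such as ml.p4d.24xlarge)
    option.tensor_parallel_degree=8
    # Enable continuous (rolling) batching with the TensorRT-LLM backend
    option.rolling_batch=trtllm
    option.max_rolling_batch_size=32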

Efficient inference collective operations: SageMaker introduces a new collective operation that speeds up communication between GPUs in LLM deployments. This reduces latency and increases throughput with the latest LMI DLCs compared to previous versions.

Quantization support: SageMaker LMI DLCs now support the latest quantization techniques, including GPTQ, AWQ, and SmoothQuant. These techniques optimize model weights, improve inference speed, and reduce memory footprint and computational cost while maintaining accuracy.
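
In practice, enabling one of these techniques is typically a one-line addition to the same configuration file. The sketch below assumes DJL Serving's option.quantize property; verify the exact accepted values against the container documentation:

    # serving.properties -- quantization sketch (illustrative values)
    engine=MPI
    option.model_id=meta-llama/Llama-2-70b-hf
    option.tensor_parallel_degree=8
    option.rolling_batch=trtllm
    # Assumed property for selecting a quantization technique
    # (e.g., smoothquant, awq, gptq)
    option.quantize=smoothquant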

Using SageMaker LMI DLCs

You can deploy your LLMs on SageMaker using the new LMI DLCs 0.25.0 without any changes to your code. SageMaker LMI DLCs use DJL Serving to serve your model for inference. You simply create a configuration file that specifies settings such as the degree of model parallelization and which inference optimization library to use, as in the sketch below.
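
As a minimal sketch of that workflow with the SageMaker Python SDK, assuming a TensorRT-LLM deployment (the ECR image URI, S3 path, and endpoint name are placeholders to substitute with your own values):

    import sagemaker
    from sagemaker.model import Model

    role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions
    session = sagemaker.Session()

    # Placeholder ECR URI -- look up the actual 0.25.0 LMI DLC for your region
    image_uri = (
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "djl-inference:0.25.0-tensorrtllm0.5.0-cu122"
    )

    model = Model(
        image_uri=image_uri,
        # Tarball containing serving.properties (and optionally model weights)
        model_data="s3://my-bucket/llama2-70b/code/model.tar.gz",
        role=role,
        sagemaker_session=session,
    )

    # Hypothetical endpoint name; a multi-GPU instance is assumed for a 70B model
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.p4d.24xlarge",
        endpoint_name="lmi-trtllm-llama2-70b",
    )

Once the endpoint is in service, you invoke it through the standard SageMaker runtime InvokeEndpoint API; no application-side changes are required beyond pointing at the new endpoint.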

Performance Benchmarking Results

Performance benchmarks show significant improvements with the latest SageMaker LMI DLCs compared to previous versions. For example, latency was reduced by 28-36% and throughput increased by 44-77% at a concurrency of 16.

Recommended Configuration and Container

SageMaker provides two containers: 0.25.0-deepspeed and 0.25.0-tensorrtllm. The DeepSpeed container includes DeepSpeed and the LMI Distributed Inference Library, while the TensorRT-LLM container includes NVIDIA's TensorRT-LLM library. Both ship with optimized deployment configurations for hosting LLMs.
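
For contrast with the TensorRT-LLM sketch above, a configuration targeting the DeepSpeed container could look like the following; the engine and rolling-batch values are assumptions based on the container's naming and should be checked against the LMI documentation:

    # serving.properties -- sketch for the DeepSpeed LMI container
    engine=DeepSpeed
    # Hypothetical model for illustration
    option.model_id=tiiuae/falcon-40b
    option.tensor_parallel_degree=4
    # Assumed value enabling continuous batching via the DeepSpeed backend
    option.rolling_batch=deepspeed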

For more details on using SageMaker LMI DLCs and to explore practical AI solutions, visit itinai.com. Discover how AI can redefine your sales processes and customer engagement with the AI Sales Bot from itinai.com/aisalesbot.


Vladimir Dyachkov, Ph.D.
Editor-in-Chief, itinai.com

I believe that AI is only as powerful as the human insight guiding it.
