Stream large language model responses in Amazon SageMaker JumpStart

We are excited to announce that Amazon SageMaker JumpStart now supports streaming of large language model (LLM) inference responses. This new feature allows you to see the model response output as it is being generated, providing a perception of low latency to the end-user. This can greatly improve the user experience of your applications.

Supported LLMs for Streaming

SageMaker JumpStart currently supports streaming for the following LLMs:

  • Mistral AI 7B, Mistral AI 7B Instruct
  • Falcon 180B, Falcon 180B Chat
  • Falcon 40B, Falcon 40B Instruct
  • Falcon 7B, Falcon 7B Instruct
  • Rinna Japanese GPT NeoX 4B Instruction PPO
  • Rinna Japanese GPT NeoX 3.6B Instruction PPO

To stay updated on the list of models that support streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.

Foundation Models in SageMaker

SageMaker JumpStart provides access to a range of foundation models from popular model hubs such as Hugging Face, PyTorch Hub, and TensorFlow Hub. These models contain billions of parameters and can be adapted to a variety of use cases, such as text summarization, digital art generation, and language translation. By starting from pre-trained foundation models, you can save the time and resources required to train models from scratch.

Within SageMaker JumpStart, you can browse foundation models from different providers and review their characteristics and usage terms. You can also try these models using a test UI widget. When you need to use a foundation model at scale, you can leverage prebuilt notebooks from model providers. Because the models are hosted and deployed on AWS, your data, whether used for evaluating the model or using it at scale, is not shared with third parties.

Token Streaming

Token streaming allows the inference response to be returned incrementally as it is being generated by the model. This means you can start seeing the output immediately without waiting for the complete response. Streaming can significantly improve the perceived latency for the end-user, even though the overall end-to-end latency remains the same.

To use token streaming in SageMaker JumpStart, choose a model that is served with the Hugging Face LLM Text Generation Inference (TGI) Deep Learning Container (DLC).
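
For context, the TGI container returns the stream as server-sent events, where each event carries one generated token as a JSON object. The exact fields depend on the container version, but an event looks roughly like this (values are illustrative):

data:{"token": {"id": 4536, "text": " website", "logprob": -0.2, "special": false}}

The TokenIterator shown later in this post parses these events to extract the token text.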

Solution Overview

In this post, we will demonstrate the streaming capability of SageMaker JumpStart using the Falcon 7B Instruct model.

You can find other models in SageMaker JumpStart that support streaming using the following code:

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And

filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)
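
Running this snippet prints the IDs of every JumpStart model served with the Hugging Face LLM framework; the list should include entries such as huggingface-llm-falcon-7b-instruct-bf16, the model we deploy next.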

Before running the notebook, make sure to run the necessary setup commands:

%pip install --upgrade sagemaker --quiet

To deploy the model, use SageMaker JumpStart and the following code:

from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = my_model.deploy()
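
By default, deploy() provisions the model on its default instance type. The call also accepts standard SageMaker deployment arguments if you need more control; the following is a minimal sketch, and the instance type shown is illustrative, so verify it against the model's supported instances and your account quotas:

# Illustrative overrides; deploy() also works with no arguments.
predictor = my_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)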

To query the endpoint and stream the response, construct a payload with the “stream” parameter set to True:

payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True
}
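
For comparison, if you omit "stream" (or set it to False), the endpoint returns the entire generation in a single response. A minimal sketch using the predictor returned by deploy():

# Non-streaming request: the full response arrives at once.
full_response = predictor.predict({
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
})
print(full_response)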

Create a TokenIterator to parse the streaming response:

import io
import json

class TokenIterator:
    """Iterate over the tokens in a SageMaker response stream.

    The TGI container emits server-sent events of the form
    "data:{json}\n\n". Raw bytes are buffered until a complete
    line is available, then the token text is extracted.
    """

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # Advance past this line plus the blank line that
                # separates server-sent events.
                self.read_pos += len(line) + 1
                full_line = line[:-1].decode("utf-8")
                # Drop the "data:" prefix and parse the JSON event.
                line_data = json.loads(full_line[len("data:"):])
                return line_data["token"]["text"]
            # No complete line buffered yet; append the next chunk.
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
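
The class buffers raw bytes instead of decoding each chunk directly because the response stream does not guarantee that a PayloadPart ends on an event boundary: a single JSON line can be split across chunks, so the iterator accumulates bytes until a complete newline-terminated line is available.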

Invoke the endpoint and enable streaming using the TokenIterator:

import boto3

client = boto3.client("runtime.sagemaker")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for token in TokenIterator(response["Body"]):
    print(token, end="")
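
If you also need the complete text after streaming finishes, you can accumulate the tokens as they arrive; a minimal variant of the loop above:

# Print each token as it arrives and keep the full generation.
generated_text = ""
for token in TokenIterator(response["Body"]):
    print(token, end="")
    generated_text += token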

Remember to clean up your deployed model and endpoint when you’re done:

predictor.delete_model()
predictor.delete_endpoint()

In conclusion, the streaming capability in SageMaker JumpStart lets you build applications with low perceived latency and a better user experience. Explore the available foundation models and use token streaming to enhance your AI solutions.
