Stream Large Language Model Responses with Amazon SageMaker JumpStart
We are excited to announce that Amazon SageMaker JumpStart now supports streaming of large language model (LLM) inference responses. This new feature allows you to see the model response output as it is being generated, providing a perception of low latency to the end-user. This can greatly improve the user experience of your applications.
Supported LLMs for Streaming
SageMaker JumpStart currently supports streaming for the following LLMs:
- Mistral AI 7B, Mistral AI 7B Instruct
- Falcon 180B, Falcon 180B Chat
- Falcon 40B, Falcon 40B Instruct
- Falcon 7B, Falcon 7B Instruct
- Rinna Japanese GPT NeoX 4B Instruction PPO
- Rinna Japanese GPT NeoX 3.6B Instruction PPO
For an up-to-date list of models that support streaming in SageMaker JumpStart, search for “huggingface-llm” in the Built-in Algorithms with pre-trained Model Table.
Foundation Models in SageMaker
SageMaker JumpStart provides access to a range of foundation models from popular model hubs such as Hugging Face, PyTorch Hub, and TensorFlow Hub. These models contain billions of parameters and can be adapted to a wide range of use cases, such as text summarization, digital art generation, and language translation. By starting from pre-trained foundation models, you can save significant time and resources compared to training models from scratch.
Within SageMaker JumpStart, you can find foundation models from different providers and easily review their characteristics and usage terms. You can also try these models using a test UI widget. When you need to use a foundation model at scale, you can leverage prebuilt notebooks from model providers. Hosting and deployment of these models on AWS ensures the security and privacy of your data.
Token Streaming
Token streaming allows the inference response to be returned incrementally as it is being generated by the model. This means you can start seeing the output immediately without waiting for the complete response. Streaming can significantly improve the perceived latency for the end-user, even though the overall end-to-end latency remains the same.
To use token streaming in SageMaker JumpStart, deploy a model that uses the Hugging Face LLM Text Generation Inference (TGI) Deep Learning Container (DLC).
Solution Overview
In this post, we will demonstrate the streaming capability of SageMaker JumpStart using the Falcon 7B Instruct model.
You can find other models in SageMaker JumpStart that support streaming using the following code:
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And
filter_value = And("task == llm", "framework == huggingface")
model_ids = list_jumpstart_models(filter=filter_value)
print(model_ids)
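The call returns a list of JumpStart model IDs. The exact contents depend on your SageMaker SDK version and which models are currently available; as an illustration only, the list includes IDs such as the one deployed in the next step:
# Illustrative, truncated output (actual entries vary by SDK version):
# [..., 'huggingface-llm-falcon-7b-instruct-bf16', ...]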
Before running the notebook, make sure to run the necessary setup commands:
%pip install --upgrade sagemaker --quiet
To deploy the model, use SageMaker JumpStart and the following code:
from sagemaker.jumpstart.model import JumpStartModel
my_model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = my_model.deploy()
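Before moving on to streaming, you can optionally send a standard (non-streaming) request to verify the endpoint. This is a minimal sketch; the exact response shape depends on the container version, but TGI-backed models typically return a list with a generated_text field:
# Optional sanity check with a regular, non-streaming request
response = predictor.predict({
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 64},
})
print(response)  # typically [{"generated_text": "..."}]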
To query the endpoint and stream the response, construct a payload with the “stream” parameter set to True:
payload = {
    "inputs": "How do I build a website?",
    "parameters": {"max_new_tokens": 256},
    "stream": True,
}
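With “stream” set to True, the TGI container returns the response as a stream of server-sent-event-style lines instead of a single JSON document, with each line carrying one generated token. The line below is illustrative of the general shape, not an exact capture:
# Illustrative shape of one streamed event line (fields abridged):
# data:{"token": {"id": 123, "text": " Hello", "logprob": -0.1, "special": false}, ...}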
Create a TokenIterator to parse the streaming response:
import io
import json

class TokenIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            # Try to read one complete line from the buffered bytes
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete "data:{...}" line is available; advance past it
                # and the newline that separates events
                self.read_pos += len(line) + 1
                full_line = line[:-1].decode("utf-8")
                line_data = json.loads(full_line.lstrip("data:").rstrip("/n"))
                return line_data["token"]["text"]
            # No complete line yet: append the next chunk of bytes from the
            # event stream, since a line may be split across PayloadPart chunks
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
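The buffering in TokenIterator is needed because the response stream delivers raw PayloadPart chunks whose boundaries do not necessarily align with complete “data:” lines. The iterator accumulates bytes until a full line (one token event) is available, then parses it and returns the token text.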
Invoke the endpoint and enable streaming using the TokenIterator:
import boto3

client = boto3.client("runtime.sagemaker")
response = client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for token in TokenIterator(response["Body"]):
    print(token, end="")
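If you plan to reuse this pattern in an application, it can help to wrap the pieces above in a small generator. The helper name stream_tokens below is hypothetical (it is not part of the SageMaker SDK); this is a minimal sketch that assumes the same payload format and the TokenIterator defined earlier:
# Hypothetical convenience wrapper around invoke_endpoint_with_response_stream
def stream_tokens(endpoint_name, prompt, max_new_tokens=256):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
        "stream": True,
    }
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    for token in TokenIterator(response["Body"]):
        yield token

# Usage: print tokens as they arrive
for token in stream_tokens(predictor.endpoint_name, "How do I build a website?"):
    print(token, end="")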
Remember to clean up your deployed model and endpoint when you’re done:
predictor.delete_model()
predictor.delete_endpoint()
In conclusion, the streaming capability in SageMaker JumpStart lets you build applications that display model output as it is generated, giving users lower perceived latency and a better experience. Explore the available foundation models and use token streaming to enhance your AI solutions.