Load Testing LLMs Using llmperf | Towards Data Science
Deploying a Large Language Model (LLM) is not necessarily the last step in producing an AI application. An often forgotten yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it can withstand your expected production traffic. Load testing, at a high level, is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.
In the past, we've discussed load testing traditional ML models using open source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles. While this works for more traditional APIs and ML models, it doesn't capture the full story for LLMs.
LLMs traditionally have a much lower RPS than traditional ML models due to their size and larger compute requirements. An RPS metric on its own also doesn't paint the most accurate picture, because requests can vary considerably depending on the input to the LLM. For example, you might have one query that asks for a summary of a large chunk of text and another that only requires a one-word response.
This is why tokens are seen as a more accurate representation of an LLM's performance. At a high level, a token is a chunk of text that the LLM processes whenever it handles your input. What exactly constitutes a token varies by LLM, but you can think of it as a word, a sequence of words, or individual characters.
In this article we'll explore how to generate token-based metrics so you can understand how your LLM performs from a serving/deployment perspective. After reading, you'll know how to set up a load-testing tool built specifically to benchmark LLMs, so you can evaluate different models, different deployment configurations, or a combination of both.
Let's get started! If you're more of a video-based learner, feel free to follow my corresponding video walkthrough here.
Note: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you're new to Amazon Bedrock, please refer to my getting started guide here. If you want to learn more about SageMaker JumpStart LLM deployments, please refer to the video here.
Disclaimer: I am a Machine Learning Architect at AWS and my opinions are my own.
Table of contents
- LLM-Specific Metrics
- Introduction to llmperf
- Applying llmperf to Amazon Bedrock
- Additional Resources & Conclusion
LLM-Specific Metrics
As we briefly discussed in the introduction, when it comes to LLM hosting, token-based metrics generally provide a much better representation of how your LLM responds to queries of different payload sizes or types (summarization vs. QnA).
Traditionally we have always tracked RPS and latency, and we will still see those here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:
- Time to First Token: This is the duration it takes for the first token to be generated. This is especially handy with streaming. For instance, when using ChatGPT, we start processing information as soon as the first piece of text (token) appears.
- Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of this as a more granular alternative to the requests per second we traditionally track.
These are the major metrics that we'll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters that also influence these metrics include the expected input and output token sizes. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks.
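To make these definitions concrete, here's a minimal, illustrative sketch (not part of any load testing tool) of how time to first token and output token throughput could be measured by hand around a single streaming call. It uses LiteLLM, which we'll introduce properly below; the measure_stream helper and the word-count token approximation are simplifications of my own rather than how a real benchmark counts tokens.

import time
from litellm import completion

def measure_stream(model_id: str, prompt: str) -> dict:
    # Stream a single response and time when generated text starts arriving.
    start = time.perf_counter()
    first_token_at = None
    approx_tokens = 0
    for chunk in completion(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_at is None:
            first_token_at = time.perf_counter()  # first token observed
        approx_tokens += len(delta.split())  # rough word-based stand-in for tokens
    total_s = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "approx_output_tokens_per_s": approx_tokens / total_s if total_s > 0 else None,
    }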
Now let's take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.
Introduction to llmperf
llmperf is built on top of Ray, a popular distributed computing Python framework. llmperf specifically leverages Ray to create distributed load tests where we can simulate real-time production-level traffic.
Note that any load testing tool can only generate your expected amount of traffic if the client machine it runs on has enough compute power to match that load. For instance, as you scale the concurrency or throughput you expect your model to handle, you'd also want to scale the client machine(s) running the load test.
Now, specifically within llmperf, there are a few exposed parameters that are tailored for LLM load testing, as we've been discussing:
- Model: The model provider and the hosted model that you're working with. For our use case it'll be Amazon Bedrock and Claude 3 Sonnet.
- LLM API: The API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, simplifying our setup, especially if we want to test different models hosted on different platforms.
- Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
- Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
- Concurrent Requests: The number of concurrent requests for the load test to simulate.
- Test Duration: You can control the duration of the test; this parameter is specified in seconds.
llmperf exposes all of these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Now let's take a look at how we can configure it specifically for Amazon Bedrock.
Applying llmperf to Amazon Bedrock
Setup
For this example, we'll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel on an ml.g5.12xlarge instance. Note that you want to select an instance with enough compute to generate the traffic load that you want to simulate. Also make sure your AWS credentials are set up so that llmperf can access the hosted model, whether it's on Bedrock or SageMaker.
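As a quick sketch of the environment setup (the exact steps may differ by llmperf version, so double-check the repository's README), you would clone the llmperf repository into your notebook instance and install it along with LiteLLM:

%%sh
# Assumed setup steps; verify against the llmperf README for your version.
git clone https://github.com/ray-project/llmperf.git
pip install -e llmperf/
pip install litellm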
LiteLLM Configuration
We first configure our LLM API structure of choice, which is LiteLLM. LiteLLM supports a variety of model providers; in this case, we configure the completion API to work with Amazon Bedrock:
import os
from litellm import completion

# AWS credentials so LiteLLM can call the model hosted on Bedrock
os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# Invoke Claude 3 Sonnet on Bedrock through LiteLLM's unified completion API
response = completion(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)
To work with Bedrock, we configure the model ID to point toward Claude 3 Sonnet and pass in our prompt. The neat part of LiteLLM is that the messages key has a consistent format across model providers.
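For example, here's a hypothetical side-by-side (the model identifiers below are placeholders; use ones you actually have access to) showing that switching providers is mostly a matter of changing the model string, while the messages payload stays the same:

# Hypothetical comparison: same call shape, different providers.
from litellm import completion  # already imported above

bedrock_response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Who is Roger Federer?"}],
)
openai_response = completion(
    model="gpt-4o",  # assumes OPENAI_API_KEY is set in your environment
    messages=[{"role": "user", "content": "Who is Roger Federer?"}],
)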
Once this executes successfully, we can configure llmperf specifically for Bedrock.
llmperf Bedrock Integration
To perform our load test with llmperf, we simply use the provided token_benchmark_ray.py script and pass in the parameters we covered earlier:
- Input token mean and standard deviation
- Output token mean and standard deviation
- Maximum number of test requests
- Test duration
- Concurrent requests
In this case we also specify our API format to be LiteLLM, and we can run the load test with a simple shell script:
%%sh
python llmperf/token_benchmark_ray.py \
--model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
--mean-input-tokens 1024 \
--stddev-input-tokens 200 \
--mean-output-tokens 1024 \
--stddev-output-tokens 200 \
--max-num-completed-requests 30 \
--num-concurrent-requests 1 \
--timeout 300 \
--llm-api litellm \
--results-dir bedrock-outputs
In this case we keep the concurrency low, but feel free to toggle this number depending on what you expect in production. Our test will run for 300 seconds, and post-duration you should see an output directory with two files: one containing statistics for each individual inference and one containing the averaged metrics across all requests for the duration of the test.
We can make this look a little neater by parsing the summary file with pandas:
import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Print summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}

print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
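Because the per-request records are already loaded into a DataFrame, we can also sketch out latency percentiles. Note that the exact column names (for example, ttft_s and end_to_end_latency_s below) depend on your llmperf version, so inspect the individual responses file first:

# Rough sketch: percentile view of per-request latencies, assuming these
# columns exist in your version's individual_responses output.
candidate_cols = ["ttft_s", "end_to_end_latency_s"]
latency_cols = [c for c in candidate_cols if c in df.columns]
if latency_cols:
    print(df[latency_cols].describe(percentiles=[0.5, 0.9, 0.99]))
else:
    print("Expected columns not found; available columns:", list(df.columns))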
The final load test results look like this:
As we can see, we get back the input parameters we configured along with the corresponding results: time to first token (s) and mean output throughput (tokens/s), which represents the average number of output tokens generated per second.
In a real-world use case, you might use llmperf across many different model providers and run tests on those platforms. With this tool you can holistically identify the right model and deployment stack for your use case at scale.
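As a hypothetical sketch of what that could look like, you might wrap the same benchmark script in a loop over the model IDs you want to compare, writing each run to its own results directory (the model IDs and directory naming below are placeholders):

import subprocess

# Placeholder model IDs; substitute the providers/models you actually want to benchmark.
models = [
    "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
]

for model_id in models:
    results_dir = "outputs-" + model_id.split("/")[-1].replace(":", "-").replace(".", "-")
    subprocess.run(
        [
            "python", "llmperf/token_benchmark_ray.py",
            "--model", model_id,
            "--mean-input-tokens", "1024",
            "--stddev-input-tokens", "200",
            "--mean-output-tokens", "1024",
            "--stddev-output-tokens", "200",
            "--max-num-completed-requests", "30",
            "--num-concurrent-requests", "1",
            "--timeout", "300",
            "--llm-api", "litellm",
            "--results-dir", results_dir,
        ],
        check=True,
    )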
Additional Resources & Conclusion
The entire code for this sample can be found in the associated GitHub repository. If you'd also like to work with SageMaker endpoints, you can find a Llama JumpStart deployment load testing sample here.
All in all, load testing and evaluation are both critical to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we'll cover not just the evaluation portion, but also how to create a holistic test that combines both components.
As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.