Understanding MLflow for Evaluating Large Language Models
MLflow has emerged as a robust tool for managing the machine learning lifecycle, and its recent enhancements now allow for the evaluation of Large Language Models (LLMs). This guide will walk you through the process of using MLflow to evaluate the performance of Google’s Gemini model on factual prompts, detailing each step along the way.
Identifying the Audience
This article targets data scientists, machine learning engineers, and business analysts who need to evaluate LLM outputs. These professionals often face challenges such as:
- Inconsistent assessment of model performance.
- Lack of established methodologies for evaluating LLM outputs.
- Integration difficulties with various APIs and tools in their workflows.
They want practical, hands-on tutorials with clear instructions and relevant metrics that help them understand model behavior and improve deployment outcomes.
Setting Up Your Environment
To get started, you will need access to both the OpenAI and Google Gemini APIs. Gemini generates the answers under evaluation, while MLflow's LLM-judged metrics (such as answer similarity) use an OpenAI model as the judge, so both API keys are essential:
- Get your OpenAI API key from the OpenAI API Keys page.
- Get your Google Gemini API key by following the Google Gemini API Docs.
Installing Required Libraries
Run the following command to install the necessary libraries:
pip install mlflow openai pandas google-genai
Setting Environment Variables
Next, you need to set your API keys as environment variables using the following code:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')
Preparing Your Evaluation Dataset
Now, let’s create a dataset containing factual prompts and their corresponding correct answers. This structured dataset will serve as a benchmark against which we can compare the responses generated by Gemini.
import pandas as pd

# Factual prompts paired with reference answers to benchmark Gemini against.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)
Fetching Responses from Gemini
We will define a function to send prompts to the Gemini model and retrieve the generated responses. Each response will be stored in a new column of our evaluation dataset.
from google import genai

# The google-genai client picks up GOOGLE_API_KEY from the environment.
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    """Send a single prompt to Gemini and return the plain-text response."""
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

# Generate a prediction for every prompt in the evaluation set.
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
Evaluating the Outputs with MLflow
We will evaluate the responses generated by Gemini using MLflow’s evaluation metrics. This process involves initiating an MLflow run and applying various metrics to assess the model’s performance.
import mlflow

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save the per-row results table for later inspection
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
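Because the tracking URI points at a local mlruns directory, you can also browse the run and its aggregated metrics in the MLflow UI by launching it from the same working directory:
mlflow ui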
Reviewing Evaluation Results
To analyze the evaluation results, load the saved CSV file into a DataFrame. This will allow you to inspect individual prompts, the generated responses, and their corresponding metric scores.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
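If you want to drill into the weakest answers, you can sort the table by the judge's per-row similarity score. Column names in the saved table vary by MLflow version (recent releases use names like "answer_similarity/v1/score"), so the short sketch below guards on what is actually present; check results.columns if nothing matches.
score_col = "answer_similarity/v1/score"  # assumed column name; verify against results.columns
if score_col in results.columns:
    # Show the rows the judge scored lowest to spot weak or incorrect answers.
    worst = results.sort_values(score_col).head(3)
    print(worst[[c for c in ("inputs", "predictions", score_col) if c in worst.columns]])
else:
    print("Available columns:", list(results.columns))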
Conclusion
Using MLflow for evaluating LLMs like Google’s Gemini model streamlines the assessment process, making it easier to track performance metrics. By following this guide, you can leverage MLflow’s capabilities to enhance your understanding of LLM outputs and improve your machine learning projects.
FAQs
- What is MLflow? MLflow is an open-source platform designed to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment.
- How do I get started with MLflow? You can start by installing MLflow and setting up your environment for your specific use case, such as LLM evaluation.
- What APIs do I need for evaluating LLMs? For this workflow you need a Google Gemini API key to generate the responses and an OpenAI API key, because MLflow's LLM-judged metrics (such as answer similarity) use an OpenAI model as the judge by default.
- Can I use MLflow for models other than LLMs? Yes, MLflow is versatile and can be used to manage a variety of machine learning models across different domains.
- What metrics can I evaluate with MLflow? MLflow supports various evaluation metrics, including answer similarity, exact match, latency, and token count.
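Beyond the built-ins listed above, MLflow also lets you define custom LLM-judged metrics. The snippet below is a minimal sketch assuming MLflow 2.x's mlflow.metrics.genai.make_genai_metric helper; the metric name, grading prompt, and judge model URI are illustrative choices rather than fixed requirements.
import mlflow

# Minimal custom LLM-judged metric (illustrative; adjust names and judge model to your setup).
conciseness = mlflow.metrics.genai.make_genai_metric(
    name="conciseness",
    definition="Measures how brief the answer is while remaining factually correct.",
    grading_prompt="Score from 1 to 5, where 5 means the answer is correct and as short as possible.",
    model="openai:/gpt-4",  # assumed judge model URI; use any model your OpenAI key can access
    greater_is_better=True,
)
# Pass it alongside the built-in metrics via extra_metrics in mlflow.evaluate().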