
Getting Started with MLflow: A Practical Guide for Evaluating Large Language Models

Understanding MLflow for Evaluating Large Language Models

MLflow has emerged as a robust tool for managing the machine learning lifecycle, and its recent enhancements now allow for the evaluation of Large Language Models (LLMs). This guide will walk you through the process of using MLflow to evaluate the performance of Google’s Gemini model on factual prompts, detailing each step along the way.

Identifying the Audience

This article targets data scientists, machine learning engineers, and business analysts keen on LLM evaluations. These professionals often face challenges such as:

  • Inconsistent assessment of model performance.
  • Lack of established methodologies for evaluating LLM outputs.
  • Integration difficulties with various APIs and tools in their workflows.

They seek practical, hands-on tutorials that offer clear instructions and relevant metrics, ultimately to enhance their understanding and improve deployment outcomes.

Setting Up Your Environment

To get started, you will need access to both the OpenAI and Google Gemini APIs. MLflow's LLM-judged metrics (such as answer similarity) use an OpenAI model to score the responses generated by Gemini, so both API keys are required.

Installing Required Libraries

Run the following command to install the necessary libraries:

pip install mlflow openai pandas google-genai

Setting Environment Variables

Next, you need to set your API keys as environment variables using the following code:

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')

Preparing Your Evaluation Dataset

Now, let’s create a dataset containing factual prompts and their corresponding correct answers. This structured dataset will serve as a benchmark against which we can compare the responses generated by Gemini.

import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)

Fetching Responses from Gemini

We will define a function to send prompts to the Gemini model and retrieve the generated responses. Each response will be stored in a new column of our evaluation dataset.

from google import genai

# The client picks up GOOGLE_API_KEY from the environment set earlier
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    """Send a prompt to Gemini and return the plain-text response."""
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)

Evaluating the Outputs with MLflow

We will evaluate the responses generated by Gemini using MLflow’s evaluation metrics. This process involves initiating an MLflow run and applying various metrics to assess the model’s performance.

import mlflow

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),  # LLM-judged; uses an OpenAI model by default
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save the per-row results for later inspection
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)

Reviewing Evaluation Results

To analyze the evaluation results, load the saved CSV file into a DataFrame. This will allow you to inspect individual prompts, the generated responses, and their corresponding metric scores.

results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
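
To drill into the weakest answers programmatically, you can sort the table by a metric column. The exact column names depend on your MLflow version (the answer-similarity score typically lives in a column such as answer_similarity/v1/score), so treat the snippet below as a sketch and check results.columns first.

# Sketch: surface the lowest-scoring rows.
# Column names vary by MLflow version, so list them first and adjust as needed.
print(results.columns.tolist())

score_col = "answer_similarity/v1/score"  # assumed name; verify against the printout above
if score_col in results.columns:
    print(results.sort_values(score_col).head(3))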

Conclusion

Using MLflow for evaluating LLMs like Google’s Gemini model streamlines the assessment process, making it easier to track performance metrics. By following this guide, you can leverage MLflow’s capabilities to enhance your understanding of LLM outputs and improve your machine learning projects.

FAQs

  • What is MLflow? MLflow is an open-source platform designed to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment.
  • How do I get started with MLflow? You can start by installing MLflow and setting up your environment for your specific use case, such as LLM evaluation.
  • What APIs do I need for evaluating LLMs? You will need both OpenAI and Google Gemini API keys to evaluate LLMs effectively with MLflow.
  • Can I use MLflow for models other than LLMs? Yes, MLflow is versatile and can be used to manage a variety of machine learning models across different domains.
  • What metrics can I evaluate with MLflow? MLflow supports various evaluation metrics, including answer similarity, exact match, latency, and token count. You can also define custom LLM-judged metrics, as shown in the sketch below.
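
If the built-in metrics do not cover your use case, MLflow lets you define a custom LLM-judged metric with mlflow.metrics.genai.make_genai_metric. The snippet below is a minimal sketch: the metric name, definition, grading prompt, and judge model are illustrative, and the exact arguments may differ slightly between MLflow versions.

from mlflow.metrics.genai import make_genai_metric

# Minimal sketch of a custom LLM-judged metric; the name and prompt are illustrative.
conciseness = make_genai_metric(
    name="conciseness",
    definition="Measures whether the answer is brief while still covering the key facts.",
    grading_prompt=(
        "Score from 1 to 5 how concise the answer is: "
        "5 = short and complete, 1 = rambling or padded with irrelevant detail."
    ),
    model="openai:/gpt-4o-mini",  # judge model URI; assumes an OpenAI API key is set
    greater_is_better=True,
)

Pass the resulting metric in extra_metrics alongside the built-in ones, for example extra_metrics=[mlflow.metrics.genai.answer_similarity(), conciseness].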
