Understanding MLflow for Evaluating Large Language Models
MLflow has emerged as a robust tool for managing the machine learning lifecycle, and its recent enhancements now allow for the evaluation of Large Language Models (LLMs). This guide will walk you through the process of using MLflow to evaluate the performance of Google’s Gemini model on factual prompts, detailing each step along the way.
Identifying the Audience
This article targets data scientists, machine learning engineers, and business analysts who need to evaluate LLM outputs. These professionals often face challenges such as:
- Inconsistent assessment of model performance.
- Lack of established methodologies for evaluating LLM outputs.
- Integration difficulties with various APIs and tools in their workflows.
They want practical, hands-on tutorials with clear instructions and relevant metrics that help them understand model behavior and improve deployment outcomes.
Setting Up Your Environment
To get started, you will need access to both the OpenAI and Google Gemini APIs. Gemini generates the answers under evaluation, while MLflow's LLM-judged metrics (such as answer similarity) use an OpenAI model as the judge, so both API keys are essential:
- Get your OpenAI API key from the OpenAI API Keys page.
- Get your Google Gemini API key by following the Google Gemini API Docs.
Installing Required Libraries
Run the following command to install the necessary libraries:
pip install mlflow openai pandas google-genai
Setting Environment Variables
Next, you need to set your API keys as environment variables using the following code:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')
Preparing Your Evaluation Dataset
Now, let’s create a dataset containing factual prompts and their corresponding correct answers. This structured dataset will serve as a benchmark against which we can compare the responses generated by Gemini.
import pandas as pd

# Factual prompts paired with reference answers to benchmark Gemini against.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)
Fetching Responses from Gemini
We will define a function to send prompts to the Gemini model and retrieve the generated responses. Each response will be stored in a new column of our evaluation dataset.
from google import genai

# The google-genai client picks up GOOGLE_API_KEY from the environment.
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    """Send a single prompt to Gemini and return the plain-text response."""
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

# Generate a prediction for every prompt in the evaluation set.
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
Evaluating the Outputs with MLflow
We will evaluate the responses generated by Gemini using MLflow’s evaluation metrics. This process involves initiating an MLflow run and applying various metrics to assess the model’s performance.
import mlflow

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save the per-row results table for later inspection
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
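Because the tracking URI points at a local mlruns directory, you can also browse the run and its aggregated metrics in the MLflow UI by launching it from the same working directory:
mlflow ui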
Reviewing Evaluation Results
To analyze the evaluation results, load the saved CSV file into a DataFrame. This will allow you to inspect individual prompts, the generated responses, and their corresponding metric scores.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
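If you want to drill into the weakest answers, you can sort the table by the judge's per-row similarity score. Column names in the saved table vary by MLflow version (recent releases use names like "answer_similarity/v1/score"), so the short sketch below guards on what is actually present; check results.columns if nothing matches.
score_col = "answer_similarity/v1/score"  # assumed column name; verify against results.columns
if score_col in results.columns:
    # Show the rows the judge scored lowest to spot weak or incorrect answers.
    worst = results.sort_values(score_col).head(3)
    print(worst[[c for c in ("inputs", "predictions", score_col) if c in worst.columns]])
else:
    print("Available columns:", list(results.columns))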
Conclusion
Using MLflow for evaluating LLMs like Google’s Gemini model streamlines the assessment process, making it easier to track performance metrics. By following this guide, you can leverage MLflow’s capabilities to enhance your understanding of LLM outputs and improve your machine learning projects.
FAQs
- What is MLflow? MLflow is an open-source platform designed to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment.
- How do I get started with MLflow? You can start by installing MLflow and setting up your environment for your specific use case, such as LLM evaluation.
- What APIs do I need for evaluating LLMs? For this workflow you need a Google Gemini API key to generate the responses and an OpenAI API key, because MLflow's LLM-judged metrics (such as answer similarity) use an OpenAI model as the judge by default.
- Can I use MLflow for models other than LLMs? Yes, MLflow is versatile and can be used to manage a variety of machine learning models across different domains.
- What metrics can I evaluate with MLflow? MLflow supports various evaluation metrics, including answer similarity, exact match, latency, and token count.
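Beyond the built-ins listed above, MLflow also lets you define custom LLM-judged metrics. The snippet below is a minimal sketch assuming MLflow 2.x's mlflow.metrics.genai.make_genai_metric helper; the metric name, grading prompt, and judge model URI are illustrative choices rather than fixed requirements.
import mlflow

# Minimal custom LLM-judged metric (illustrative; adjust names and judge model to your setup).
conciseness = mlflow.metrics.genai.make_genai_metric(
    name="conciseness",
    definition="Measures how brief the answer is while remaining factually correct.",
    grading_prompt="Score from 1 to 5, where 5 means the answer is correct and as short as possible.",
    model="openai:/gpt-4",  # assumed judge model URI; use any model your OpenAI key can access
    greater_is_better=True,
)
# Pass it alongside the built-in metrics via extra_metrics in mlflow.evaluate().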