Salesforce AI Research Proposes Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries

Understanding Vision-Language Models (VLMs)

Vision-Language Models (VLMs) answer natural-language questions about images. However, they often produce answers that sound plausible but are not supported by the image, a problem known as hallucination. This undermines trust in these systems, especially in high-stakes settings.

The Challenge of Evaluating VLMs

Evaluating how helpful and truthful VLM responses are is difficult. It requires understanding the visual content and verifying each claim made in the response. Existing methods are limited: they either focus on simple questions or lack the context needed to judge responses to more complex, open-ended queries.

Introducing PROVE: A New Evaluation Method

Researchers from Salesforce AI Research have developed a new method called Programmatic VLM Evaluation (PROVE). This method assesses VLM responses to open-ended visual questions using a detailed scene graph representation derived from comprehensive image captions.

How PROVE Works

PROVE uses a large language model (LLM) to generate diverse question-answer (QA) pairs, together with executable programs that verify each pair against the scene graph. This results in a dataset of 10.5k challenging and visually grounded QA pairs. The evaluation measures both the helpfulness and truthfulness of VLM responses using a unified framework based on scene graph comparisons.
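The verification step can be illustrated with a minimal Python sketch. It is not the PROVE implementation: the tuple-based scene graph, the QAPair class, and the color_of and filter_verifiable helpers are all hypothetical names chosen for illustration. The sketch assumes a scene graph stored as (subject, relation, object) tuples and a verification program that is simply a function over that graph.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

# Assumption: a scene graph is a set of (subject, relation, object) tuples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]


@dataclass
class QAPair:
    question: str
    answer: str
    # Hypothetical verification program: runs over the scene graph
    # and returns the answer it can derive, or None.
    program: Callable[[SceneGraph], Optional[str]]


def color_of(entity: str) -> Callable[[SceneGraph], Optional[str]]:
    """Build a program that looks up the color attribute of an entity."""
    def program(graph: SceneGraph) -> Optional[str]:
        for subj, rel, obj in graph:
            if subj == entity and rel == "has_color":
                return obj
        return None
    return program


def filter_verifiable(pairs: List[QAPair], graph: SceneGraph) -> List[QAPair]:
    """Keep only QA pairs whose program runs and reproduces the stated answer."""
    kept = []
    for pair in pairs:
        try:
            result = pair.program(graph)
        except Exception:
            continue  # a program that crashes cannot verify its pair
        if result == pair.answer:
            kept.append(pair)
    return kept


graph: SceneGraph = frozenset({
    ("dog", "has_color", "brown"),
    ("dog", "sitting_on", "couch"),
    ("couch", "has_color", "gray"),
})

pairs = [
    QAPair("What color is the dog?", "brown", color_of("dog")),
    QAPair("What color is the couch?", "red", color_of("couch")),  # wrong answer
]

verified = filter_verifiable(pairs, graph)
print([p.question for p in verified])
```

In this toy example the second pair is discarded because its program returns "gray" rather than the stated answer "red", which mirrors the idea of keeping only QA pairs that can be verified programmatically.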

Benefits of the PROVE Benchmark

The PROVE benchmark enhances the evaluation of VLMs by using detailed scene graphs and verification programs. This ensures that only verifiable QA pairs are included, leading to a high-quality dataset. The evaluation process involves comparing scene graph representations from model responses and correct answers to assess helpfulness and truthfulness.
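A simplified sketch of how scene graph comparison can yield two separate scores is shown below. It rests on assumptions not stated in this article: helpfulness is treated as the fraction of ground-truth answer tuples that appear in the response, truthfulness as the fraction of response tuples supported by the image's scene graph, and tuples are compared by exact match. A real benchmark would likely need softer matching, and the function names here are hypothetical.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]


def helpfulness(response: FrozenSet[Triple], answer: FrozenSet[Triple]) -> float:
    """Assumed recall-style score: share of answer tuples covered by the response."""
    if not answer:
        return 0.0
    return len(response & answer) / len(answer)


def truthfulness(response: FrozenSet[Triple], scene: FrozenSet[Triple]) -> float:
    """Assumed precision-style score: share of response tuples found in the scene graph."""
    if not response:
        return 0.0
    return len(response & scene) / len(response)


# Toy scene graph for an image, and the tuples needed to answer the question.
scene = frozenset({
    ("dog", "has_color", "brown"),
    ("dog", "sitting_on", "couch"),
    ("couch", "has_color", "gray"),
})
answer = frozenset({("dog", "has_color", "brown")})

# Tuples extracted from a model response that answers correctly
# but also mentions a hat that is not in the image.
response = frozenset({
    ("dog", "has_color", "brown"),
    ("dog", "wearing", "hat"),
})

h = helpfulness(response, answer)
t = truthfulness(response, scene)
print(f"helpfulness={h:.2f} truthfulness={t:.2f}")
```

Here the response fully answers the question (helpfulness 1.0) but half of its claims are unsupported (truthfulness 0.5), which shows why the two scores need to be reported separately.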

Key Findings

Current VLMs often struggle to balance helpfulness and truthfulness. While models like GPT-4o and Phi-3.5-Vision show high helpfulness, they do not always provide truthful answers. Interestingly, smaller models like LLaVA-1.5 have achieved better truthfulness scores, suggesting that size does not always equate to accuracy.

Conclusion

PROVE marks a significant step forward in evaluating VLM responses. By using detailed representations and programmatic verification, it offers a more reliable assessment method. The findings highlight the importance of developing VLMs that can provide both informative and accurate responses, especially as their applications grow.

Get Involved

Check out the Paper and Dataset Card for more details.
