
Benchmarking MFMs: Evaluating GPT-4o’s Visual Comprehension Skills

Understanding Multimodal Foundation Models (MFMs)

Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have gained attention for their ability to process both text and visual information. While their language capabilities are well-established, their visual comprehension skills are still being evaluated. This article explores the current state of MFMs in vision tasks, highlighting their strengths and weaknesses.

The Challenge of Visual Comprehension

Current benchmarks for MFMs focus primarily on tasks with text outputs, such as Visual Question Answering (VQA) and classification. These assessments often reward language processing more than genuine visual understanding, and important aspects of vision, including 3D perception and segmentation, are frequently overlooked. This raises questions about the true extent of MFMs' visual comprehension.

Performance in Integrated Tasks

MFMs have shown promising results in tasks that require both visual and language understanding, such as image captioning and VQA. However, their effectiveness in more complex visual tasks remains uncertain. Many existing benchmarks rely on text outputs, making it difficult to compare MFMs with specialized vision models. Some researchers have attempted to adapt vision datasets for MFMs by converting annotations into text, but this limits the evaluation scope.
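To make this text-conversion approach concrete, here is a minimal sketch, assuming a COCO-style annotation dictionary and a made-up prompt format (neither is taken from any specific study), of how a bounding-box label might be flattened into a question-answer pair that an MFM can be scored on by string comparison:

```python
# Hypothetical sketch: turning a COCO-style box annotation into a text-only
# QA pair so an MFM can be scored on its text output. The prompt wording and
# annotation fields here are illustrative assumptions, not from the study.

def annotation_to_text_qa(annotation: dict) -> tuple[str, str]:
    """Convert one bounding-box annotation into a (question, expected_answer) pair."""
    label = annotation["category"]       # e.g. "dog"
    x, y, w, h = annotation["bbox"]      # COCO format: top-left x, y, width, height
    question = f"Report the bounding box of the {label} as 'x, y, width, height' in pixels."
    expected = f"{x}, {y}, {w}, {h}"
    return question, expected

if __name__ == "__main__":
    ann = {"category": "dog", "bbox": [48, 112, 230, 180]}
    q, a = annotation_to_text_qa(ann)
    print(q)
    print("expected:", a)
```

Scoring free-form model output against exact pixel coordinates in this way is brittle, which is one reason text-converted benchmarks capture only a narrow slice of visual competence.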

Case Study: Evaluation at EPFL

Researchers at EPFL conducted a comprehensive evaluation of several leading MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5. They focused on fundamental computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. The study revealed that while MFMs are competent generalists, they do not outperform specialized vision models, particularly in geometric tasks.

Prompt-Chaining Framework

To assess MFMs on vision tasks, the researchers introduced a prompt-chaining strategy that breaks complex tasks down into subtasks the models can answer in text. For example, instead of predicting bounding boxes directly, the model first identifies which objects are present and then locates each one through recursive image cropping. This modular design plays to the strengths of MFMs in classification and similarity judgments, while calibration controls keep comparisons with specialized models fair.
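As a rough illustration of the idea, the sketch below is hypothetical rather than the study's actual pipeline: the model is first asked which objects are present, and each object is then localized by recursively asking which quadrant of the image contains it. The `mfm_query` helper is an assumed wrapper around whatever MFM API is under test.

```python
# Simplified, hypothetical prompt-chaining sketch for object detection.
# `mfm_query(image, prompt)` is an assumed helper that sends an image plus a
# text prompt to the MFM under test and returns its text answer; it is not a
# real library call. The quadrant scheme and recursion depth are illustrative.

from PIL import Image

def mfm_query(image: Image.Image, prompt: str) -> str:
    raise NotImplementedError("Wrap your MFM API of choice here.")

def locate(image: Image.Image, label: str, box=(0.0, 0.0, 1.0, 1.0), depth: int = 3):
    """Recursively narrow the region containing `label` by querying quadrant crops."""
    if depth == 0:
        return box  # normalized (left, top, right, bottom) of the final region
    left, top, right, bottom = box
    mid_x, mid_y = (left + right) / 2, (top + bottom) / 2
    quadrants = {
        "top-left": (left, top, mid_x, mid_y),
        "top-right": (mid_x, top, right, mid_y),
        "bottom-left": (left, mid_y, mid_x, bottom),
        "bottom-right": (mid_x, mid_y, right, bottom),
    }
    w, h = image.size
    for name, (l, t, r, b) in quadrants.items():
        crop = image.crop((int(l * w), int(t * h), int(r * w), int(b * h)))
        answer = mfm_query(crop, f"Does this image contain a {label}? Answer yes or no.")
        if answer.strip().lower().startswith("yes"):
            # Recurse into the first quadrant the model says contains the object.
            return locate(image, label, quadrants[name], depth - 1)
    return box  # fall back to the current region if no quadrant is confirmed

def detect(image: Image.Image) -> dict:
    """First ask what objects are present, then localize each one."""
    labels = mfm_query(image, "List the objects visible in this image, comma separated.")
    return {lbl.strip(): locate(image, lbl.strip()) for lbl in labels.split(",") if lbl.strip()}
```

Each subtask reduces to a naming or yes/no question, which plays to the classification strengths noted above, at the cost of many model calls per image, one source of the high inference costs mentioned later.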

Results and Findings

The evaluation showed that GPT-4o achieved notable scores, including 77.2% classification accuracy on ImageNet and 60.62 AP50 (Average Precision at a 50% Intersection-over-Union threshold) for object detection. However, it was outperformed by specialized models such as ViT-G and Co-DETR. In semantic segmentation, GPT-4o scored 44.89 mIoU (mean Intersection over Union), while OneFormer led with 65.52. Although MFMs handled distribution shifts reasonably well, they struggled with precise visual reasoning.
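For readers less familiar with these metrics, the snippet below (with made-up box coordinates) shows how Intersection over Union is computed for one pair of boxes; AP50 counts a predicted box as correct when this overlap reaches at least 0.5, and mIoU averages the analogous per-class overlap of segmentation masks.

```python
# Intersection over Union (IoU) for two axis-aligned boxes in (x1, y1, x2, y2) form.
# AP50 treats a predicted box as a true positive when IoU >= 0.5 with the ground truth.

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# Made-up example: a prediction that overlaps the ground truth fairly well.
print(round(iou((10, 10, 110, 110), (30, 20, 120, 120)), 3))  # 0.61, above the 0.5 threshold
```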

Conclusion and Future Directions

The study establishes a benchmarking framework for evaluating the visual capabilities of MFMs, revealing that they perform better on semantic tasks than geometric ones. While GPT-4o leads in overall performance among MFMs, all models significantly lag behind task-specific vision models. Despite their limitations, such as high inference costs and prompt sensitivity, MFMs show promise, especially with advancements in reasoning models for 3D tasks.

FAQs

  • What are Multimodal Foundation Models (MFMs)? MFMs are AI models designed to process and understand both text and visual information.
  • How do MFMs perform in visual tasks compared to specialized models? MFMs generally do not match the performance of specialized vision models, particularly in geometric tasks.
  • What is the prompt-chaining strategy? It is a method that breaks down complex visual tasks into simpler, manageable subtasks for better evaluation.
  • What datasets were used in the EPFL study? The study utilized datasets like COCO and ImageNet for evaluating the MFMs.
  • What are the limitations of MFMs in visual comprehension? Limitations include high inference costs, prompt sensitivity, and challenges in precise visual reasoning.

In summary, while MFMs like GPT-4o show significant advancements in integrating visual and language understanding, they still face challenges in specialized visual tasks. The ongoing research and development in this field will likely lead to improved models that can better comprehend and interpret visual information.


Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.
