
Salesforce Unveils VLM2VEC and MMEB: A Breakthrough in Universal Multimodal Embeddings

Understanding VLM2VEC and MMEB: A New Era in Multimodal AI

Introduction to Multimodal Embeddings

Multimodal embeddings integrate visual and textual data, allowing systems to interpret and relate images and language in a meaningful way. This technology is crucial for various applications, including:

  • Visual Question Answering
  • Information Retrieval
  • Classification
  • Visual Grounding

These capabilities are essential for AI models that analyze real-world content, such as digital assistants and visual search engines.
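At the core of all of these applications is the same operation: place images and text in a shared vector space, then compare them by similarity. The sketch below illustrates this with toy hand-written vectors standing in for real model outputs (the embeddings and captions are invented for illustration; a real system would obtain them from a trained encoder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings standing in for real model outputs.
image_emb = np.array([0.2, 0.8, 0.1, 0.3])
captions = {
    "a cat on a sofa":    np.array([0.25, 0.75, 0.05, 0.35]),
    "a plane taking off": np.array([0.9, 0.1, 0.4, 0.0]),
}

# Retrieval: rank candidate captions by similarity to the image embedding.
best = max(captions, key=lambda c: cosine_similarity(image_emb, captions[c]))
print(best)  # → a cat on a sofa
```

Retrieval, classification, and visual question answering all reduce to variants of this ranking step; what differs is how the embeddings are produced and trained.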

The Challenge of Generalization

A significant challenge in the field has been the inability of existing models to generalize across tasks and modalities. Most models are designed for a single task and struggle with unfamiliar datasets. Additionally, the lack of a unified benchmark leads to inconsistent evaluations, limiting the models’ effectiveness in real-world applications.

Existing Solutions and Their Limitations

Current tools like CLIP, BLIP, and SigLIP generate visual-textual embeddings but face limitations in cross-modal reasoning. These models typically use separate encoders for images and text, merging their outputs only through shallow operations such as a dot product between the two vectors. As a result, they often underperform in zero-shot scenarios due to this shallow integration and insufficient task-specific training.

Introducing VLM2VEC and MMEB

A collaboration between Salesforce Research and the University of Waterloo has led to the development of VLM2VEC, paired with a comprehensive benchmark known as MMEB. This benchmark includes:

  • 36 datasets
  • Four major tasks: classification, visual question answering, retrieval, and visual grounding
  • 20 datasets for training and 16 for evaluation, including out-of-distribution tasks

The VLM2VEC framework utilizes contrastive training to convert any vision-language model into an effective embedding model, enabling it to process diverse combinations of text and images.
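The key move is to reuse a generative vision-language model as an encoder: run the interleaved image-and-text input through the model and pool its hidden states into a single normalized vector. A minimal sketch of that pooling step, with simulated hidden states in place of a real VLM's output (the array shapes and pooling choice here are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def embed(hidden_states: np.ndarray) -> np.ndarray:
    """Pool a model's per-token hidden states into one embedding:
    take the hidden state of the final token and L2-normalize it.
    (hidden_states is simulated here; a real VLM would produce it.)"""
    last = hidden_states[-1]
    return last / np.linalg.norm(last)

# Simulated hidden states for a 5-token sequence (instruction + image
# tokens + text), each with a hidden size of 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
query_embedding = embed(hidden)
print(query_embedding.shape)  # → (8,)
```

Because the same model encodes any mix of image and text tokens, one embedding function covers queries and targets across all four task types.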

How VLM2VEC Works

The research team employed backbone models such as Phi-3.5-V and LLaVA-1.6. The process involves:

  1. Creating task-specific queries and targets.
  2. Using a vision-language model to generate embeddings.
  3. Applying contrastive training with the InfoNCE loss function to enhance alignment of embeddings.
  4. Utilizing GradCache for efficient memory management during training.

This structured approach allows VLM2VEC to adapt its encoding based on the task, significantly improving generalization.
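Step 3 above can be sketched concretely. The InfoNCE loss treats each query's paired target as the positive and the other targets in the batch as in-batch negatives; a minimal numpy version (the temperature value is an illustrative assumption):

```python
import numpy as np

def info_nce(query_embs: np.ndarray, target_embs: np.ndarray,
             temperature: float = 0.05) -> float:
    """InfoNCE loss over a batch: each query's positive target is the
    same-index row; all other rows in the batch serve as negatives."""
    # Normalize so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = q @ t.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Loss: negative log-probability of the diagonal (positive) pairs.
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls each query embedding toward its matching target and pushes it away from the rest of the batch, which is why larger batches (made feasible by GradCache's memory management) tend to help.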

Performance Outcomes

The results indicate a substantial improvement in performance. The best version of VLM2VEC achieved:

  • A Precision@1 score of 62.9% averaged across all 36 MMEB datasets.
  • Strong zero-shot performance: 57.1% on the out-of-distribution datasets.
  • An 18.2-point improvement over the best baseline model that used no fine-tuning.

These findings highlight the effectiveness of VLM2VEC in comparison to traditional models, demonstrating its potential for scalable and adaptable multimodal AI applications.
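For reference, Precision@1 is simple to compute: it is the fraction of queries whose single top-ranked candidate is the correct target. A minimal sketch with toy scores (the score matrix and labels are invented for illustration):

```python
import numpy as np

def precision_at_1(similarity: np.ndarray, true_idx: np.ndarray) -> float:
    """Precision@1: fraction of queries whose top-ranked candidate
    (highest similarity score) is the correct target."""
    top1 = similarity.argmax(axis=1)
    return float((top1 == true_idx).mean())

# 4 queries scored against 3 candidates each (toy scores).
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.6],
    [0.5, 0.4, 0.1],
])
truth = np.array([0, 1, 0, 0])  # correct candidate index per query
print(precision_at_1(scores, truth))  # → 0.75 (query 3's top hit is wrong)
```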

Conclusion

The introduction of VLM2VEC and MMEB addresses the limitations of existing multimodal embedding tools by providing a robust framework for generalization across tasks. This advancement represents a significant leap forward in the development of multimodal AI, making it more versatile and efficient for real-world applications.



Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com
