
Understanding VLM2VEC and MMEB: A New Era in Multimodal AI
Introduction to Multimodal Embeddings
Multimodal embeddings map visual and textual data into a shared vector space, allowing systems to interpret and relate images and language by directly comparing their representations. This technology is crucial for a range of applications, including:
- Visual Question Answering
- Information Retrieval
- Classification
- Visual Grounding
These capabilities are essential for AI models that analyze real-world content, such as digital assistants and visual search engines.
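To make the idea concrete, the minimal sketch below shows how a single shared embedding space supports tasks like these: candidates (captions, class names, or answer strings) are ranked against a query by cosine similarity. The encoder is omitted and the embeddings are random placeholders, so the snippet only illustrates the mechanism, not any particular model.
```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: one image query and a small pool of text candidates,
# all assumed to live in the same d-dimensional space produced by some multimodal encoder.
d = 512
image_embedding = torch.randn(1, d)   # stand-in for encode_image(photo)
text_embeddings = torch.randn(4, d)   # stand-ins for encode_text(candidate strings)

# Cosine similarity ranks the candidates against the query; the top hit drives
# retrieval, classification (candidates = class names), or VQA (candidates = answers).
scores = F.cosine_similarity(image_embedding, text_embeddings, dim=-1)
best = scores.argmax().item()
print(f"best candidate index: {best}, score: {scores[best]:.3f}")
```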
The Challenge of Generalization
A significant challenge in the field has been that existing models struggle to generalize across different tasks and modalities. Most models are designed for a single task and degrade on unfamiliar datasets. In addition, the lack of a unified benchmark has led to inconsistent evaluations, limiting the models' effectiveness in real-world applications.
Existing Solutions and Their Limitations
Current tools such as CLIP, BLIP, and SigLIP generate visual-textual embeddings but face limitations in cross-modal reasoning. These models typically use separate encoders for images and text, merging their outputs through simple operations such as a dot product between pooled vectors. As a result, they often underperform in zero-shot scenarios due to this shallow integration and insufficient task-specific training.
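As an illustration of this dual-encoder pattern, the sketch below uses the publicly available openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers: the image and the text are encoded independently, and their only interaction is a normalized dot product at the end. The blank placeholder image is only there to keep the example self-contained.
```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Dual-encoder setup: the image and the text never interact until the final
# similarity score, which is the "shallow integration" noted above.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="white")   # placeholder image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Fusion is just a normalized dot product; no cross-modal reasoning happens here.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```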
Introducing VLM2VEC and MMEB
A collaboration between Salesforce Research and the University of Waterloo has led to the development of VLM2VEC, paired with a comprehensive benchmark known as MMEB (Massive Multimodal Embedding Benchmark). The benchmark includes:
- 36 datasets
- Four major tasks: classification, visual question answering, retrieval, and visual grounding
- 20 datasets for training and 16 for evaluation, including out-of-distribution tasks
The VLM2VEC framework utilizes contrastive training to convert any vision-language model into an effective embedding model, enabling it to process diverse combinations of text and images.
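The general recipe is to take the embedding directly from the vision-language model's hidden states. The sketch below shows one common pooling choice for decoder-style VLMs, last-token pooling, assuming a Hugging Face-style model that accepts output_hidden_states and an input batch containing an attention_mask; the exact pooling and prompt formatting used by VLM2VEC may differ in detail.
```python
import torch
import torch.nn.functional as F

def last_token_embedding(model, inputs):
    """Pool an interleaved image+text query (or a target) into one vector by
    taking the last layer's hidden state at the final non-padding position.
    This is a sketch of one common pooling choice, not the paper's exact recipe."""
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]                  # (batch, seq_len, dim)
    last = inputs["attention_mask"].sum(dim=1) - 1      # index of last real token per row
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)                     # unit-length embedding
```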
How VLM2VEC Works
The research team employed backbone models such as Phi-3.5-V and LLaVA-1.6. The process involves:
- Creating task-specific queries and targets.
- Using a vision-language model to generate embeddings.
- Applying contrastive training with the InfoNCE loss function, which pulls matched query-target embeddings together and pushes mismatched pairs apart.
- Utilizing GradCache so that the large batch sizes contrastive training benefits from fit within limited GPU memory.
This structured approach allows VLM2VEC to adapt its encoding based on the task, significantly improving generalization.
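A minimal sketch of the contrastive objective is shown below: InfoNCE with in-batch negatives, computed in PyTorch over already-normalized query and target embeddings. The temperature value is illustrative rather than the paper's setting, and the snippet omits GradCache, whose role is to chunk the batch and cache embedding gradients so that large batches fit in memory.
```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb, target_emb, temperature=0.02):
    """InfoNCE with in-batch negatives: each query should score highest against
    its own target (the diagonal of the similarity matrix) and lower against
    every other target in the batch. Embeddings are assumed L2-normalized;
    the temperature is an illustrative value, not the paper's setting."""
    logits = query_emb @ target_emb.T / temperature     # (batch, batch) similarities
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-ins for VLM-produced embeddings.
q = F.normalize(torch.randn(8, 768), dim=-1)
t = F.normalize(torch.randn(8, 768), dim=-1)
print(infonce_loss(q, t).item())
```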
Performance Outcomes
The results indicate a substantial improvement in performance. The best version of VLM2VEC achieved:
- A Precision@1 score of 62.9% averaged across all MMEB datasets.
- Strong zero-shot performance, scoring 57.1% on the out-of-distribution datasets.
- An improvement of 18.2 points over the best baseline model that does not use fine-tuning.
These findings highlight the effectiveness of VLM2VEC in comparison to traditional models, demonstrating its potential for scalable and adaptable multimodal AI applications.
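For reference, Precision@1 simply measures how often the top-ranked candidate is the labeled correct one. The helper below is an illustrative implementation over normalized embeddings, not the official MMEB evaluation script.
```python
import torch
import torch.nn.functional as F

def precision_at_1(query_emb, candidate_emb, gold_idx):
    """Fraction of queries whose highest-scoring candidate (by dot product over
    normalized embeddings) matches the labeled correct index."""
    scores = query_emb @ candidate_emb.T        # (n_queries, n_candidates)
    top1 = scores.argmax(dim=1)                 # best candidate per query
    return (top1 == gold_idx).float().mean().item()

# Toy check: 3 queries, 5 candidates, with known correct indices.
q = F.normalize(torch.randn(3, 64), dim=-1)
c = F.normalize(torch.randn(5, 64), dim=-1)
gold = torch.tensor([0, 2, 4])
print(precision_at_1(q, c, gold))
```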
Conclusion
The introduction of VLM2VEC and MMEB addresses the limitations of existing multimodal embedding tools by providing a robust framework for generalization across tasks. This advancement represents a significant leap forward in the development of multimodal AI, making it more versatile and efficient for real-world applications.