InternVideo2.5: Hierarchical Token Compression and Task Preference Optimization for Video MLLMs

Understanding Multimodal Large Language Models (MLLMs)

Multimodal large language models (MLLMs) are a promising step towards artificial general intelligence because they combine different types of sensory information in one system. However, they still struggle with basic vision tasks and perform much worse than humans on them. Key challenges include:

Object Recognition: Identifying objects accurately.
Localization: Determining where objects are located.
Motion Recall: Remembering movements over time.

Despite ongoing research, reaching human-level visual understanding is still a challenge. Developing systems that can interpret and reason across various sensory inputs with human-like accuracy remains complex.

Current Research Approaches

Researchers are exploring different methods to improve visual understanding in MLLMs. These include:

Combining Technologies: Connecting vision encoders to language models through connector modules to perform complex tasks such as image captioning and visual question answering.
Video Processing: Enhancing MLLMs to handle sequential visuals and understand changes over time.

However, challenges persist in detailed visual tasks, leading to two main strategies:

Pixel-to-Sequence (P2S): The model emits its visual predictions directly as text sequences.
Pixel-to-Embedding (P2E): The model compresses visual information into embeddings that downstream task decoders then decode.

Introducing InternVideo2.5

Researchers from Shanghai AI Laboratory, Nanjing University, and Shenzhen Institutes of Advanced Technology have developed InternVideo2.5. This new model enhances video MLLM capabilities by:

Long and Rich Context (LRC) Modeling: Improving the understanding of detailed video content and complex time sequences.
Integrating Annotations: Using direct preference optimization to incorporate detailed visual task annotations.
Adaptive Hierarchical Token Compression: Creating efficient representations of spatiotemporal data.
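The article names direct preference optimization (DPO) but does not give the objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, as it is commonly written in the general literature. It may differ from the variant InternVideo2.5 actually uses. The function name, argument names, and the default `beta` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy
    and under a frozen reference model. `beta` (assumed default 0.1)
    scales how strongly the policy is pushed away from the reference.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy prefers the chosen answer more strongly than the reference model does, and rises above it when the preference is reversed.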

Key Features of InternVideo2.5

The architecture of InternVideo2.5 includes:

Dynamic Video Sampling: Processing between 64 and 512 frames and compressing each 8-frame clip into 128 tokens (an average of 16 tokens per frame).
Advanced Components: Utilizing a Temporal Head based on CG-DETR and a Mask Head with SAM2’s pre-trained weights.
Optimized Processing: Using two-layer MLPs to encode positioning prompts and spatial inputs for the language model.
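To make the sampling figures above concrete, here is a small sketch of the resulting visual token budget. The constants come from the article. Clamping the frame count to the supported range and dropping any partial trailing clip are assumptions made for this example, not documented behavior of the model, and `visual_token_budget` is a hypothetical helper name.

```python
FRAMES_PER_CLIP = 8      # from the article
TOKENS_PER_CLIP = 128    # from the article
MIN_FRAMES = 64          # from the article
MAX_FRAMES = 512         # from the article


def visual_token_budget(num_frames: int) -> int:
    """Return the number of visual tokens for a sampled video.

    Assumption: the frame count is clamped to [MIN_FRAMES, MAX_FRAMES]
    and frames that do not fill a whole 8-frame clip are dropped.
    """
    clamped = max(MIN_FRAMES, min(MAX_FRAMES, num_frames))
    num_clips = clamped // FRAMES_PER_CLIP
    return num_clips * TOKENS_PER_CLIP


if __name__ == "__main__":
    for frames in (64, 128, 512):
        print(frames, "frames ->", visual_token_budget(frames), "tokens")
```

Under these assumptions, 64 frames yield 1,024 visual tokens and 512 frames yield 8,192, so the budget grows linearly with the number of sampled frames.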

Performance Improvements

InternVideo2.5 shows significant advancements in video understanding tasks:

Enhanced Accuracy: An improvement of more than 3 points on MVBench and Perception Test for short-video prediction.
Superior Recall: Demonstrated better memory capabilities in complex tasks.

Conclusion

InternVideo2.5 represents a major step forward in video MLLM technology, focusing on:

Improved Visual Capabilities: Enhancements in object tracking and understanding.
Future Research Opportunities: Addressing high computational costs and extending context processing techniques.

For more details, see the paper and the GitHub repository.

