Understanding Multimodal Large Language Models (MLLMs)
Multimodal large language models (MLLMs) are a promising step toward artificial general intelligence: they integrate multiple input modalities, such as images, video, and text, into a single system. However, they still perform far below human level on basic vision tasks. Key challenges include:
- Object Recognition: Identifying objects accurately.
- Localization: Determining where objects are located.
- Motion Recall: Remembering movements over time.
Despite ongoing research, systems that can interpret and reason across diverse sensory inputs with human-level accuracy remain out of reach.
Current Research Approaches
Researchers are exploring different methods to improve visual understanding in MLLMs. These include:
- Combining Technologies: Pairing vision encoders, language models, and connectors to handle tasks such as image captioning and visual question answering (a minimal composition sketch follows this list).
- Video Processing: Enhancing MLLMs to handle sequential visuals and understand changes over time.
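The sketch below, in PyTorch, shows one common way these pieces fit together; the class and module names are illustrative assumptions, not InternVideo2.5's actual code.

```python
import torch
import torch.nn as nn

class MinimalVideoMLLM(nn.Module):
    """Illustrative video MLLM: vision encoder -> connector -> language model."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a ViT mapping frames to patch tokens
        self.connector = connector            # e.g., an MLP projecting vision dims to LLM dims
        self.llm = llm                        # a causal LM that accepts input embeddings

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); fold time into the batch for per-frame encoding
        b, t = frames.shape[:2]
        patch_tokens = self.vision_encoder(frames.flatten(0, 1))              # (B*T, N, D_vision)
        visual_embeds = self.connector(patch_tokens)                           # (B*T, N, D_llm)
        visual_embeds = visual_embeds.reshape(b, -1, visual_embeds.shape[-1])  # (B, T*N, D_llm)
        # Prepend the visual tokens to the text embeddings and let the LLM decode
        return self.llm(torch.cat([visual_embeds, text_embeds], dim=1))
```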
However, these models still fall short on fine-grained visual tasks, which has led to two main strategies, both sketched below:
- Pixel-to-Sequence (P2S): The model expresses fine-grained visual outputs, such as coordinates or timestamps, directly as text tokens in its response.
- Pixel-to-Embedding (P2E): The model emits special embeddings that dedicated decoders (for example, segmentation or temporal heads) turn into dense predictions.
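A rough sketch of the contrast between the two strategies follows. It assumes a Hugging Face-style model interface (`generate`, `output_hidden_states`); the helper names (`mask_decoder`, `seg_token_id`) are hypothetical and not InternVideo2.5's API.

```python
import torch

def pixel_to_sequence(llm, tokenizer, prompt_ids: torch.Tensor) -> str:
    """P2S: the LLM writes visual results (e.g., box coordinates) directly as text."""
    output_ids = llm.generate(prompt_ids)          # ordinary autoregressive decoding
    return tokenizer.decode(output_ids[0])         # e.g., "person: [0.12, 0.30, 0.55, 0.91]"

def pixel_to_embedding(llm, mask_decoder, input_ids: torch.Tensor,
                       image_features: torch.Tensor, seg_token_id: int) -> torch.Tensor:
    """P2E: the LLM produces an embedding that a dedicated decoder turns into a dense output."""
    outputs = llm(input_ids, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]                   # (B, L, D) last-layer hidden states
    seg_embedding = hidden[input_ids == seg_token_id]    # hidden state(s) at the special token
    return mask_decoder(image_features, seg_embedding)   # e.g., a segmentation mask
```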
Introducing InternVideo2.5
Researchers from Shanghai AI Laboratory, Nanjing University, and Shenzhen Institutes of Advanced Technology have developed InternVideo2.5. This new model enhances video MLLM capabilities by:
- Long and Rich Context (LRC) Modeling: Improving the understanding of detailed video content and complex time sequences.
- Integrating Annotations: Using direct preference optimization to incorporate detailed visual task annotations.
- Adaptive Hierarchical Token Compression: Creating efficient representations of spatiotemporal data.
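As a toy illustration of the token-compression idea, the sketch below squeezes one clip's spatiotemporal tokens down to a fixed budget using simple pooling; the actual adaptive hierarchical scheme merges tokens based on content rather than averaging, so treat this only as a shape-level sketch.

```python
import torch
import torch.nn.functional as F

def compress_clip_tokens(clip_tokens: torch.Tensor, budget: int = 128) -> torch.Tensor:
    """Reduce one clip's visual tokens to a fixed budget (toy version using pooling).

    clip_tokens: (frames, tokens_per_frame, dim), e.g., 8 frames of ViT patch tokens.
    Returns a (budget, dim) tensor.
    """
    f, n, d = clip_tokens.shape
    flat = clip_tokens.reshape(f * n, d)                           # all spatiotemporal tokens
    pooled = F.adaptive_avg_pool1d(flat.t().unsqueeze(0), budget)  # (1, dim, budget)
    return pooled.squeeze(0).t()                                   # (budget, dim)

# Example: 8 frames x 256 patch tokens x 1024 dims -> 128 tokens for the clip.
clip = torch.randn(8, 256, 1024)
print(compress_clip_tokens(clip).shape)  # torch.Size([128, 1024])
```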
Key Features of InternVideo2.5
The architecture of InternVideo2.5 includes:
- Dynamic Video Sampling: Processing between 64 and 512 input frames, with each 8-frame clip compressed into 128 tokens (about 16 tokens per frame).
- Task-Specific Heads: A temporal head based on CG-DETR and a mask head initialized from SAM2's pre-trained weights.
- Spatial Prompt Encoding: Two-layer MLPs encode spatial inputs and their positions for the language model (a token-budget and projector sketch follows this list).
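A quick sanity check on the numbers above, plus a generic two-layer MLP projector of the kind such models typically use; the projector dimensions are illustrative placeholders, not the model's actual configuration.

```python
import torch.nn as nn

# Each 8-frame clip becomes 128 tokens, i.e. 16 tokens per frame on average.
def visual_token_count(num_frames: int, frames_per_clip: int = 8, tokens_per_clip: int = 128) -> int:
    assert num_frames % frames_per_clip == 0
    return (num_frames // frames_per_clip) * tokens_per_clip

print(visual_token_count(64))   # 1024 tokens at the shortest sampling setting
print(visual_token_count(512))  # 8192 tokens at the longest

# A two-layer MLP used to encode and project spatial inputs into the LLM
# embedding space (dimensions here are placeholders).
spatial_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```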
Performance Improvements
InternVideo2.5 shows significant advancements in video understanding tasks:
- Enhanced Accuracy: More than a 3-point improvement on MVBench and the Perception Test for short-video understanding.
- Superior Recall: Stronger retention and retrieval of visual details over long videos and in complex multi-step tasks.
Conclusion
InternVideo2.5 represents a major step forward in video MLLM technology, focusing on:
- Improved Visual Capabilities: Enhancements in object tracking and understanding.
- Future Research Opportunities: Addressing high computational costs and extending context processing techniques.
For more details, check out the Paper and GitHub.
Transform Your Business with AI
To stay competitive, consider using InternVideo2.5 in your operations:
- Identify Automation Opportunities: Find key areas in customer interactions that can benefit from AI.
- Define KPIs: Ensure your AI projects have measurable impacts on your business.
- Select an AI Solution: Choose tools that fit your needs and allow customization.
- Implement Gradually: Start with a pilot project, gather data, and expand AI use wisely.