Vision-Language-Action (VLA) Models for Robotics
VLA models combine large language models with vision encoders and are fine-tuned on robot demonstration datasets, enabling robots to follow new instructions and recognize unfamiliar objects. However, most robot datasets are collected through human teleoperation, which makes them hard to scale. Internet video, by contrast, offers far more examples of human actions and object interactions and could make training more scalable.
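To make this setup concrete, here is a minimal, hypothetical sketch of how a VLA model can be wired together: a vision encoder turns the camera image into tokens, these are concatenated with instruction tokens, and a transformer backbone emits logits over discretized action bins. The `ToyVLA` class, its layer sizes, and the action-token readout are illustrative assumptions, not the architecture of any specific system.

```python
# Hypothetical, simplified VLA forward pass: image tokens + instruction tokens
# go through one backbone, which predicts discretized action values.
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    def __init__(self, vocab_size=32000, action_bins=256, dim=512):
        super().__init__()
        # Stand-in vision encoder: patchify the image and project patches to tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Action head: one of `action_bins` discretized values per action dimension.
        self.action_head = nn.Linear(dim, action_bins)

    def forward(self, image, instruction_ids, num_action_dims=7):
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        txt_tokens = self.text_embed(instruction_ids)                    # (B, T, dim)
        x = self.backbone(torch.cat([img_tokens, txt_tokens], dim=1))
        # Read discretized actions off the last few token positions (toy simplification).
        return self.action_head(x[:, -num_action_dims:, :])             # (B, 7, action_bins)

model = ToyVLA()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 7, 256])
```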
Challenges with Internet Videos
Learning from online videos is challenging because:
- Most videos lack explicit action labels.
- Video settings often differ from the environments where robots operate.
Advancements in Vision-Language Models (VLMs)
VLMs trained on large corpora of text, images, and video can understand and generate both text and multimodal data, and adding auxiliary objectives during training has further improved their performance. Yet these approaches still depend on labeled action data, which limits how far general-purpose VLAs can scale.
Training Robot Policies from Videos
Videos rich in dynamics and behavior are a natural source of supervision for robot learning. Some recent studies use generative models trained on human videos to support robotic tasks, but current methods typically require paired human-robot data or remain narrowly task-specific.
LAPA: A New Approach
Researchers from several institutions introduced Latent Action Pretraining for General Action Models (LAPA), an unsupervised method that pretrains on Internet-scale videos without requiring robot action labels.
How LAPA Works
LAPA proceeds in two stages:
- **First stage**: A VQ-VAE-based objective quantizes the change between consecutive video frames into a small vocabulary of discrete latent actions (a minimal sketch follows this list).
- **Second stage**: A vision-language model is pretrained to predict these latent actions from video observations and task descriptions, and is then fine-tuned on a small labeled robot dataset (see the second sketch below).
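The paper's released code is not reproduced here; the sketch below is a simplified reading of the first stage, assuming a standard VQ-VAE recipe: an encoder maps a pair of consecutive frames to a continuous latent, the latent is snapped to the nearest entry of a small codebook (the discrete "latent action"), and a decoder must reconstruct the next frame from the current frame plus that code. All names and sizes (`LatentActionQuantizer`, `codebook_size=8`, 64x64 frames) are illustrative assumptions.

```python
# Stage 1 sketch (simplified, not the official implementation): quantize the
# transition between consecutive frames into a discrete latent action token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    def __init__(self, codebook_size=8, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(  # (frame_t, frame_t+1) -> continuous latent
            nn.Conv2d(6, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, code_dim),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)  # discrete latent actions
        # Toy decoder: reconstruct the next frame from the current frame plus the code.
        self.decoder = nn.Linear(code_dim + 3 * 64 * 64, 3 * 64 * 64)

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))               # (B, code_dim)
        dists = (z.unsqueeze(1) - self.codebook.weight.unsqueeze(0)).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)                                            # nearest code index
        z_q_raw = self.codebook(idx)
        codebook_loss = F.mse_loss(z_q_raw, z.detach())   # pull codebook toward encoder output
        commit_loss = F.mse_loss(z, z_q_raw.detach())     # commit encoder to chosen code
        z_q = z + (z_q_raw - z).detach()                  # straight-through estimator
        recon = self.decoder(torch.cat([z_q, frame_t.flatten(1)], dim=1)).view_as(frame_t1)
        loss = F.mse_loss(recon, frame_t1) + codebook_loss + 0.25 * commit_loss
        return idx, loss  # idx is the discrete latent action label for this transition

quantizer = LatentActionQuantizer()
latent_action, loss = quantizer(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(latent_action, loss.item())
```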
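A similarly hedged sketch of the second stage and the final fine-tuning step: a policy is first supervised with the stage-1 latent action indices, which requires no robot labels, and is later fine-tuned on a small labeled robot dataset with a separate head over discretized real actions. The `ToyPolicy` stand-in replaces the actual vision-language backbone and is purely illustrative.

```python
# Stage 2 sketch (hypothetical): latent pretraining, then fine-tuning on real actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPolicy(nn.Module):
    def __init__(self, num_latent_actions=8, num_real_action_bins=256, dim=128):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, dim, 8, stride=8),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text = nn.EmbeddingBag(1000, dim)  # bag-of-words stand-in for a language model
        self.latent_head = nn.Linear(2 * dim, num_latent_actions)  # used during latent pretraining
        self.real_head = nn.Linear(2 * dim, num_real_action_bins)  # used during fine-tuning

    def features(self, frame, instruction_ids):
        return torch.cat([self.vision(frame), self.text(instruction_ids)], dim=1)

policy = ToyPolicy()
frame = torch.randn(4, 3, 64, 64)
instruction = torch.randint(0, 1000, (4, 6))

# Latent pretraining: cross-entropy against stage-1 latent action indices (no robot labels).
latent_action = torch.randint(0, 8, (4,))
pretrain_loss = F.cross_entropy(policy.latent_head(policy.features(frame, instruction)), latent_action)

# Fine-tuning: a small labeled robot dataset supervises discretized real actions instead.
real_action_bin = torch.randint(0, 256, (4,))
finetune_loss = F.cross_entropy(policy.real_head(policy.features(frame, instruction)), real_action_bin)
print(pretrain_loss.item(), finetune_loss.item())
```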
Key Benefits of LAPA
LAPA outperforms previous models such as OpenVLA, achieving:
- Greater pretraining efficiency, requiring roughly 272 H100 hours versus 21,500 A100 hours.
- Improved performance on real-world tasks that require language conditioning and generalization.
Conclusion and Future Opportunities
LAPA is a scalable pretraining method for VLAs that transfers well to a range of downstream tasks. Although it remains limited on tasks requiring fine-grained motions, it represents a significant advance in robot performance.
Future Directions
Potential areas for improvement include:
- Improving latent action modeling to better capture fine-grained motions.
- Implementing hierarchical architectures to reduce latency during real-time inference.
Discover More
For more details, see the Paper, the Model Card on Hugging Face, and the Project Page.