Latent Action Pretraining for General Action models (LAPA): An Unsupervised Method for Pretraining Vision-Language-Action (VLA) Models without Ground-Truth Robot Action Labels

Vision-Language-Action Models (VLA) for Robotics

VLA models combine large language models with vision encoders and are fine-tuned on robot datasets. This enables robots to follow new instructions and recognize unfamiliar objects. However, most robot datasets are collected by humans controlling the robot, which makes them hard to scale. Internet video, by contrast, offers far more examples of human actions and interactions and is a more scalable source of data.

Challenges with Internet Videos

Learning from online videos is challenging because:

Most videos lack clear labels for actions.
Video contexts often differ from the environments where robots operate.

Advancements in Vision-Language Models (VLMs)

VLMs trained on large datasets of text, images, and videos can understand and generate both text and multimodal data. Adding auxiliary tasks during training has improved performance further. Yet these methods still depend on labeled action data, which limits the scalability of developing general VLAs.

Training Robot Policies from Videos

Videos are rich in information about dynamics and behavior, which can help robots learn. Some recent studies use generative models trained on human videos to improve performance on robotic tasks. However, current methods often require dedicated human-robot data or work only for specific tasks.

LAPA: A New Approach

Researchers from various institutions introduced Latent Action Pretraining for General Action models (LAPA). This unsupervised method uses internet-scale videos that have no robot action labels.

How LAPA Works

LAPA has two stages:

**First Stage**: A VQ-VAE-based objective learns discrete latent actions from the changes between video frames.
**Second Stage**: A Vision-Language Model is pretrained to predict these latent actions from video observations and task descriptions, then fine-tuned on a small robot dataset to map latent actions to real robot actions.
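
The quantization idea behind the first stage can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the hand-written codebook, the 2-D frame features, and the helper names `quantize` and `latent_action` are all assumptions made for the example. In a real VQ-VAE the encoder and codebook are learned jointly; here the "encoder" is simply the difference between two frame feature vectors.

```python
def quantize(delta, codebook):
    """Return (index, code) of the codebook entry closest to delta (squared L2)."""
    def dist(i):
        return sum((d - c) ** 2 for d, c in zip(delta, codebook[i]))
    best_idx = min(range(len(codebook)), key=dist)
    return best_idx, codebook[best_idx]


def latent_action(frame_t, frame_next, codebook):
    """Map a pair of frame features to a discrete latent action.

    Assumption: the frame-to-frame feature difference stands in for a
    learned encoder output.
    """
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    return quantize(delta, codebook)


# Hypothetical 4-entry codebook: each entry is one discrete "latent action".
codebook = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Hypothetical 2-D features for two consecutive video frames.
frame_t = [0.2, 0.5]
frame_next = [1.1, 0.6]

idx, code = latent_action(frame_t, frame_next, codebook)
print(idx, code)  # prints: 0 [1.0, 0.0]
```

The index `idx` plays the role of a latent action token: it is the kind of discrete label a VLM could be trained to predict in the second stage, without any ground-truth robot action.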

Key Benefits of LAPA

LAPA outperforms previous models such as OpenVLA, achieving:

Better pretraining efficiency, using only 272 H100-hours versus 21,500 A100-hours.
Improved performance in real-world tasks requiring language conditioning and generalization.
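
As a rough sanity check on the efficiency figures above, the ratio of raw GPU-hours can be computed directly. This is only a back-of-the-envelope comparison: H100 and A100 GPUs differ in throughput, so the raw ratio is not a like-for-like measure of compute, and the variable names are chosen for this example.

```python
# GPU-hour figures as quoted in the text above.
lapa_h100_hours = 272
openvla_a100_hours = 21_500

# Raw ratio of GPU-hours; ignores the speed difference between H100 and A100.
ratio = openvla_a100_hours / lapa_h100_hours
print(f"{ratio:.1f}x fewer raw GPU-hours")  # prints: 79.0x fewer raw GPU-hours
```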

Conclusion and Future Opportunities

LAPA is a scalable pre-training method for VLAs, demonstrating improved transfer to various tasks. Although LAPA shows limitations in fine-grained motion tasks, it offers significant advancements in robotic performance.

Future Directions

Potential areas for improvement include:

Expanding latent action generation for better fine-grained motion tasks.
Implementing hierarchical architectures to reduce latency during real-time inference.
