Latent Action Pretraining for General Action models (LAPA): An Unsupervised Method for Pretraining Vision-Language-Action (VLA) Models without Ground-Truth Robot Action Labels

Vision-Language-Action (VLA) Models for Robotics

VLA models combine large language models with vision encoders and are fine-tuned on robot datasets. This enables robots to follow new instructions and recognize unfamiliar objects. However, most robot datasets are collected through human teleoperation, which makes them hard to scale. Internet video, by contrast, offers far more examples of human actions and physical interactions, making it a more scalable source of training data.

Challenges with Internet Videos

Learning from online videos is challenging because:

  • Most videos lack clear labels for actions.
  • Video contexts often differ from the environments where robots operate.

Advancements in Vision-Language Models (VLMs)

VLMs trained on large datasets of text, images, and videos can understand and generate both text and multimodal data. Adding auxiliary tasks during VLA training has further improved performance. Yet these methods still depend on labeled action data, which limits how far general-purpose VLAs can scale.

Training Robot Policies from Videos

Videos rich in dynamics and behavior can help robots learn better policies. Some recent studies use generative models trained on human videos to improve performance on robotic tasks. However, existing methods often require paired human-robot data or are tailored to specific tasks.

LAPA: A New Approach

Researchers from various institutions introduced Latent Action Pretraining for General Action models (LAPA), an unsupervised method that pretrains on internet-scale videos without labeled robot actions.

How LAPA Works

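LAPA pretrains in two stages and then fine-tunes:

  • **First Stage**: A VQ-VAE-based objective learns discrete latent actions that capture the change between video frames.
  • **Second Stage**: A Vision-Language Model is pretrained to predict these latent actions from video observations and task descriptions. The model is then fine-tuned on a small robot dataset to map latent actions to real robot actions.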
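
To make the quantization idea in the first stage concrete, here is a minimal sketch of the vector-quantization step at the heart of any VQ-VAE: a continuous vector describing a frame-to-frame change is snapped to its nearest codebook entry, and the entry's index serves as the discrete latent action. Everything here is an illustrative assumption rather than LAPA's actual implementation: the "encoder" is simply the difference of two toy embeddings, and `CODEBOOK`, `nearest_code`, and `quantize_transition` are hypothetical names. In the real method both the encoder and the codebook are learned.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize_transition(frame_t, frame_next, codebook):
    """Map a pair of frame embeddings to a discrete latent action.

    Assumption: the change between frames is modeled as a plain vector
    difference. A learned encoder network would replace this in practice.
    """
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    index = nearest_code(delta, codebook)
    return index, codebook[index]


# Hypothetical toy codebook holding four 2-D "latent actions".
CODEBOOK = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

frame_t = [0.2, 0.5]
frame_next = [1.1, 0.6]  # the scene moved mostly along the first axis

index, code = quantize_transition(frame_t, frame_next, CODEBOOK)
print(index, code)  # prints: 0 [1.0, 0.0]
```

The discrete index is what makes the second stage possible: a Vision-Language Model can predict it as an ordinary token, so no ground-truth robot action is needed during pretraining.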

Key Benefits of LAPA

LAPA outperforms previous models like OpenVLA, achieving:

  • Better pretraining efficiency, using only 272 H100-hours vs. 21,500 A100-hours for OpenVLA.
  • Improved performance in real-world tasks requiring language conditioning and generalization.
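
As a quick sanity check on the efficiency figures above, the raw GPU-hour ratio can be computed directly. This is only a back-of-the-envelope sketch: the two numbers are measured on different GPU types (H100 vs. A100), so the raw ratio overstates the hardware-normalized gap, and no normalization factor is assumed here.

```python
lapa_gpu_hours = 272        # H100-hours, as quoted above
openvla_gpu_hours = 21_500  # A100-hours, as quoted above

# Raw ratio of GPU-hours, deliberately NOT adjusted for GPU generation.
raw_ratio = openvla_gpu_hours / lapa_gpu_hours
print(f"{raw_ratio:.1f}x fewer GPU-hours (not normalized for GPU type)")
# prints: 79.0x fewer GPU-hours (not normalized for GPU type)
```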

Conclusion and Future Opportunities

LAPA is a scalable pretraining method for VLAs that shows improved transfer to a range of downstream tasks. Although it remains limited on fine-grained motion tasks, it offers significant advances in robotic performance.

Future Directions

Potential areas for improvement include:

  • Expanding the latent action space to better handle fine-grained motion tasks.
  • Implementing hierarchical architectures to reduce latency during real-time inference.
