Itinai.com llm large language model structure neural network 3ca9a360 5bda 4524 a7b9 b878349f3823 0
Itinai.com llm large language model structure neural network 3ca9a360 5bda 4524 a7b9 b878349f3823 0

Latent Action Pretraining for General Action models (LAPA): An Unsupervised Method for Pretraining Vision-Language-Action (VLA) Models without Ground-Truth Robot Action Labels

Latent Action Pretraining for General Action models (LAPA): An Unsupervised Method for Pretraining Vision-Language-Action (VLA) Models without Ground-Truth Robot Action Labels

Vision-Language-Action Models (VLA) for Robotics

VLA models combine large language models with vision encoders and are fine-tuned on robot datasets. This enables robots to understand new instructions and recognize unfamiliar objects. However, most robot datasets require human control, making it hard to scale. In contrast, using Internet video data offers more examples of human actions and interactions, which can improve scalability.

Challenges with Internet Videos

Learning from online videos is challenging because:

  • Most videos lack clear labels for actions.
  • Video contexts often differ from the environments where robots operate.

Advancements in Vision-Language Models (VLMs)

VLMs trained on large datasets of text, images, and videos can understand and generate both text and multimodal data. By adding auxiliary tasks, the performance during training has improved. Yet, these methods still depend on labeled action data, which limits the scalability of developing general VLAs.

Training Robot Policies from Videos

Using videos rich in dynamics and behavior can help robots learn better. Some recent studies use generative models trained on human videos to enhance robotic tasks. However, current methods often need specific human-robot data or are too task-specific.

LAPA: A New Approach

Researchers from various institutions introduced Latent Action Pre Training for General Action models (LAPA). This unsupervised method utilizes internet-scale videos without labeled robot actions.

How LAPA Works

LAPA includes:

  • **First Stage**: Using a VQ-VAE-based method to break actions into smaller parts.
  • **Second Stage**: A Vision-Language Model predicts latent actions from video and task descriptions, followed by fine-tuning on a small robot dataset.

Key Benefits of LAPA

LAPA outperforms previous models like OPENVLA, achieving:

  • Better efficiency, using only 272 H100 hours vs. 21,500 A100-hours.
  • Improved performance in real-world tasks requiring language conditioning and generalization.

Conclusion and Future Opportunities

LAPA is a scalable pre-training method for VLAs, demonstrating improved transfer to various tasks. Although LAPA shows limitations in fine-grained motion tasks, it offers significant advancements in robotic performance.

Future Directions

Potential areas for improvement include:

  • Expanding latent action generation for better fine-grained motion tasks.
  • Implementing hierarchical architectures to reduce latency during real-time inference.

Discover More

For more details, check out the Paper, Model Card on HuggingFace, and Project Page. Follow us on Twitter, join our Telegram Channel, and be part of our LinkedIn Group.

For AI advancement opportunities and insights, connect with us at hello@itinai.com or follow us on Telegram and Twitter.

Upcoming Live Webinar

Oct 29, 2024 – Learn about the best platform for serving fine-tuned models: Predibase Inference Engine.

List of Useful Links:

Itinai.com office ai background high tech quantum computing 0002ba7c e3d6 4fd7 abd6 cfe4e5f08aeb 0

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.

Unleash Your Creative Potential with AI Agents

Competitors are already using AI Agents

Business Problems We Solve

  • Automation of internal processes.
  • Optimizing AI costs without huge budgets.
  • Training staff, developing custom courses for business needs
  • Integrating AI into client work, automating first lines of contact

Large and Medium Businesses

Startups

Offline Business

100% of clients report increased productivity and reduced operati

AI news and solutions