Meet LEO: A Groundbreaking Embodied Multi-Modal Agent for Advanced 3D World Interaction and Task Solving

LEO is a generalist agent developed by researchers at the Beijing Institute for General Artificial Intelligence, CMU, Peking University, and Tsinghua University. It is built on an LLM-based architecture and can perceive, reason, plan, and act in complex 3D environments. LEO combines 3D vision-language alignment with action, and has demonstrated proficiency in tasks such as navigation and robotic manipulation. The team curated a large dataset and used scene-graph-based prompting and refinement methods to improve data quality. LEO’s responses are grounded in spatial relations and show a concrete understanding of objects and actions in the scenes.

AI systems that can handle multiple tasks or domains without the need for extensive reprogramming or retraining are known as generalist agents. These agents are designed to generalize knowledge and skills across various domains, enabling them to solve different problems with flexibility and adaptability. In training or research simulations, generalist agents in 3D environments can adapt to different scenarios, learn from experiences, and perform tasks within the virtual space. For example, in pilot or surgeon training simulations, these agents can replicate various scenarios and respond accordingly.

However, generalist agents face challenges in 3D worlds, such as handling the complexity of three-dimensional spaces, learning representations that generalize across diverse environments, and making decisions considering the multi-dimensional nature of their surroundings. To navigate and interact effectively within these environments, these agents often employ techniques from reinforcement learning, computer vision, and spatial reasoning.

Researchers from the Beijing Institute for General Artificial Intelligence, CMU, Peking University, and Tsinghua University have developed a generalist agent called LEO. LEO is a multi-modal, multi-task agent with a generic embodiment. It can perceive, ground, reason, plan, and act using a single shared model architecture and set of weights. It uses an egocentric 2D image encoder for the embodied view and an object-centric 3D point cloud encoder for the third-person global perspective.
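To make the two-encoder idea concrete, here is a minimal sketch of how an instruction, one egocentric frame, and one token per 3D object could be lined up as a single input sequence. It is an illustration under stated assumptions, not LEO's implementation: `ObjectToken`, `encode_object`, and `build_input_sequence` are hypothetical names, and the centroid-plus-extent summary is a hand-written stand-in for a learned point cloud encoder.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectToken:
    """Hypothetical stand-in for one learned object-centric embedding."""
    label: str
    centroid: Point
    extent: Point


def encode_object(label: str, points: Sequence[Point]) -> ObjectToken:
    """Summarize one object's point cloud as a centroid and bounding-box extent.

    A real encoder would output a learned feature vector; this summary is
    only an assumption made to keep the sketch self-contained.
    """
    xs, ys, zs = zip(*points)
    n = len(points)
    centroid = (sum(xs) / n, sum(ys) / n, sum(zs) / n)
    extent = (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))
    return ObjectToken(label, centroid, extent)


def build_input_sequence(
    instruction: str,
    ego_image_id: str,
    scene: Dict[str, Sequence[Point]],
) -> List[Tuple[str, object]]:
    """Order the inputs as: text instruction, egocentric image, one token per object."""
    sequence: List[Tuple[str, object]] = [
        ("text", instruction),
        ("image", ego_image_id),  # placeholder for the 2D image encoder output
    ]
    for label, points in scene.items():
        sequence.append(("object", encode_object(label, points)))
    return sequence
```

Because each observed object contributes exactly one entry, the sequence grows with the number of objects in the scene rather than with the raw number of points.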

LEO can be trained with task-agnostic inputs and outputs using autoregressive training objectives. The 3D encoder generates an object-centric token for each observed entity, allowing for flexibility in adapting to tasks with different embodiments. The training data for LEO consisted of extensive object-level and scene-level multi-modal tasks in the 3D world, curated and generated by the research team.
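An autoregressive objective, in its generic form, scores each target token by its negative log-likelihood given everything that came before it. The sketch below shows that computation in plain Python. It is the standard next-token loss, not necessarily the exact loss configuration the LEO authors used, and the function names `log_softmax` and `autoregressive_nll` are chosen here for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_nll(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the target tokens.

    step_logits[t] holds the model's logits for position t, computed from
    the prefix before t (teacher forcing); target_ids[t] is the true token.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one non-empty logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)
```

With this kind of objective, text answers and action outputs can share one loss as long as both are expressed as tokens in the same output vocabulary, which is what makes task-agnostic inputs and outputs practical.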

To improve the quality of the generated data and enhance its scale and diversity, the team proposed scene-graph-based prompting and refinement methods, as well as Object-centric Chain-of-Thought (O-CoT) techniques. LEO was extensively evaluated and demonstrated proficiency in diverse tasks, including embodied navigation and robotic manipulation. The team also observed consistent performance gains when scaling up the training data.
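As a rough illustration of scene-graph-based prompting, the sketch below serializes a small scene graph (objects plus spatial relations) into a text prompt that could be handed to a language model to generate training data. The prompt wording, the `scene_graph_to_prompt` function, and the example object names are all assumptions made for illustration; the team's actual prompt templates and refinement steps are not described in this article.

```python
from typing import List, Tuple

# (subject, relation, object), e.g. ("chair-1", "next to", "table-2")
Relation = Tuple[str, str, str]


def scene_graph_to_prompt(
    objects: List[str],
    relations: List[Relation],
    task: str,
) -> str:
    """Flatten a scene graph into a plain-text prompt (hypothetical format)."""
    lines = ["Objects in the scene:"]
    lines += [f"- {name}" for name in objects]
    lines.append("Spatial relations:")
    lines += [f"- {subj} is {rel} {obj}" for subj, rel, obj in relations]
    lines.append(f"Task: {task}")
    return "\n".join(lines)


example_prompt = scene_graph_to_prompt(
    objects=["chair-1", "table-2"],
    relations=[("chair-1", "next to", "table-2")],
    task="Write one question about this scene and answer it.",
)
```

Grounding the prompt in explicit objects and relations gives a later refinement step something to check generated answers against, for example rejecting an answer that mentions an object absent from the graph.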

The results show that LEO’s responses incorporate rich spatial relations and are precisely grounded in the 3D scenes. Learning both jointly also demonstrated that it is feasible to bridge the gap between 3D vision-language understanding and embodied action.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.
