GRIF (Goal Representations for Instruction Following) is a model that combines language-conditioned and goal-conditioned training to improve robot learning. It uses contrastive learning to align language instructions with goal images, enabling a robot to understand and carry out tasks specified through either language or images. GRIF performs well on real-world manipulation tasks and generalizes better than baseline methods; the approach leaves room for further improvement and can be extended to leverage human video data for richer semantics.
Goal Representations for Instruction Following
A longstanding goal of the field of robot learning has been to create generalist agents that can perform tasks for humans. Natural language has the potential to be an easy-to-use interface for humans to specify arbitrary tasks, but it is difficult to train robots to follow language instructions. Approaches like language-conditioned behavioral cloning (LCBC) train policies to directly imitate expert actions conditioned on language, but require humans to annotate all training trajectories and generalize poorly across scenes and behaviors. Meanwhile, recent goal-conditioned approaches perform much better at general manipulation tasks, but do not enable easy task specification for human operators. How can we reconcile the ease of specifying tasks through LCBC-like approaches with the performance improvements of goal-conditioned learning?
Conceptualizing Instruction-Following Robots
An instruction-following robot needs two capabilities: grounding the language instruction in the physical environment and carrying out a sequence of actions to complete the task. These capabilities can be learned separately from appropriate data sources. Vision-language data from non-robot sources can help learn language grounding with generalization to diverse instructions and visual scenes. Unlabeled robot trajectories can be used to train a robot to reach specific goal states, even without associated language instructions.
Conditioning on visual goals (goal images) provides complementary benefits for policy learning. Goal images can be generated freely in hindsight from any trajectory, so goal-conditioned policies can be trained on large amounts of unannotated and unstructured trajectory data. However, goals are less intuitive for human users than natural language. By exposing a language interface for goal-conditioned policies, we can combine the strengths of goal- and language-based task specification to enable generalist robots that can be easily commanded.
Goal Representations for Instruction Following
Our approach, Goal Representations for Instruction Following (GRIF), jointly trains a language-conditioned and a goal-conditioned policy with aligned task representations. The GRIF model consists of a language encoder, a goal encoder, and a policy network: the encoders map language instructions and goal images into a shared task representation space, which conditions the policy network when predicting actions. The learned policies can generalize across language and scenes after training on mostly unlabeled demonstration data.
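The sketch below illustrates this structure. It is a minimal stand-in, not the authors' implementation: the tiny CNN backbone, embedding-based language encoder, dimensions, and module names are all assumptions chosen only to make the composition of goal encoder, language encoder, and policy network concrete.

```python
import torch
import torch.nn as nn

def small_cnn(in_channels, out_dim):
    # Tiny convolutional encoder used as a placeholder image backbone.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 4, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class GRIFSketch(nn.Module):
    """Hypothetical GRIF-style model: two task encoders plus a policy."""

    def __init__(self, task_dim=256, action_dim=7, vocab_size=10000):
        super().__init__()
        # Goal encoder: consumes the current image and the goal image together
        # (stacked on the channel axis) and produces a task representation.
        self.goal_encoder = small_cnn(in_channels=6, out_dim=task_dim)
        # Language encoder: maps a tokenized instruction into the same space.
        self.embed = nn.Embedding(vocab_size, task_dim)
        # Policy network: conditioned on the observation and the task representation.
        self.obs_encoder = small_cnn(in_channels=3, out_dim=task_dim)
        self.policy = nn.Sequential(
            nn.Linear(2 * task_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def encode_goal(self, obs_img, goal_img):
        return self.goal_encoder(torch.cat([obs_img, goal_img], dim=1))

    def encode_language(self, token_ids):
        return self.embed(token_ids).mean(dim=1)  # mean-pooled token embeddings

    def act(self, obs_img, task_repr):
        obs_feat = self.obs_encoder(obs_img)
        return self.policy(torch.cat([obs_feat, task_repr], dim=-1))
```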
GRIF is trained jointly with language-conditioned behavioral cloning (LCBC) and goal-conditioned behavioral cloning (GCBC). The labeled dataset contains both language and goal task specifications, so we use it to supervise both the language- and goal-conditioned predictions. The unlabeled dataset contains only goals and is used for GCBC. By aligning the representations between goal-conditioned and language-conditioned tasks, we can improve the transfer between the two modalities.
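A hedged sketch of this joint objective follows, reusing the GRIFSketch stand-in above. It assumes continuous actions supervised with a squared-error behavioral-cloning loss and equal weighting of the terms; both are illustrative choices, not details taken from the paper.

```python
import torch.nn.functional as F

def joint_bc_loss(model, labeled_batch, unlabeled_batch):
    # Labeled data: observation, goal image, instruction tokens, expert action.
    obs, goal, tokens, action = labeled_batch
    z_goal = model.encode_goal(obs, goal)
    z_lang = model.encode_language(tokens)
    lcbc_loss = F.mse_loss(model.act(obs, z_lang), action)  # language-conditioned BC
    gcbc_loss = F.mse_loss(model.act(obs, z_goal), action)  # goal-conditioned BC

    # Unlabeled data: no instructions, so it only contributes a GCBC term.
    obs_u, goal_u, action_u = unlabeled_batch
    z_goal_u = model.encode_goal(obs_u, goal_u)
    gcbc_loss_u = F.mse_loss(model.act(obs_u, z_goal_u), action_u)

    return lcbc_loss + gcbc_loss + gcbc_loss_u
```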
Alignment through Contrastive Learning
We explicitly align representations between goal-conditioned and language-conditioned tasks through contrastive learning. Rather than encoding goal images in isolation, we align representations of (state, goal) pairs with the corresponding language instruction; this focuses the representation on the change from state to goal and makes it easier to learn. We learn this alignment structure through an InfoNCE objective on instructions and images from the labeled dataset. The objective encourages high similarity between representations of the same task and low similarity between representations of different tasks, as in the sketch below.
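The following is a minimal InfoNCE sketch under the assumptions above (a symmetric cross-entropy form with a fixed temperature, which is one common instantiation rather than necessarily the paper's exact loss): each (state, goal) representation should score highest against its own instruction's representation within a batch and low against all others.

```python
import torch
import torch.nn.functional as F

def infonce_alignment_loss(z_state_goal, z_lang, temperature=0.1):
    # z_state_goal, z_lang: [batch, dim] representations of the same batch of tasks.
    z_sg = F.normalize(z_state_goal, dim=-1)
    z_l = F.normalize(z_lang, dim=-1)
    logits = z_sg @ z_l.t() / temperature            # pairwise similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: match state-goal pairs to instructions and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```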
We modify the CLIP architecture so that it can encode pairs of state and goal images, and fine-tune it for aligning task representations. This lets us encode (state, goal) pairs effectively while preserving the benefits of CLIP pre-training.
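One plausible way to adapt a pretrained CLIP-style image encoder to (state, goal) pairs is to widen its first convolution or patch-embedding layer from 3 to 6 input channels and initialize the new channels from the pretrained weights, which keeps the output scale unchanged at initialization. This is an assumption about the mechanics, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn

def widen_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    """Return a copy of `conv` that accepts twice as many input channels."""
    new_conv = nn.Conv2d(
        in_channels=2 * conv.in_channels,
        out_channels=conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Copy the pretrained weights into both the state and goal channel groups,
        # halving them so the initial activations match the original single-image model.
        w = conv.weight / 2.0
        new_conv.weight.copy_(torch.cat([w, w], dim=1))
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```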
Robot Policy Results
We evaluated the GRIF policy in the real world on 15 tasks across 3 scenes. GRIF showed the best generalization among the methods we compared, along with strong manipulation capabilities: it grounded language instructions correctly and carried out the commanded task even when multiple tasks were possible in the same scene, outperforming the baselines in both generalization and manipulation.
Conclusion
GRIF enables a robot to utilize large amounts of unlabeled trajectory data to learn goal-conditioned policies while providing a “language interface” to these policies via aligned language-goal task representations. Our approach significantly improves performance over baselines and over methods that only use language-annotated data.