Researchers from ML Foundations have released Gelato-30B-A3B, a grounding model that helps AI agents locate and interact with specific elements on graphical user interfaces (GUIs) from natural language instructions. Trained on the Click 100k dataset, it reports higher grounding accuracy than earlier models such as GTA1-32B and than much larger vision-language models such as Qwen3-VL-235B-A22B-Instruct.
Understanding the Target Audience
The primary audience for Gelato-30B-A3B encompasses:
- AI researchers and developers interested in cutting-edge grounding models.
- Business managers looking to implement AI solutions for GUI tasks.
- Technical teams aiming to enhance user interactions with software applications.
Key pain points for this audience include:
- Challenges in achieving reliable AI interactions across diverse graphical user interfaces.
- Difficulties in integrating AI models into existing workflows.
- The need for improved accuracy in AI-driven tasks to boost productivity.
Their goals typically involve:
- Implementing AI solutions that can accurately interpret user commands.
- Reducing the time and effort needed for software navigation.
- Enhancing the user experience through seamless AI interactions.
The audience is likely to be interested in:
- Recent advancements in AI and machine learning.
- Practical applications of AI in business settings.
- Data-driven insights into user behavior and software usage.
What Gelato-30B-A3B Does in an Agent Stack
Gelato-30B-A3B is a 31 billion parameter mixture-of-experts model fine-tuned from Qwen3-VL-30B-A3B-Instruct. It takes a screenshot and a textual instruction as input and outputs a click coordinate. As a modular grounding component, it lets a planner model, such as GPT-5, decide the high-level action while Gelato resolves that action to a precise click location across operating systems and applications.
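The planner and grounder split described above can be sketched as a single agent step. This is only an illustration: the `Planner` and `Grounder` signatures, `ClickAction`, and the stub models are hypothetical names introduced here, not part of the Gelato release.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the released API):
# the planner maps (screenshot, task) to the next low-level instruction,
# or None when it considers the task finished.
Planner = Callable[[bytes, str], Optional[str]]
# the grounder maps (screenshot, instruction) to an (x, y) pixel click.
Grounder = Callable[[bytes, str], Tuple[int, int]]


@dataclass
class ClickAction:
    instruction: str
    x: int
    y: int


def agent_step(screenshot: bytes, task: str,
               planner: Planner, grounder: Grounder) -> Optional[ClickAction]:
    """Run one plan-then-ground step; return None when the planner is done."""
    instruction = planner(screenshot, task)
    if instruction is None:
        return None
    x, y = grounder(screenshot, instruction)
    return ClickAction(instruction, x, y)


# Stub models so the sketch runs without any model weights.
def stub_planner(screenshot: bytes, task: str) -> Optional[str]:
    return "click the Save button"


def stub_grounder(screenshot: bytes, instruction: str) -> Tuple[int, int]:
    return (640, 360)


action = agent_step(b"", "save the document", stub_planner, stub_grounder)
print(action)
```

Keeping the grounder behind a narrow interface like this is what lets the planner be swapped independently of the click model.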
Click 100k: A Targeted Dataset for GUI Grounding
Gelato-30B-A3B is trained on the Click 100k dataset, which pairs computer screen images with natural language instructions, bounding boxes for target elements, image dimensions, and normalized bounding boxes. Each sample is phrased as a low-level command, such as “tap on the element between Background and Notifications options,” paired with the exact target region.
The dataset is built by filtering and unifying multiple public sources, including:
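To make the sample layout concrete, here is a minimal sketch of one record. The field names, the `(x_min, y_min, x_max, y_max)` pixel convention, and normalization to the range 0 to 1 are assumptions made for illustration; the released schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClickSample:
    # All field names are hypothetical, chosen to mirror the description above.
    image_path: str
    instruction: str
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    width: int   # image width in pixels
    height: int  # image height in pixels

    @property
    def normalized_bbox(self) -> Tuple[float, float, float, float]:
        """Bounding box divided by the image dimensions (assumed 0 to 1 range)."""
        x0, y0, x1, y1 = self.bbox
        return (x0 / self.width, y0 / self.height,
                x1 / self.width, y1 / self.height)


sample = ClickSample(
    image_path="settings.png",
    instruction="tap on the element between Background and Notifications options",
    bbox=(100, 200, 300, 250),
    width=1000,
    height=500,
)
print(sample.normalized_bbox)
```

Storing the image dimensions alongside the pixel box means the normalized box can always be recomputed, which helps when sources use different screen resolutions.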
- ShowUI
- AutoGUI
- PC Agent E
- WaveUI
- OS Atlas
- UGround
- PixMo Points
- SeeClick
- UI VISION
- JEDI subset focusing on spreadsheet and text cell manipulation
Each source contributes at most 50,000 samples, and all samples are mapped into a shared schema. The research team then applies a filtering pipeline that keeps only relevant and accurate samples.
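The per-source cap can be illustrated with a short sketch. The article does not say how samples are chosen when a source exceeds the cap, so the seeded random subsampling below is an assumption, as are the `source` key and the function name.

```python
import random
from collections import defaultdict


def cap_per_source(samples, max_per_source=50_000, seed=0):
    """Keep at most max_per_source samples from each source.

    Random subsampling is an assumption; the actual selection rule
    used for Click 100k is not described in the article.
    """
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample["source"]].append(sample)

    rng = random.Random(seed)  # seeded so the subset is reproducible
    capped = []
    for source in sorted(by_source):
        items = by_source[source]
        if len(items) > max_per_source:
            items = rng.sample(items, max_per_source)
        capped.extend(items)
    return capped


# Toy data with a small cap so the effect is visible.
toy = ([{"source": "ShowUI", "id": i} for i in range(5)]
       + [{"source": "SeeClick", "id": i} for i in range(2)])
capped = cap_per_source(toy, max_per_source=3)
print(len(capped))
```

A cap of this kind keeps any single large source from dominating the mixture.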
GRPO Training on Top of Qwen3 VL
Gelato-30B-A3B is trained with GRPO (Group Relative Policy Optimization), a reinforcement learning algorithm. The model is initialized from Qwen3-VL-30B-A3B-Instruct and trained for 100 reinforcement learning steps on 32 A100 GPUs with 40 GB of memory each. The reported benchmark accuracies are:
- 63.88% on ScreenSpot Pro
- 67.19% on OS World G
- 73.40% on OS World G Refined
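The core idea of GRPO, scoring each sampled response relative to the other responses for the same prompt, can be sketched as follows. The binary click-inside-the-box reward is an assumption for illustration, since the article does not state the reward used, and the standard-deviation normalization follows the common GRPO formulation, which some variants omit.

```python
from statistics import mean, pstdev


def click_reward(click, bbox):
    """Assumed reward: 1.0 if the click lands inside the target box, else 0.0."""
    x, y = click
    x0, y0, x1, y1 = bbox
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled clicks (a "group" in GRPO terms).
bbox = (100, 200, 300, 250)
clicks = [(150, 220), (400, 220), (290, 249), (10, 10)]
rewards = [click_reward(c, bbox) for c in clicks]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Clicks that hit the target receive a positive advantage and misses a negative one; when every click in a group earns the same reward, all advantages are zero and that group contributes no learning signal.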
A simple refusal prompting strategy, in which the model is prompted so that it can decline when the target element cannot be found, raises the OS World G scores further:
- 69.15% on OS World G
- 74.65% on OS World G Refined
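One way to see why refusal can raise a grounding score is a toy accuracy function in which some samples have no valid target. The scoring convention here (a click counts if it falls inside the box, and a refusal counts only when no target exists) is an assumption for illustration, not the benchmark's official scorer.

```python
def grounding_accuracy(predictions, targets):
    """Fraction of samples answered correctly.

    predictions: list of (x, y) clicks, or None for a refusal.
    targets: list of (x_min, y_min, x_max, y_max) boxes, or None when the
             instruction has no valid target on screen.
    """
    correct = 0
    for pred, bbox in zip(predictions, targets):
        if bbox is None:
            # No valid target: only a refusal is counted as correct.
            correct += pred is None
        elif pred is not None:
            x, y = pred
            x0, y0, x1, y1 = bbox
            correct += x0 <= x <= x1 and y0 <= y <= y1
    return correct / len(targets)


preds = [(150, 220), None, (5, 5), None]
targets = [(100, 200, 300, 250), None, (100, 200, 300, 250), (0, 0, 10, 10)]
print(grounding_accuracy(preds, targets))
```

Under this convention a model that never refuses cannot score on the no-target samples, so enabling refusal can only help on that slice, at the cost of occasionally refusing when a target does exist.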
End-to-End Agent Results on OS World
Gelato-30B-A3B is also evaluated end to end inside the GTA1.5 agent framework. In this setup, GPT-5 acts as the planner while Gelato provides grounding, achieving:
- 58.71% automated success rate on OS World tasks
- 61.85% success rate under human evaluation
Key Takeaways
Gelato-30B-A3B outperforms earlier GUI grounding models such as GTA1-32B, as well as larger vision-language models, on the reported benchmarks. Training on the Click 100k dataset with GRPO reinforcement learning improves both grounding accuracy and end-to-end agent performance. For further exploration, visit the GitHub repository for tutorials, code, and notebooks.
FAQs
- What is Gelato-30B-A3B? Gelato-30B-A3B is a grounding model designed to improve AI agents’ ability to find and interact with GUI elements based on natural language instructions.
- How does Gelato-30B-A3B improve accuracy? It uses a combination of a specialized dataset (Click 100k) and reinforcement learning techniques to enhance its performance.
- What is the Click 100k dataset? It is a dataset that pairs images of computer screens with natural language commands, providing the necessary training data for the model.
- Who can benefit from Gelato-30B-A3B? AI researchers, business managers, and technical teams focused on improving user interactions with software can all benefit from this model.
- What are the potential applications of this model? It can be used in various applications, including software navigation, automated user interface interactions, and enhancing user experience across platforms.