Revolutionizing GUI Interaction: Gelato-30B-A3B Grounding Model Explained for AI Developers

Researchers from ML Foundations have released Gelato-30B-A3B, a grounding model that helps AI agents locate and interact with specific elements on graphical user interfaces (GUIs) from natural language instructions. Trained on the Click 100k dataset, it reports higher grounding accuracy than prior grounding models such as GTA1-32B and than larger vision-language models such as Qwen3-VL-235B-A22B-Instruct.

Understanding the Target Audience

The primary audience for Gelato-30B-A3B encompasses:

AI researchers and developers interested in cutting-edge grounding models.
Business managers looking to implement AI solutions for GUI tasks.
Technical teams aiming to enhance user interactions with software applications.

Key pain points for this audience include:

Challenges in achieving reliable AI interactions across diverse graphical user interfaces.
Difficulties in integrating AI models into existing workflows.
The need for improved accuracy in AI-driven tasks to boost productivity.

Their goals typically involve:

Implementing AI solutions that can accurately interpret user commands.
Reducing the time and effort needed for software navigation.
Enhancing the user experience through seamless AI interactions.

The audience is likely to be interested in:

Recent advancements in AI and machine learning.
Practical applications of AI in business settings.
Data-driven insights into user behavior and software usage.

What Gelato-30B-A3B Does in an Agent Stack

Gelato-30B-A3B is a 31 billion parameter mixture-of-experts model fine-tuned from Qwen3-VL-30B-A3B-Instruct. It takes a screenshot and a textual instruction as input and outputs click coordinates. As a modular grounding component, it lets a planner model such as GPT-5 decide the high-level action and then rely on Gelato to resolve where to click, across operating systems and applications.
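The planner/grounder split described above can be sketched as two interchangeable callables. This is only an illustration of the modular design: the names `Planner`, `Grounder`, `ClickAction`, and `agent_step` are hypothetical, and the stubs stand in for real calls to a planner model and to Gelato, whose actual interfaces are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical type aliases for illustration; not part of any released API.
Screenshot = bytes
Planner = Callable[[str, Screenshot], str]               # (task, screenshot) -> low-level instruction
Grounder = Callable[[Screenshot, str], Tuple[int, int]]  # (screenshot, instruction) -> (x, y)


@dataclass(frozen=True)
class ClickAction:
    instruction: str
    x: int
    y: int


def agent_step(task: str, screenshot: Screenshot,
               planner: Planner, grounder: Grounder) -> ClickAction:
    """Run one step: the planner picks what to do, the grounder picks where to click."""
    instruction = planner(task, screenshot)
    x, y = grounder(screenshot, instruction)
    return ClickAction(instruction, x, y)


# Stubs standing in for a planner model and a grounding model.
def stub_planner(task: str, screenshot: Screenshot) -> str:
    return "click the Save button"


def stub_grounder(screenshot: Screenshot, instruction: str) -> Tuple[int, int]:
    return (640, 360)


action = agent_step("save the document", b"", stub_planner, stub_grounder)
```

Because the grounder is just a function from a screenshot and an instruction to a point, it can be swapped without touching the planner, which is the property the modular setup relies on.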

Click 100k: A Targeted Dataset for GUI Grounding

The backbone of Gelato-30B-A3B is the Click 100k dataset, which pairs computer screen images with natural language instructions, bounding boxes for target elements, image dimensions, and normalized bounding boxes. Each sample is structured as a low-level command, such as “tap on the element between Background and Notifications options,” with precise regions defined.
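A record with the fields listed above might look like the following sketch. The field names, the `(x_min, y_min, x_max, y_max)` box layout, and the choice to normalize to the 0 to 1 range by dividing by image width and height are all assumptions for illustration; the dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed layout: (x_min, y_min, x_max, y_max)
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ClickSample:
    """Hypothetical record mirroring the fields described in the text."""
    image_path: str
    instruction: str
    bbox: BBox      # target element in pixels
    width: int      # image width in pixels
    height: int     # image height in pixels

    @property
    def normalized_bbox(self) -> BBox:
        # Assumption: normalization divides by image size, giving values in [0, 1].
        x0, y0, x1, y1 = self.bbox
        return (x0 / self.width, y0 / self.height,
                x1 / self.width, y1 / self.height)


sample = ClickSample(
    image_path="settings.png",  # made-up file name
    instruction="tap on the element between Background and Notifications options",
    bbox=(100, 200, 300, 250),
    width=1000,
    height=500,
)
print(sample.normalized_bbox)
```

Storing the pixel box together with the image dimensions keeps the normalized box derivable, so the two representations cannot drift apart.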

The dataset is built by filtering and unifying multiple public sources, including:

ShowUI
AutoGUI
PC Agent E
WaveUI
OS Atlas
UGround
PixMo Points
SeeClick
UI VISION
A JEDI subset focused on spreadsheet and text cell manipulation

Each source contributes at most 50,000 samples, and all samples are mapped into a shared schema. The research team then applies a filtering pipeline intended to keep only relevant, accurately labeled samples.
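The per-source cap and the merge into one schema can be sketched as follows. The function name `cap_per_source`, the `source` tag added to each record, and the use of seeded random subsampling when a source exceeds the cap are assumptions; the text does not say how samples are selected.

```python
import random
from typing import Any, Dict, List


def cap_per_source(samples_by_source: Dict[str, List[Dict[str, Any]]],
                   cap: int = 50_000,
                   seed: int = 0) -> List[Dict[str, Any]]:
    """Merge sources into one list, keeping at most `cap` samples from each."""
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    merged: List[Dict[str, Any]] = []
    for source, samples in samples_by_source.items():
        # Assumption: oversized sources are subsampled uniformly at random.
        kept = samples if len(samples) <= cap else rng.sample(samples, cap)
        # Tag each record with its origin so later filters can be source-aware.
        merged.extend({"source": source, **s} for s in kept)
    return merged


# Tiny demo with a cap of 2 instead of 50,000.
demo = {
    "ShowUI": [{"instruction": f"click item {i}"} for i in range(5)],
    "SeeClick": [{"instruction": "open the menu"}],
}
merged = cap_per_source(demo, cap=2)
```

Capping each source keeps any single large collection from dominating the mixture, which is the stated purpose of the limit.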

GRPO Training on Top of Qwen3 VL

Gelato-30B-A3B is trained with GRPO (Group Relative Policy Optimization), a reinforcement learning algorithm. The model is initialized from Qwen3-VL-30B-A3B-Instruct and runs 100 reinforcement learning steps on 32 A100 GPUs with 40 GB of memory each. The reported benchmark accuracies are:

63.88% on ScreenSpot Pro
67.19% on OS World G
73.40% on OS World G Refined

A simple refusal prompting strategy improves the OS World G scores further, by 1.96 and 1.25 percentage points respectively, to:

69.15% on OS World G
74.65% on OS World G Refined
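The core idea of GRPO, as used in this section, is to score a group of sampled outputs for the same prompt and compare each one with the group average. The sketch below shows that group-relative advantage in its common mean-and-standard-deviation form. The binary reward (1 if the predicted click lands inside the target bounding box, 0 otherwise) is an assumption for illustration, as is the exact normalization; the article does not specify the reward or the GRPO variant used.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def click_reward(click: Tuple[float, float], bbox: BBox) -> float:
    """Assumed reward: 1.0 if the click falls inside the target box, else 0.0."""
    x, y = click
    x0, y0, x1, y1 = bbox
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled clicks for one (screenshot, instruction) pair.
target = (100.0, 200.0, 300.0, 250.0)
clicks = [(150.0, 220.0), (10.0, 10.0), (290.0, 249.0), (400.0, 400.0)]
rewards = [click_reward(c, target) for c in clicks]
advantages = group_relative_advantages(rewards)
```

Samples that hit the target receive a positive advantage and misses a negative one, and a group where every sample scores the same contributes no learning signal because all advantages are zero.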

End-to-End Agent Results on OS World

Gelato-30B-A3B was also evaluated end to end inside the GTA1.5 agent framework. In this setup, GPT-5 acts as the planner while Gelato provides grounding, and the agent achieves:

58.71% automated success rate on OS World tasks
61.85% success rate under human evaluation

Key Takeaways

On the reported benchmarks, Gelato-30B-A3B outperforms earlier grounding models such as GTA1-32B as well as larger vision-language models. Training on the Click 100k dataset combined with GRPO reinforcement learning improves both grounding accuracy and end-to-end agent performance. For further exploration, the GitHub repository offers tutorials, code, and notebooks.

FAQs

What is Gelato-30B-A3B? Gelato-30B-A3B is a grounding model designed to improve AI agents’ ability to find and interact with GUI elements based on natural language instructions.
How does Gelato-30B-A3B improve accuracy? It uses a combination of a specialized dataset (Click 100k) and reinforcement learning techniques to enhance its performance.
What is the Click 100k dataset? It is a dataset that pairs images of computer screens with natural language commands, providing the necessary training data for the model.
Who can benefit from Gelato-30B-A3B? AI researchers, business managers, and technical teams focused on improving user interactions with software can all benefit from this model.
What are the potential applications of this model? It can be used in various applications, including software navigation, automated user interface interactions, and enhancing user experience across platforms.
