Convergence AI Releases WebGames: A Comprehensive Benchmark Suite Designed to Evaluate General-Purpose Web-Browsing AI Agents

Advancements in AI Agents

AI agents are becoming increasingly sophisticated and capable of managing complex tasks across various platforms. Websites and desktop applications, however, are designed for human interaction: they require an understanding of visual layouts, interactive elements, and time-sensitive behaviors. Performing these interactions, from simple clicks to intricate drag-and-drop sequences, remains a significant challenge for AI, which still falls short of human performance on web-related tasks. A comprehensive evaluation system is therefore essential to assess and improve AI agents for web browsing.

Limitations of Existing Benchmarks

Current benchmarks assess AI performance in specific web tasks, such as online shopping and flight booking, but do not adequately reflect the complexity of modern web interactions. Models like GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL face difficulties in navigation and task execution. Traditional evaluation frameworks, initially based on reinforcement learning, have expanded to web tasks but remain limited to short-context scenarios, resulting in incomplete assessments. Advanced web interactions require skills such as tool usage, planning, and environmental reasoning, which are not fully tested. Although multi-agent interactions are gaining traction, existing methods do not effectively evaluate collaboration and competition among AI systems.

Introducing WebGames

To overcome the limitations of current AI benchmarks for web interaction, researchers from Convergence Labs Ltd. and Clusterfudge Ltd. have developed WebGames, a framework that evaluates web-browsing AI agents through over 50 interactive challenges. These challenges cover basic browser usage, complex input handling, mental reasoning, workflow automation, and interactive entertainment. WebGames aims to provide precise measurements by isolating individual interaction skills while keeping the tested agent in control of each task. Its client-side design eliminates dependencies on external resources, ensuring consistent and reproducible testing.

Modular Design and Standardization

WebGames features a modular design in which each challenge is specified in a standardized JSONL format, making it easy to integrate with automated testing frameworks and to add new tasks. Every challenge follows a deterministic verification scheme, so successful completion can be confirmed automatically. This structure allows AI performance to be evaluated systematically through web interactions, quantifying navigation, decision-making, and adaptability in dynamic environments.
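To make the task format concrete, the sketch below shows how a harness might load such JSONL task definitions in Python. The field names and file name are illustrative assumptions, not WebGames' actual schema.

```python
import json

# Illustrative JSONL record; the field names are assumptions, not WebGames' real schema.
example_line = json.dumps({
    "id": "slider-symphony",
    "title": "Slider Symphony",
    "description": "Drag each slider to its target position.",
    "path": "/tasks/slider-symphony",     # served entirely client-side
    "success_signal": "completion-code",  # deterministic check emitted on success
})

def load_tasks(path):
    """Read one task definition per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A test harness could iterate over load_tasks("tasks.jsonl") and mark a task solved
# only when the page emits the expected deterministic success signal.
```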

Evaluation of Leading Models

Researchers assessed leading vision-language foundation models, including GPT-4o, Claude Computer-Use (Sonnet 3.5), Gemini-1.5-Pro, Qwen2-VL, and a Proxy assistant, using WebGames to evaluate their web interaction capabilities. Most of these models were not originally designed for web interaction and required scaffolding through a Chromium browser driven by Playwright. Except for Claude, the models lacked sufficient graphical user interface (GUI) grounding to identify exact pixel locations, so a Set-of-Marks (SoM) approach was used to highlight relevant elements. The models operated within a partially observed Markov decision process (POMDP), receiving JPEG screenshots and text-based SoM elements while executing tool-based actions through ReAct-style prompting.

The evaluation showed that Claude scored lower than GPT-4o despite having more precise web control, likely due to training restrictions that limit actions mimicking human behavior. Human participants recruited through Prolific completed the tasks efficiently, averaging 80 minutes and earning £18, with some achieving perfect scores. The findings highlight a significant capability gap between human and AI performance, reminiscent of the ARC challenge, with certain tasks, such as “Slider Symphony,” demanding precise drag-and-drop skills that proved difficult for AI models.
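A minimal sketch of the kind of scaffolding described above, assuming Playwright's Python API: it captures a JPEG screenshot, labels interactive elements as a simplified Set-of-Marks, asks a model for a ReAct-style decision, and executes the chosen action. `query_model`, `set_of_marks`, and the action schema are placeholders, not the authors' implementation.

```python
from playwright.sync_api import sync_playwright

def set_of_marks(page):
    """Collect clickable elements and assign numeric marks (simplified SoM)."""
    elements = page.query_selector_all("a, button, input, [role='button']")
    return {i: el for i, el in enumerate(elements)}

def query_model(screenshot_bytes, marks):
    """Placeholder for a ReAct-style call to a vision-language model.
    It would return something like {'action': 'click', 'target': 3}."""
    raise NotImplementedError

def run_episode(url, max_steps=20):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_steps):
            shot = page.screenshot(type="jpeg")   # partial observation (POMDP)
            marks = set_of_marks(page)            # text-labelled interactive elements
            decision = query_model(shot, marks)   # reason, then act (ReAct)
            if decision["action"] == "click":
                marks[decision["target"]].click()
            elif decision["action"] == "type":
                marks[decision["target"]].fill(decision["text"])
            elif decision["action"] == "stop":
                break
        browser.close()
```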

Conclusion and Future Directions

The proposed benchmark revealed a substantial gap in performance between humans and AI in web interaction tasks. The highest-performing AI model, GPT-4o, achieved only 41.2% success, while humans reached 95.7%. These results demonstrate that current AI systems struggle with intuitive web interactions, and limitations on models like Claude Computer-Use hinder task success. This approach serves as a reference point for further research aimed at enhancing AI flexibility, reasoning, and efficiency in web interactions.

Next Steps for Businesses

Explore how artificial intelligence can transform your business operations:

  • Identify processes that can be automated and pinpoint customer interactions where AI can add significant value.
  • Establish key performance indicators (KPIs) to ensure your AI investments positively impact your business.
  • Select tools that align with your needs and allow for customization to meet your objectives.
  • Start with a small project, gather data on its effectiveness, and gradually expand your AI initiatives.

If you need assistance in managing AI within your business, contact us at hello@itinai.ru. Connect with us on Telegram, X, and LinkedIn.


AI Products for Business or Try Custom Development

AI Sales Bot

Welcome AI Sales Bot, your 24/7 teammate! Engaging customers in natural language across all channels and learning from your materials, it is a step towards efficient, enriched customer interactions and sales.

AI Document Assistant

Unlock insights and drive decisions with our AI Insights Suite. By indexing your documents and data, it provides smart, AI-driven decision support that enhances your productivity.

AI Customer Support

Upgrade your support with our AI Assistant, reducing response times and personalizing interactions by analyzing documents and past engagements. Boost both your team and customer satisfaction.

AI Scrum Bot

Enhance agile management with our AI Scrum Bot: it helps organize retrospectives, answers queries, and boosts collaboration and efficiency in your scrum processes.