Webwright Boosts Web Agent Scores from 33.5% to 60.1% – See How

Most web agents today take a single browser action at a time: they receive a screenshot or DOM text, predict the next click, keypress or scroll, and repeat. This step‑by‑step loop made sense when language models had limited reasoning ability, but now that models can write and debug code, the rigid action‑at‑a‑time design has become a bottleneck. It forces the agent to repeat low‑level predictions for tasks that could be expressed as a short program, which makes runs inefficient, brittle and hard to reuse.

Microsoft Research’s AI Frontiers lab introduced Webwright to solve this problem. Webwright replaces the continuous browser session with a terminal‑native loop. The agent writes Playwright code, runs bash commands, inspects logs and screenshots, and iteratively refines the script. The persistent artifact is the code and logs stored in a local workspace, not a live browser state. This mirrors how a developer builds an RPA script: write once, run many times, adapt and share.
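To make the "code and logs as the persistent artifact" idea concrete, here is a minimal sketch of what one agent-written attempt might look like, using Playwright's Python sync API. The workspace layout, the file naming and the function names `artifact_paths` and `run_attempt` are assumptions for illustration, not Webwright's actual conventions.

```python
from pathlib import Path


def artifact_paths(workspace: Path, step: int) -> dict:
    """Return the log and screenshot paths for one attempt (hypothetical layout)."""
    return {
        "log": workspace / f"step_{step:03d}.log",
        "screenshot": workspace / f"step_{step:03d}.png",
    }


def run_attempt(url: str, workspace: Path, step: int) -> str:
    """Open a page with Playwright, save a screenshot and a log, return the title."""
    # Imported here so artifact_paths stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    workspace.mkdir(parents=True, exist_ok=True)
    paths = artifact_paths(workspace, step)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        page.screenshot(path=str(paths["screenshot"]))
        browser.close()
    # The log and screenshot survive after the browser is gone.
    paths["log"].write_text(f"url={url}\ntitle={title}\n", encoding="utf-8")
    return title
```

With Playwright and a Chromium build installed, calling `run_attempt("https://example.com", Path("workspace"), step=1)` would leave `step_001.png` and `step_001.log` in the workspace for the next iteration to inspect.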

The system consists of three lightweight components: a Runner (~150 lines), a Model Endpoint (~550 lines) and a terminal Environment (~300 lines). The Runner sends context to the model, and the model returns a thinking block and a shell command. The Environment executes the command and returns output, logs, screenshots or errors, which feed back into the next iteration. To prevent premature completion claims, the agent must generate a self‑reflection config, run a final verification script in a clean folder and pass its own success check before marking the task done. To keep context length manageable, the history is summarized every 20 steps.
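The loop described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not Webwright's code: the `DONE` stop signal, the `ModelReply` shape, the placeholder `summarize` and the `scripted_model` stand-in are all hypothetical. Only the 20-step summarization interval and the verify-before-done rule come from the article.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List

SUMMARIZE_EVERY = 20  # the article states history is summarized every 20 steps
DONE = "DONE"         # hypothetical stop signal, not Webwright's actual protocol


@dataclass
class ModelReply:
    thinking: str  # the model's reasoning text
    command: str   # one shell command, or DONE


def run_in_terminal(command: str) -> str:
    """Environment: run one shell command and return its exit code and output."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=60)
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


def summarize(history: List[str]) -> List[str]:
    """Placeholder compaction that keeps the task line; a real harness
    would presumably ask the model for a summary instead."""
    return [history[0], f"[summary of {len(history) - 1} later entries]"]


def verify_in_clean_dir(script: str) -> bool:
    """Run a verification script in a fresh temporary folder; exit code 0 passes."""
    with tempfile.TemporaryDirectory() as clean:
        path = Path(clean) / "verify.py"
        path.write_text(script, encoding="utf-8")
        proc = subprocess.run([sys.executable, str(path)], cwd=clean,
                              capture_output=True, text=True, timeout=60)
        return proc.returncode == 0


def run_task(task, model, execute=run_in_terminal,
             verify=lambda: True, max_steps=100) -> List[str]:
    """Runner: alternate model calls and command execution until verified done."""
    history = [f"TASK: {task}"]
    for step in range(1, max_steps + 1):
        reply = model(history)
        if reply.command == DONE:
            if verify():
                break  # completion is accepted only after the check passes
            history.append("verification failed; task not marked done")
            continue
        history.append(f"$ {reply.command}\n{execute(reply.command)}")
        if step % SUMMARIZE_EVERY == 0:
            history = summarize(history)
    return history


def scripted_model(commands):
    """Stand-in for the Model Endpoint: replays fixed commands, then signals DONE."""
    queue = list(commands)

    def model(history):
        return ModelReply("scripted", queue.pop(0) if queue else DONE)

    return model
```

For example, `run_task("demo", scripted_model(["echo hi"]))` runs one shell command and stops at the next turn. Passing `verify=lambda: verify_in_clean_dir(script_text)` makes the loop reject a completion claim until the script exits cleanly in an empty folder.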

Evaluations show clear benefits. On Online‑Mind2Web, GPT‑5.4 with Webwright scores 86.7% accuracy. On the long‑horizon Odysseys benchmark, Webwright reaches 60.1%, up from the 33.5% scored by a screenshot‑only baseline agent: a 26.6‑point absolute gain and a 79.4% relative improvement. Smaller models such as Qwen3.5‑9B achieve 66.2% when paired with reusable tool scripts, suggesting that cost‑effective models can handle complex web tasks when given the right coding framework.
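The two gain figures can be checked with quick arithmetic, assuming the 33.5% in the headline is the baseline that the 60.1% Odysseys score is compared against. The helper name `gains` is just for illustration.

```python
def gains(baseline: float, score: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = score - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


print(gains(33.5, 60.1))  # (26.6, 79.4)
```

The absolute gain is a difference in percentage points, while the relative gain divides that difference by the baseline, which is why the second number is so much larger.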

Webwright’s harness is about 1,000 lines total, requires no multi‑agent orchestration, and produces shareable CLI scripts that work with Claude Code, Codex and OpenClaw. By shifting from action‑by‑action prediction to code‑driven terminal interaction, developers and AI practitioners gain a more reliable, reusable and scalable way to automate web tasks.

#AI #Productivity #Automation #LLM #WebAgents #OpenSource

Vladimir Dyachkov, Ph.D.
Editor-in-Chief itinai.com

I believe that AI is only as powerful as the human insight guiding it.