Webwright Boosts Web Agent Scores from 33.5% to 60.1% – See How

Most web agents today take a single browser action at a time: they receive a screenshot or DOM text, predict the next click, keypress or scroll, and repeat. This step‑by‑step loop made sense when language models had limited reasoning ability, but now that models can write and debug code, the rigid action‑at‑a‑time design has become a bottleneck. It forces the agent to repeat low‑level predictions for tasks that could be expressed as a short program, which makes runs slow and brittle and leaves little that can be reused.

Microsoft Research’s AI Frontiers lab introduced Webwright to solve this problem. Webwright replaces the continuous browser session with a terminal‑native loop. The agent writes Playwright code, runs bash commands, inspects logs and screenshots, and iteratively refines the script. The persistent artifact is the code and logs stored in a local workspace, not a live browser state. This mirrors how a developer builds an RPA script: write once, run many times, adapt and share.

The system consists of three lightweight components: a Runner (~150 lines), a Model Endpoint (~550 lines) and a terminal Environment (~300 lines). The Runner sends context to the model, and the model returns a thinking block and a shell command. The Environment executes the command and returns output, logs, screenshots or errors, which feed back into the loop. To prevent premature completion claims, the agent must generate a self‑reflection config, run a final verification script in a clean folder and pass its own success check before marking the task done. To keep context length manageable, the history is summarized every 20 steps.
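
The loop described above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not Webwright's actual code: the names `Step`, `Environment`, `summarize`, `run_task`, `scripted_model` and `verify` are hypothetical, the model is replaced by a scripted stand‑in, and the self‑reflection config plus clean‑folder rerun are reduced to a single `verify` callback.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List, Optional

SUMMARIZE_EVERY = 20  # the post says history is summarized every 20 steps


@dataclass
class Step:
    """One model turn (hypothetical shape): reasoning plus a shell command."""
    thinking: str
    command: Optional[str]  # None means the model claims the task is done


@dataclass
class Environment:
    """Runs shell commands inside a local workspace directory."""
    workspace: str

    def execute(self, command: str, timeout: int = 60) -> str:
        result = subprocess.run(
            command, shell=True, cwd=self.workspace,
            capture_output=True, text=True, timeout=timeout,
        )
        return f"exit={result.returncode}\n{result.stdout}{result.stderr}"


def summarize(history: List[str]) -> List[str]:
    """Placeholder: a real summarizer would ask the model to condense history."""
    return [f"[summary of {len(history)} earlier entries]"]


def run_task(
    task: str,
    model: Callable[[List[str]], Step],
    env: Environment,
    verify: Callable[[Environment], bool],
    max_steps: int = 100,
) -> bool:
    history = [f"TASK: {task}"]
    for step_no in range(1, max_steps + 1):
        step = model(history)
        if step.command is None:
            # A completion claim only counts if the success check passes.
            if verify(env):
                return True
            history.append("Verification failed; keep working.")
            continue
        output = env.execute(step.command)
        history.append(f"$ {step.command}\n{output}")
        if step_no % SUMMARIZE_EVERY == 0:
            # Keep the task statement, compress everything after it.
            history = [history[0]] + summarize(history[1:])
    return False


def scripted_model(history: List[str]) -> Step:
    """Stand-in for the model endpoint: writes a file, then claims completion."""
    if not any("echo done > result.txt" in entry for entry in history):
        return Step("write the result file", "echo done > result.txt")
    return Step("file written, claim completion", None)


def verify(env: Environment) -> bool:
    """Stand-in success check: the result file must exist and contain 'done'."""
    return "done" in env.execute("cat result.txt")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        print(run_task("write a result file", scripted_model, Environment(workspace), verify))
```

The design point the sketch tries to capture is that state lives in the workspace files and the text history, so the loop itself stays small and the completion claim is gated by a check the agent cannot skip.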

Evaluations show clear benefits. On Online‑Mind2Web, GPT‑5.4 with Webwright scores 86.7% accuracy. On the long‑horizon Odysseys benchmark, Webwright reaches 60.1%, up from the 33.5% of a screenshot‑only baseline agent: a 26.6‑point absolute gain and a 79.4% relative improvement. Smaller models like Qwen3.5‑9B achieve 66.2% when paired with reusable tool scripts, which suggests that cost‑effective models can handle complex web tasks when given the right coding framework.
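
As a quick sanity check on the Odysseys numbers, the absolute and relative gains can be recomputed from the two scores. The sketch below assumes the baseline is the 33.5% quoted in the headline; the function names are illustrative only.

```python
def absolute_gain(baseline: float, score: float) -> float:
    """Gain in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


def relative_gain(baseline: float, score: float) -> float:
    """Gain as a percentage of the baseline, rounded to one decimal."""
    return round(100 * (score - baseline) / baseline, 1)


if __name__ == "__main__":
    # Assumed inputs: 33.5% baseline and 60.1% Webwright score on Odysseys.
    print(absolute_gain(33.5, 60.1), relative_gain(33.5, 60.1))  # prints "26.6 79.4"
```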

Webwright’s harness is about 1,000 lines total, requires no multi‑agent orchestration, and produces shareable CLI scripts that work with Claude Code, Codex and OpenClaw. By shifting from action‑by‑action prediction to code‑driven terminal interaction, developers and AI practitioners gain a more reliable, reusable and scalable way to automate web tasks.

#AI #Productivity #Automation #LLM #WebAgents #OpenSource