We introduce WebStep, a benchmark for process-level evaluation of web agents. WebStep contains 1,800 task instances across 10 self-hosted websites with controlled difficulty, and is designed to move beyond terminal success as the sole measure of performance. Instead of evaluating only whether an agent reaches the correct final outcome, WebStep makes it possible to analyze how agents search, decide, and fail throughout an interaction trajectory.
The semantic MDP records states and transitions in the background as the agent interacts with the GUI; skill labels, coverage, and efficiency metrics are then computed from the recorded trajectory with no manual annotation.
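As a sketch of how such metrics could be derived from a recorded trajectory (the step schema, field names, and skill labels below are illustrative assumptions, not WebStep's actual API), each semantic step can carry a skill label, a correctness flag, and a count of the raw GUI actions it subsumes:

```python
from dataclasses import dataclass

@dataclass
class SemanticStep:
    # Hypothetical record of one semantic transition; field names are
    # illustrative, not WebStep's actual schema.
    skill: str        # e.g. "exploration" or "execution"
    correct: bool     # did this step advance toward the goal?
    gui_actions: int  # raw GUI actions collapsed into this semantic step
    state: str        # identifier of the semantic state reached

def process_metrics(trajectory, reachable_states):
    """Compute per-skill accuracy, state coverage, and step efficiency."""
    by_skill = {}
    for step in trajectory:
        by_skill.setdefault(step.skill, []).append(step.correct)
    accuracy = {skill: sum(flags) / len(flags) for skill, flags in by_skill.items()}
    coverage = len({s.state for s in trajectory}) / len(reachable_states)
    gui_steps = sum(s.gui_actions for s in trajectory)
    return {
        "accuracy": accuracy,
        "coverage": coverage,
        "gui_steps": gui_steps,
        "semantic_steps": len(trajectory),
        "gui_per_semantic": gui_steps / len(trajectory),
    }
```

Under this framing, the per-skill accuracies correspond to the exploration and execution columns below, and `gui_per_semantic` to the GUI/Semantic ratio.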
| Agent | Terminal (%) | Exploration (%) | Execution (%) | Coverage (%) | GUI Steps | Semantic Steps | GUI/Semantic |
|---|---|---|---|---|---|---|---|
| OpenAI CUA | 82.2 | 87.7 | 86.2 | 71.0 | 19.7 | 10.0 | 2.0 |
| Qwen3.5-122B | 57.9 | 66.1 | 73.7 | 67.7 | 22.1 | 9.8 | 2.3 |
| UI-TARS-1.5-7B | 32.6 | 46.3 | 50.6 | 62.6 | 35.0 | 14.0 | 2.5 |
| GUI-Owl-1.5-8B | 31.9 | 43.6 | 55.6 | 61.9 | 28.3 | 10.4 | 2.7 |
| Fara-7B | 31.4 | 43.6 | 55.7 | 60.6 | 18.9 | 8.3 | 2.3 |
Terminal success alongside process-level metrics from semantic MDP traces. Process metrics reveal behavioral differences not visible from terminal success alone.
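The GUI/Semantic column is a derived efficiency measure: the ratio of raw GUI steps to semantic steps, rounded to one decimal. It can be recomputed directly from the two step columns above:

```python
# (agent, gui_steps, semantic_steps, reported_ratio) taken from the table above
rows = [
    ("OpenAI CUA",     19.7, 10.0, 2.0),
    ("Qwen3.5-122B",   22.1,  9.8, 2.3),
    ("UI-TARS-1.5-7B", 35.0, 14.0, 2.5),
    ("GUI-Owl-1.5-8B", 28.3, 10.4, 2.7),
    ("Fara-7B",        18.9,  8.3, 2.3),
]
for agent, gui, sem, reported in rows:
    # each reported ratio matches gui_steps / semantic_steps to one decimal
    assert round(gui / sem, 1) == reported, agent
```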
- At comparable terminal success (31-33%), agents diverge in exploration vs. execution accuracy.
- Within the same domain, OpenAI CUA beats Qwen3.5 by 23.7% on commits but trails it by 15.6% on filtering.
- GUI-Owl breaks down at filtering; Qwen3.5 at inspection.
- Agents perform similarly on easy tasks but diverge sharply as task complexity grows.
Terminal outcome alone cannot distinguish qualitatively different failures: two agents may fail the same task for entirely different reasons, one never locating the target, the other reaching it but executing the wrong action.
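One way to operationalize this distinction is to attribute a failed episode to the first skill phase, in temporal order, whose steps fall below a correctness threshold. This is a sketch only; the skill labels and the 0.5 threshold are assumptions, not WebStep's definitions:

```python
def attribute_failure(trajectory, threshold=0.5):
    """Attribute a failed episode to the first skill phase that breaks down.

    `trajectory` is a list of (skill, correct) pairs in temporal order,
    e.g. [("exploration", True), ("execution", False)].  The skill labels
    and the 0.5 threshold are illustrative assumptions.
    """
    # Aggregate (total, correct) counts per skill.
    stats = {}
    for skill, correct in trajectory:
        total, good = stats.get(skill, (0, 0))
        stats[skill] = (total + 1, good + int(correct))
    # Return the first skill encountered whose accuracy is below threshold.
    for skill, _ in trajectory:
        total, good = stats[skill]
        if good / total < threshold:
            return skill
    return "unattributed"
```

With this rule, an agent that never locates the target is attributed an exploration failure, while one that reaches the target but acts incorrectly is attributed an execution failure, even though both episodes terminate unsuccessfully.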
@article{webstep2026,
  title={Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking},
  author={TBD},
  journal={arXiv preprint},
  year={2026}
}