
WebStep

Process-Level Evaluation of Web Agents with Semantic State Tracking

Yonsei University   Microsoft Research
Figure: WebStep method overview.

Each website is paired with a semantic MDP that records high-level states and transitions in the background while the agent interacts only with the GUI. This enables automatic process-level analysis without manual annotation.

Overview

We introduce WebStep, a benchmark for process-level evaluation of web agents. WebStep contains 1,800 task instances across 10 self-hosted websites with controlled difficulty, and is designed to move beyond terminal success as the sole measure of performance. Instead of evaluating only whether an agent reaches the correct final outcome, WebStep makes it possible to analyze how agents search, decide, and fail throughout an interaction trajectory.


Automatic Trajectory Analysis

The semantic MDP records states and transitions in the background as the agent interacts with the GUI. Skill labels, coverage, and efficiency are computed from the recorded trajectory with no manual annotation.
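As a concrete illustration, the metrics above can be derived purely from a recorded trace. The sketch below is hypothetical (state names, trace format, and the exact metric definitions are assumptions, not WebStep's actual implementation): coverage is the fraction of required semantic states the agent visited, semantic steps are transitions in the trace, and the GUI/semantic ratio divides low-level actions by those transitions.

```python
# Hypothetical sketch of process-level metrics from a semantic trace.
# State names and metric definitions are illustrative only.

def process_metrics(semantic_trace, required_states, gui_steps):
    visited = set(semantic_trace)
    # Fraction of task-relevant semantic states the agent reached.
    coverage = len(visited & set(required_states)) / len(required_states)
    # Transitions between consecutive semantic states.
    semantic_steps = len(semantic_trace) - 1
    # How many low-level GUI actions each semantic transition costs.
    gui_per_semantic = gui_steps / semantic_steps if semantic_steps else float("inf")
    return {"coverage": coverage,
            "semantic_steps": semantic_steps,
            "gui_per_semantic": round(gui_per_semantic, 1)}

trace = ["home", "search_results", "item_page", "cart", "checkout"]
required = ["search_results", "item_page", "checkout"]
metrics = process_metrics(trace, required, gui_steps=8)
```

Nothing here requires human labels: once the semantic MDP logs the trace, every quantity is a function of the recording.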


Benchmark Overview

10 Websites · 1,800 Task Instances · 5 Skill Categories · 5 Agents Evaluated

Domains: Mail, Calendar, Shopping, Accommodation, Food Delivery, Housing, Coding QA, Code Repo, Job Network, Team Chat
Task distribution across domains

Leaderboard

Agent           Terminal (%)  Exploration (%)  Execution (%)  Coverage (%)  GUI Steps  Semantic Steps  GUI/Semantic
OpenAI CUA      82.2          87.7             86.2           71.0          19.7       10.0            2.0
Qwen3.5-122B    57.9          66.1             73.7           67.7          22.1       9.8             2.3
UI-TARS-1.5-7B  32.6          46.3             50.6           62.6          35.0       14.0            2.5
GUI-Owl-1.5-8B  31.9          43.6             55.6           61.9          28.3       10.4            2.7
Fara-7B         31.4          43.6             55.7           60.6          18.9       8.3             2.3

Terminal success alongside process-level metrics from semantic MDP traces. Process metrics reveal behavioral differences not visible from terminal success alone.


Key Findings

1. Same score, different behavior. Agents clustered at 31-33% terminal success diverge sharply in exploration vs. execution accuracy.

2. Skill-specific weaknesses. OpenAI CUA beats Qwen3.5 by 23.7% on commits but trails it by 15.6% on filtering, within the same domain.

3. Agent-specific decisive errors. GUI-Owl most often diverges at filtering steps; Qwen3.5 at inspection steps.

4. Gaps widen with difficulty. Agents look similar on easy tasks but differ sharply as task complexity grows.

Terminal outcome hides different failures

Terminal outcome alone cannot distinguish qualitatively different failures. Two agents can fail the same task for entirely different reasons: one never locates the target, while the other reaches it but executes the wrong action.
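This distinction falls out of the semantic trace directly. The sketch below is a hypothetical illustration (the labels, trace format, and decision rule are assumptions, not WebStep's actual classifier): if the target state never appears in the trace, the failure is one of exploration; if it appears but the episode still ends short of the goal, the failure is one of execution.

```python
# Hypothetical sketch: separating exploration failures (target state
# never reached) from execution failures (target reached, wrong action
# taken there). Labels and trace format are illustrative only.

def classify_failure(semantic_trace, target_state, final_state, goal_state):
    if final_state == goal_state:
        return "success"
    if target_state not in semantic_trace:
        return "exploration_failure"   # never located the target
    return "execution_failure"         # found it, then acted incorrectly

# Agent A wanders without ever finding the target page.
a = classify_failure(["home", "search", "wrong_page"], "item_page",
                     final_state="wrong_page", goal_state="purchased")
# Agent B reaches the target but takes the wrong action there.
b = classify_failure(["home", "search", "item_page"], "item_page",
                     final_state="item_page", goal_state="purchased")
```

Both agents register the same terminal failure, yet the trace assigns them different diagnoses.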


BibTeX

@article{webstep2026,
  title={Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking},
  author={TBD},
  journal={arXiv preprint},
  year={2026}
}