We introduce WebStep, a benchmark for process-level evaluation of web agents. WebStep contains 1,800 task instances across 10 self-hosted websites with controlled difficulty, and is designed to move beyond terminal success as the sole measure of performance. Instead of evaluating only whether an agent reaches the correct final outcome, WebStep makes it possible to analyze how agents search, decide, and fail throughout an interaction trajectory.
The semantic MDP records states and transitions in the background as the agent interacts with the GUI; skill labels, coverage, and efficiency metrics are then computed from the recorded trajectory with no manual annotation.
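As a sketch of how such metrics could be derived from a recorded trajectory (the step schema, field names, and skill labels below are illustrative assumptions, not WebStep's actual API), each semantic step can carry a skill label, a correctness flag, and a count of the raw GUI actions it subsumes:

```python
from dataclasses import dataclass

@dataclass
class SemanticStep:
    # Hypothetical record of one semantic transition; field names are
    # illustrative, not WebStep's actual schema.
    skill: str        # e.g. "exploration" or "execution"
    correct: bool     # did this step advance toward the goal?
    gui_actions: int  # raw GUI actions collapsed into this semantic step
    state: str        # identifier of the semantic state reached

def process_metrics(trajectory, reachable_states):
    """Compute per-skill accuracy, state coverage, and step efficiency."""
    by_skill = {}
    for step in trajectory:
        by_skill.setdefault(step.skill, []).append(step.correct)
    accuracy = {skill: sum(flags) / len(flags) for skill, flags in by_skill.items()}
    coverage = len({s.state for s in trajectory}) / len(reachable_states)
    gui_steps = sum(s.gui_actions for s in trajectory)
    return {
        "accuracy": accuracy,
        "coverage": coverage,
        "gui_steps": gui_steps,
        "semantic_steps": len(trajectory),
        "gui_per_semantic": gui_steps / len(trajectory),
    }
```

Under this framing, the per-skill accuracies correspond to the exploration and execution columns below, and `gui_per_semantic` to the GUI/Semantic ratio.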
| Agent | Terminal (%) | Exploration (%) | Execution (%) | Coverage (%) | GUI Steps | Semantic Steps | GUI/Semantic |
|---|---|---|---|---|---|---|---|
| OpenAI CUA | 82.2 | 87.7 | 86.2 | 71.0 | 19.7 | 10.0 | 2.0 |
| Qwen3.5-122B | 57.9 | 66.1 | 73.7 | 67.7 | 22.1 | 9.8 | 2.3 |
| UI-TARS-1.5-7B | 32.6 | 46.3 | 50.6 | 62.6 | 35.0 | 14.0 | 2.5 |
| GUI-Owl-1.5-8B | 31.9 | 43.6 | 55.6 | 61.9 | 28.3 | 10.4 | 2.7 |
| Fara-7B | 31.4 | 43.6 | 55.7 | 60.6 | 18.9 | 8.3 | 2.3 |
Terminal success alongside process-level metrics from semantic MDP traces. Process metrics reveal behavioral differences not visible from terminal success alone.
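The GUI/Semantic column is a derived efficiency measure: the ratio of raw GUI steps to semantic steps, rounded to one decimal. It can be recomputed directly from the two step columns above:

```python
# (agent, gui_steps, semantic_steps, reported_ratio) taken from the table above
rows = [
    ("OpenAI CUA",     19.7, 10.0, 2.0),
    ("Qwen3.5-122B",   22.1,  9.8, 2.3),
    ("UI-TARS-1.5-7B", 35.0, 14.0, 2.5),
    ("GUI-Owl-1.5-8B", 28.3, 10.4, 2.7),
    ("Fara-7B",        18.9,  8.3, 2.3),
]
for agent, gui, sem, reported in rows:
    # each reported ratio matches gui_steps / semantic_steps to one decimal
    assert round(gui / sem, 1) == reported, agent
```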
- At comparable terminal success (31-33%), agents diverge in exploration vs. execution accuracy.
- Within the same domain, OpenAI CUA beats Qwen3.5 by 23.7% on commits but trails it by 15.6% on filtering.
- GUI-Owl breaks down at filtering; Qwen3.5 at inspection.
- Agents perform similarly on easy tasks but diverge sharply as task complexity grows.
Terminal outcome alone cannot distinguish qualitatively different failures: two agents may fail the same task for entirely different reasons, one never locating the target, the other reaching it but executing the wrong action.
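One way to operationalize this distinction is to attribute a failed episode to the first skill phase, in temporal order, whose steps fall below a correctness threshold. This is a sketch only; the skill labels and the 0.5 threshold are assumptions, not WebStep's definitions:

```python
def attribute_failure(trajectory, threshold=0.5):
    """Attribute a failed episode to the first skill phase that breaks down.

    `trajectory` is a list of (skill, correct) pairs in temporal order,
    e.g. [("exploration", True), ("execution", False)].  The skill labels
    and the 0.5 threshold are illustrative assumptions.
    """
    # Aggregate (total, correct) counts per skill.
    stats = {}
    for skill, correct in trajectory:
        total, good = stats.get(skill, (0, 0))
        stats[skill] = (total + 1, good + int(correct))
    # Return the first skill encountered whose accuracy is below threshold.
    for skill, _ in trajectory:
        total, good = stats[skill]
        if good / total < threshold:
            return skill
    return "unattributed"
```

With this rule, an agent that never locates the target is attributed an exploration failure, while one that reaches the target but acts incorrectly is attributed an execution failure, even though both episodes terminate unsuccessfully.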
@article{webstep2026,
  title={Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking},
  author={TBD},
  journal={arXiv preprint},
  year={2026}
}