Benchmark | Tracked since May 21, 2026

Introduce daily task evaluation pipeline for browser-use agents

This change adds a dedicated daily evaluation module so teams can benchmark browser-agent runs in a repeatable, task-card-driven way instead of using ad-hoc scripts, making quality drift easier to detect over time.

daily_task_eval | task cards | run-agent CLI | compare CLI

What Happened

This change adds a dedicated daily evaluation module so teams can benchmark browser-agent runs in a repeatable, task-card-driven way instead of using ad-hoc scripts, making quality drift easier to detect over time.
1 evidence item attached for review.

What is Different

Before

Scattered source updates, isolated context, and manual follow-up across multiple feeds.

Now

Introduces a concrete, reusable evaluation harness for browser-use by adding a standalone experiment module plus CLI orchestration (`init`, `run-agent`, `compare`) that executes predefined task cards, supports multiple model backends/presets, and produces comparable result artifacts.
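As a rough illustration of what "comparable result artifacts" could mean in practice, the sketch below pairs results from two runs by task card and reports per-task differences. It is not the PR's code: the `TaskResult` fields, the `compare` helper, and the task identifiers are all hypothetical, chosen only to show the idea of a side-by-side comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # All fields are hypothetical; the PR's actual artifact schema may differ.
    task_id: str   # identifier of the task card that was executed
    runner: str    # e.g. "human" or a preset label such as "A"
    success: bool  # whether the task was judged complete
    steps: int     # number of actions taken


def compare(baseline, candidate):
    """Pair results by task_id and report per-task differences."""
    base = {r.task_id: r for r in baseline}
    rows = []
    for r in candidate:
        b = base.get(r.task_id)
        if b is None:
            continue  # task card has no baseline result; nothing to compare
        rows.append({
            "task_id": r.task_id,
            "baseline_success": b.success,
            "candidate_success": r.success,
            "step_delta": r.steps - b.steps,
        })
    return rows


# Made-up example data for two task cards.
human = [
    TaskResult("book-flight", "human", True, 12),
    TaskResult("find-invoice", "human", True, 7),
]
agent = [
    TaskResult("book-flight", "A", True, 15),
    TaskResult("find-invoice", "A", False, 20),
]
rows = compare(human, agent)
```

Keying the comparison on the task card identifier means both runs are judged on the same tasks, which is what makes day-over-day results comparable.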

Why Track This

Why It Matters

Evaluation operators and model teams can now run the same daily browser-agent tasks through one CLI workflow and compare human-vs-agent outcomes side by side, so regressions in automation quality are easier to catch before they spread into broader evaluation or deployment workflows. The next thing to watch is whether score variance stays stable across repeated runs and provider backends. The pull request adds a self-contained `browser_use/experiments/daily_task_eval` package with `A/B/C/D` presets, navigator options (including periodic replanning), and structured JSON/CSV outputs (`history`, `conversation`, summaries). Trends built on these outputs should still be checked for noise from replan timing and for preset or provider defaults that could skew interpretation.
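One way to check whether score variance stays stable is to compute the spread of scores per preset across repeated runs and flag presets above a tolerance. The sketch below assumes scores have already been extracted from the summary outputs into plain floats. The `score_spread` and `unstable` helpers, the 0.05 threshold, and the sample numbers are illustrative assumptions, not part of the PR.

```python
from statistics import mean, pstdev


def score_spread(scores_by_preset):
    """Summarize repeated-run scores for each preset label."""
    return {
        preset: {
            "mean": mean(scores),
            "stdev": pstdev(scores),  # population standard deviation
            "runs": len(scores),
        }
        for preset, scores in scores_by_preset.items()
    }


def unstable(spread, max_stdev=0.05):
    """Return preset labels whose run-to-run spread exceeds max_stdev."""
    # The 0.05 default is an arbitrary illustrative tolerance.
    return sorted(p for p, s in spread.items() if s["stdev"] > max_stdev)


# Made-up scores from three repeated runs of two presets.
runs = {
    "A": [0.80, 0.80, 0.80],
    "B": [0.90, 0.50, 0.70],
}
spread = score_spread(runs)
flagged = unstable(spread)
```

A preset that is flagged here would make day-over-day trends hard to read, since a drop in score could come from run-to-run noise (for example, replan timing) rather than a real regression.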

Impact

What To Watch Next

Watch whether daily_task_eval becomes a repeated pattern.
Track follow-up changes around Evals and Benchmarks.
Compare future signals against this evidence trail.
Re-check risk flags: result_variance_across_repeated_runs, navigator_replan_timing_noise.

Risk flags: result_variance_across_repeated_runs / navigator_replan_timing_noise / preset_provider_default_side_effects / evaluation_artifact_schema_changes

Supporting Evidence

GITHUB PULL REQUEST | High Trust

browser-use/browser-use PR #4802: feat: daily task eval — multi-model, navigator, task cards

PR #4802 creates `daily_task_eval` with preset-based (`A/B/C/D`) experiments, multi-model execution, optional navigator control, and CLI flows for run/comparison output.