AgenticVBench v1.0 · Live

Leaderboard

Tasks: 100
Model × harness combos: 11
Best agent stack: 30%
Human expert reference: 89%

11 agents · sorted by Avg ↓

#	Agent	Harness	Avg↓	Repurpose	Seq	Repair	Assembly	Submitter	Date
·	Human experts [reference]	·	89%	95%	90%	88%	81%	20 industry experts	2026-05-01
2	GPT-5.5 [OSS]	OpenCode	30%	20%	29%	33%	36%	Philo Labs Research	2026-05-13
1	GPT-5.5	Codex	27%	19%	31%	21%	38%	Philo Labs Research	2026-05-13
7	Gemini 3.1 Pro	Gemini CLI	23%	20%	20%	19%	33%	Philo Labs Research	2026-05-13
3	Claude Opus 4.7 [OSS]	OpenCode	23%	19%	22%	25%	24%	Philo Labs Research	2026-05-13
4	Claude Opus 4.7	Claude Code	21%	18%	21%	23%	22%	Philo Labs Research	2026-05-13
8	Gemini 3.1 Pro [OSS]	OpenCode	21%	19%	17%	18%	30%	Philo Labs Research	2026-05-13
5	GPT-5.5 [OSS]	OpenClaw	20%	18%	23%	21%	18%	Philo Labs Research	2026-05-13
6	GPT-5.4-mini	Codex	18%	17%	18%	16%	19%	Philo Labs Research	2026-05-13
9	Gemini 3 Flash	Codex	13%	4%	13%	14%	22%	Philo Labs Research	2026-05-13
10	Qwen3-VL-235B [OSS]	OpenClaw	9%	8%	10%	12%	7%	Philo Labs Research	2026-05-13
11	Qwen3-VL-235B [OSS]	OpenCode	6%	6%	8%	10%	1%	Philo Labs Research	2026-05-13


About the scoring

Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.

Per-task-family scores are normalized to [0, 1] and shown as percentages in the table. Failed rollouts (runs in which the harness produced no usable output) are scored as 0.
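The page does not say how the three runs of a task are combined into a family score. As a minimal sketch, assuming a plain mean over rollouts and then over tasks, the zero-for-failure rule could look like the following. The function names and the use of `None` to mark a failed rollout are hypothetical, chosen only for illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def rollout_score(raw: Optional[float]) -> float:
    """Score one rollout; None marks a failed rollout with no usable output."""
    if raw is None:
        return 0.0  # failed rollouts are scored as 0
    # Clamp defensively so every score stays inside [0, 1].
    return min(1.0, max(0.0, raw))


def family_score(tasks: Sequence[Sequence[Optional[float]]]) -> float:
    """Mean over tasks of the mean over each task's rollouts (assumed aggregation)."""
    return mean(mean(rollout_score(r) for r in rollouts) for rollouts in tasks)


# Hypothetical task family with two tasks and three rollouts each.
example = [
    [0.6, None, 0.3],    # one failed rollout pulls the task mean down
    [None, None, None],  # every rollout failed, so the task scores 0
]
print(f"{family_score(example):.2f}")
```

Under this assumption a failed rollout is never dropped from the denominator, so a harness that crashes often is penalized even when its successful runs score well.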

The expert reference row is computed on the same task set by 20 industry professionals averaging 6 years of post-production experience.
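The page also does not state how the Avg column is derived. The published values, including the expert reference row, are consistent with an unweighted mean of the four family columns, rounded half up. The sketch below checks that assumption against the table. Because the family columns are themselves rounded, this is only a consistency check, not the benchmark's official formula.

```python
# (agent, harness) -> (Repurpose, Seq, Repair, Assembly, published Avg), all in percent.
ROWS = {
    ("Human experts", "reference"): (95, 90, 88, 81, 89),
    ("GPT-5.5", "OpenCode"): (20, 29, 33, 36, 30),
    ("GPT-5.5", "Codex"): (19, 31, 21, 38, 27),
    ("Gemini 3.1 Pro", "Gemini CLI"): (20, 20, 19, 33, 23),
    ("Claude Opus 4.7", "OpenCode"): (19, 22, 25, 24, 23),
    ("Claude Opus 4.7", "Claude Code"): (18, 21, 23, 22, 21),
    ("Gemini 3.1 Pro", "OpenCode"): (19, 17, 18, 30, 21),
    ("GPT-5.5", "OpenClaw"): (18, 23, 21, 18, 20),
    ("GPT-5.4-mini", "Codex"): (17, 18, 16, 19, 18),
    ("Gemini 3 Flash", "Codex"): (4, 13, 14, 22, 13),
    ("Qwen3-VL-235B", "OpenClaw"): (8, 10, 12, 7, 9),
    ("Qwen3-VL-235B", "OpenCode"): (6, 8, 10, 1, 6),
}


def mean_half_up(values):
    """Integer mean rounded half up, avoiding Python's round-half-to-even."""
    n = len(values)
    return (2 * sum(values) + n) // (2 * n)


# Collect every row whose published Avg differs from the assumed formula.
mismatches = [
    key for key, (*families, published) in ROWS.items()
    if mean_half_up(families) != published
]
print(mismatches)
```

Integer arithmetic is used here because Python's built-in `round` rounds halves to the nearest even number, which would turn a mean of 22.5 into 22 instead of the published 23.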