AgenticVBench v1.0 · Live

Leaderboard

100 tasks · 20 model × harness combos · 31% best agent stack · 89% human expert reference

20 agents · sorted by Avg ↓

#	Agent	Harness	Avg ↓	Repurpose	Seq	Repair	Assembly	Submitter	Date
·	Human experts (reference)	·	88.5%	95%	90%	88%	81%	20 industry experts	2026-05-01
1	GPT-5.5	Codex	31.0% ± 4.0	30%	26%	30%	38%	Philo Labs Research	2026-05-22
2	GPT-5.5	OpenCode	27.4% ± 3.5	27%	20%	27%	37%	Philo Labs Research	2026-05-22
3	Gemini 3.1 Pro	OpenCode	23.8% ± 3.7	23%	19%	20%	33%	Philo Labs Research	2026-05-22
4	Claude Opus 4.7	Claude Code	22.1% ± 3.5	30%	20%	17%	22%	Philo Labs Research	2026-05-22
5	GPT-5.5	OpenClaw	21.9% ± 2.9	20%	29%	21%	18%	Philo Labs Research	2026-05-22
6	Claude Opus 4.7	OpenClaw	21.1% ± 3.4	18%	19%	25%	22%	Philo Labs Research	2026-05-22
7	Claude Opus 4.7	OpenCode	20.3% ± 3.1	23%	17%	18%	24%	Philo Labs Research	2026-05-22
8	Gemini 3.1 Pro	Gemini CLI	20.1% ± 3.8	24%	12%	25%	19%	Philo Labs Research	2026-05-22
9	Gemini 3 Flash	OpenCode	17.5% ± 2.7	22%	11%	15%	22%	Philo Labs Research	2026-05-22
10	Claude Sonnet 4.6	Claude Code	16.6% ± 3.1	21%	12%	13%	20%	Philo Labs Research	2026-05-22
11	GPT-5.4-mini	Codex	14.5% ± 2.7	24%	6%	13%	16%	Philo Labs Research	2026-05-22
12	Claude Sonnet 4.6	OpenCode	14.4% ± 2.4	26%	9%	6%	17%	Philo Labs Research	2026-05-22
13	Claude Sonnet 4.6	OpenClaw	13.7% ± 2.5	16%	13%	14%	13%	Philo Labs Research	2026-05-22
14	GPT-5.4-mini	OpenCode	13.7% ± 2.4	24%	6%	12%	13%	Philo Labs Research	2026-05-22
15	Gemini 3 Flash	Gemini CLI	13.6% ± 2.7	20%	3%	12%	19%	Philo Labs Research	2026-05-22
16	GPT-5.4-mini	OpenClaw	7.7% ± 1.3	14%	3%	7%	7%	Philo Labs Research	2026-05-22
17	Gemini 3 Flash	OpenClaw*	5.5% ± 1.4	2%	5%	5%	9%	Philo Labs Research	2026-05-22
18	Qwen3-VL-235B (OSS)	OpenClaw	3.3% ± 1.1	1%	2%	2%	7%	Philo Labs Research	2026-05-22
19	Qwen3-VL-235B (OSS)	OpenCode	3.0% ± 1.2	4%	1%	6%	1%	Philo Labs Research	2026-05-22
20	Gemini 3.1 Pro	OpenClaw*	0.4% ± 0.2	0%	0%	2%	0%	Philo Labs Research	2026-05-22

Want your agent on this board? Submit your run →

* Scores for Gemini models in the OpenClaw harness are severely depressed by persistent request timeouts.

About the scoring

Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.

Per-task-family scores are normalized to [0, 1]. Failed rollouts (cells where the harness produced no usable output) are scored as 0.
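As a rough illustration of the rule above, the sketch below averages rollout scores per task family and counts failed rollouts as 0. The record layout, the function names, and the unweighted mean across families are all assumptions made for this example; the page does not publish the exact aggregation. The reference row's 88.5% does equal the unweighted mean of its four family columns, which is consistent with this reading but does not confirm it.

```python
from statistics import mean
from typing import Optional

# Hypothetical rollout record: (task_family, task_id, score).
# A score of None stands for a failed rollout (no usable output).
Rollout = tuple[str, str, Optional[float]]


def family_scores(rollouts: list[Rollout]) -> dict[str, float]:
    """Mean score per task family, with failed rollouts counted as 0."""
    by_family: dict[str, list[float]] = {}
    for family, _task_id, score in rollouts:
        value = 0.0 if score is None else score
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {value} is outside [0, 1]")
        by_family.setdefault(family, []).append(value)
    return {family: mean(values) for family, values in by_family.items()}


def overall_average(per_family: dict[str, float]) -> float:
    """Unweighted mean across families (an assumed aggregation)."""
    return mean(list(per_family.values()))


# Made-up scores: one task per family, three rollouts each.
example = [
    ("Repair", "task-01", 0.5),
    ("Repair", "task-01", None),  # failed rollout, scored as 0
    ("Repair", "task-01", 0.25),
    ("Seq", "task-02", 0.5),
    ("Seq", "task-02", 0.5),
    ("Seq", "task-02", 0.5),
]

per_family = family_scores(example)
print(per_family, overall_average(per_family))
```

Counting a failure as 0 rather than dropping it means a harness that times out often is penalized in proportion to how often it fails.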

The expert reference row is computed on the same task set by the 20 industry professionals who authored the tasks and rubrics, each completing tasks written by other experts. The experts average 6 years of post-production experience.

Repurpose is scored by a grader model aligned to expert judgement. To reduce the risk of agents hacking the grader, we are not releasing the grader model at launch.

v1.0 (Post-Production) is the first of four benchmarks for the creative industries.
