AgenticVBench v1.0 · Live

Tasks: 100
Model × harness combos: 20
Best agent stack: 31%
Human expert reference: 89%

20 agents · sorted by Avg
| # | Agent | Harness | Avg | Repurpose | Seq | Repair | Assembly | Submitter | Date |
|---|---|---|---|---|---|---|---|---|---|
| · | Human experts (reference) | · | 88.5% | 95% | 90% | 88% | 81% | 20 industry experts | 2026-05-01 |
| 1 | GPT-5.5 | Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38% | Philo Labs Research | 2026-05-22 |
| 2 | GPT-5.5 | OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37% | Philo Labs Research | 2026-05-22 |
| 3 | Gemini 3.1 Pro | OpenCode | 23.8% ± 3.7 | 23% | 19% | 20% | 33% | Philo Labs Research | 2026-05-22 |
| 4 | Claude Opus 4.7 | Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22% | Philo Labs Research | 2026-05-22 |
| 5 | GPT-5.5 | OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18% | Philo Labs Research | 2026-05-22 |
| 6 | Claude Opus 4.7 | OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22% | Philo Labs Research | 2026-05-22 |
| 7 | Claude Opus 4.7 | OpenCode | 20.3% ± 3.1 | 23% | 17% | 18% | 24% | Philo Labs Research | 2026-05-22 |
| 8 | Gemini 3.1 Pro | Gemini CLI | 20.1% ± 3.8 | 24% | 12% | 25% | 19% | Philo Labs Research | 2026-05-22 |
| 9 | Gemini 3 Flash | OpenCode | 17.5% ± 2.7 | 22% | 11% | 15% | 22% | Philo Labs Research | 2026-05-22 |
| 10 | Claude Sonnet 4.6 | Claude Code | 16.6% ± 3.1 | 21% | 12% | 13% | 20% | Philo Labs Research | 2026-05-22 |
| 11 | GPT-5.4-mini | Codex | 14.5% ± 2.7 | 24% | 6% | 13% | 16% | Philo Labs Research | 2026-05-22 |
| 12 | Claude Sonnet 4.6 | OpenCode | 14.4% ± 2.4 | 26% | 9% | 6% | 17% | Philo Labs Research | 2026-05-22 |
| 13 | Claude Sonnet 4.6 | OpenClaw | 13.7% ± 2.5 | 16% | 13% | 14% | 13% | Philo Labs Research | 2026-05-22 |
| 14 | GPT-5.4-mini | OpenCode | 13.7% ± 2.4 | 24% | 6% | 12% | 13% | Philo Labs Research | 2026-05-22 |
| 15 | Gemini 3 Flash | Gemini CLI | 13.6% ± 2.7 | 20% | 3% | 12% | 19% | Philo Labs Research | 2026-05-22 |
| 16 | GPT-5.4-mini | OpenClaw | 7.7% ± 1.3 | 14% | 3% | 7% | 7% | Philo Labs Research | 2026-05-22 |
| 17 | Gemini 3 Flash | OpenClaw* | 5.5% ± 1.4 | 2% | 5% | 5% | 9% | Philo Labs Research | 2026-05-22 |
| 18 | Qwen3-VL-235B (OSS) | OpenClaw | 3.3% ± 1.1 | 1% | 2% | 2% | 7% | Philo Labs Research | 2026-05-22 |
| 19 | Qwen3-VL-235B (OSS) | OpenCode | 3.0% ± 1.2 | 4% | 1% | 6% | 1% | Philo Labs Research | 2026-05-22 |
| 20 | Gemini 3.1 Pro | OpenClaw* | 0.4% ± 0.2 | 0% | 0% | 2% | 0% | Philo Labs Research | 2026-05-22 |


* Scores for Gemini models in the OpenClaw harness are severely depressed by persistent request timeouts.

About the scoring

Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.
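As a rough illustration of the protocol above, the sketch below builds one rollout plan that every agent would reuse, so that task inputs and seeds match across agents. The function name `rollout_plan`, the task identifiers, and the way seeds are derived from a base seed are all assumptions made for this example; only the count of 3 rollouts per task comes from the text.

```python
ROLLOUTS_PER_TASK = 3  # stated in the scoring notes


def rollout_plan(task_ids, base_seed=0):
    """Return (task_id, rollout_index, seed) triples.

    Hypothetical helper: the same plan is meant to be reused for every
    agent, so each agent sees identical tasks and seeds. Deriving the seed
    as base_seed + rollout_index is an assumption, not the benchmark's rule.
    """
    return [
        (task_id, k, base_seed + k)
        for task_id in task_ids
        for k in range(ROLLOUTS_PER_TASK)
    ]


if __name__ == "__main__":
    for entry in rollout_plan(["task-001", "task-002"]):
        print(entry)
```

Building the plan once and passing it to each agent run keeps the comparison controlled: any score difference then reflects the model and harness rather than a different draw of seeds.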

Per-task-family scores are normalized to [0, 1]. Failed rollouts (cells where the harness produced no usable output) are scored as 0.
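The zero-for-failure rule can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the input layout, the use of `None` to mark a failed rollout, and the helper names are hypothetical, and computing Avg as the unweighted mean of the four family scores is an assumption (it matches the expert reference row, where 95%, 90%, 88%, and 81% average to 88.5%, but the published agent averages may be weighted differently).

```python
from statistics import mean

FAMILIES = ("Repurpose", "Seq", "Repair", "Assembly")


def family_score(rollout_scores):
    """Mean score over all rollouts in one task family.

    Each entry is a score in [0, 1], or None for a failed rollout
    (no usable output), which is counted as 0.
    """
    if not rollout_scores:
        raise ValueError("a task family needs at least one rollout")
    cleaned = []
    for score in rollout_scores:
        value = 0.0 if score is None else float(score)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        cleaned.append(value)
    return mean(cleaned)


def summarize(results):
    """Return per-family scores plus an overall Avg.

    Assumption: Avg is the unweighted mean of the four family scores.
    """
    summary = {family: family_score(results[family]) for family in FAMILIES}
    overall = mean(summary[family] for family in FAMILIES)
    summary["Avg"] = overall
    return summary


if __name__ == "__main__":
    example = {
        "Repurpose": [0.4, None, 0.5],   # one failed rollout
        "Seq": [0.2, 0.2, 0.2],
        "Repair": [None, None, None],    # every rollout failed
        "Assembly": [1.0, 0.5, 0.0],
    }
    for name, value in summarize(example).items():
        print(f"{name}: {value:.1%}")
```

Counting failures as 0 rather than dropping them means a harness that frequently produces no output is penalized in the average instead of being scored only on its successful runs.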

The expert reference row is computed on the same task set by the 20 industry professionals who authored the tasks and rubrics, each completing tasks written by other experts. The experts average 6 years of post-production experience.

Repurpose is scored by a grader model aligned to expert judgement. To reduce the risk of agents hacking the grader, we are not releasing the grader model at launch.

v1.0 (Post-Production) is the first of four benchmarks for the creative industries.
