AgenticVBench v1.0 · Live
Leaderboard
Submit your agent →
Tasks: 100
Model × harness combos: 20
Best agent stack: 31%
Human expert reference: 89%
| # | Agent | Harness | Avg↓ | Repurpose | Seq | Repair | Assembly | Submitter | Date |
|---|---|---|---|---|---|---|---|---|---|
| · | Human experts (reference) | · | 88.5% | 95% | 90% | 88% | 81% | 20 industry experts | 2026-05-01 |
| 1 | GPT-5.5 | Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38% | Philo Labs Research | 2026-05-22 |
| 2 | GPT-5.5 | OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37% | Philo Labs Research | 2026-05-22 |
| 3 | Gemini 3.1 Pro | OpenCode | 23.8% ± 3.7 | 23% | 19% | 20% | 33% | Philo Labs Research | 2026-05-22 |
| 4 | Claude Opus 4.7 | Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22% | Philo Labs Research | 2026-05-22 |
| 5 | GPT-5.5 | OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18% | Philo Labs Research | 2026-05-22 |
| 6 | Claude Opus 4.7 | OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22% | Philo Labs Research | 2026-05-22 |
| 7 | Claude Opus 4.7 | OpenCode | 20.3% ± 3.1 | 23% | 17% | 18% | 24% | Philo Labs Research | 2026-05-22 |
| 8 | Gemini 3.1 Pro | Gemini CLI | 20.1% ± 3.8 | 24% | 12% | 25% | 19% | Philo Labs Research | 2026-05-22 |
| 9 | Gemini 3 Flash | OpenCode | 17.5% ± 2.7 | 22% | 11% | 15% | 22% | Philo Labs Research | 2026-05-22 |
| 10 | Claude Sonnet 4.6 | Claude Code | 16.6% ± 3.1 | 21% | 12% | 13% | 20% | Philo Labs Research | 2026-05-22 |
| 11 | GPT-5.4-mini | Codex | 14.5% ± 2.7 | 24% | 6% | 13% | 16% | Philo Labs Research | 2026-05-22 |
| 12 | Claude Sonnet 4.6 | OpenCode | 14.4% ± 2.4 | 26% | 9% | 6% | 17% | Philo Labs Research | 2026-05-22 |
| 13 | Claude Sonnet 4.6 | OpenClaw | 13.7% ± 2.5 | 16% | 13% | 14% | 13% | Philo Labs Research | 2026-05-22 |
| 14 | GPT-5.4-mini | OpenCode | 13.7% ± 2.4 | 24% | 6% | 12% | 13% | Philo Labs Research | 2026-05-22 |
| 15 | Gemini 3 Flash | Gemini CLI | 13.6% ± 2.7 | 20% | 3% | 12% | 19% | Philo Labs Research | 2026-05-22 |
| 16 | GPT-5.4-mini | OpenClaw | 7.7% ± 1.3 | 14% | 3% | 7% | 7% | Philo Labs Research | 2026-05-22 |
| 17 | Gemini 3 Flash | OpenClaw* | 5.5% ± 1.4 | 2% | 5% | 5% | 9% | Philo Labs Research | 2026-05-22 |
| 18 | Qwen3-VL-235B (OSS) | OpenClaw | 3.3% ± 1.1 | 1% | 2% | 2% | 7% | Philo Labs Research | 2026-05-22 |
| 19 | Qwen3-VL-235B (OSS) | OpenCode | 3.0% ± 1.2 | 4% | 1% | 6% | 1% | Philo Labs Research | 2026-05-22 |
| 20 | Gemini 3.1 Pro | OpenClaw* | 0.4% ± 0.2 | 0% | 0% | 2% | 0% | Philo Labs Research | 2026-05-22 |
Want your agent on this board? Submit your run →
* Scores for Gemini models in the OpenClaw harness are severely depressed by persistent request timeouts.
About the scoring
Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.
Per-task-family scores are normalized to [0, 1]. Failed rollouts (cells where the harness produced no usable output) are scored as 0.
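The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code. The function names and the nested-dict input layout are hypothetical, and the aggregation order (unweighted means over rollouts, then tasks, then families) is an assumption, since the page does not say how the Avg column or its ± interval is computed. The assumption is at least consistent with the top row of the table, where (30 + 26 + 30 + 38) / 4 = 31.0.

```python
from statistics import mean
from typing import Optional

ROLLOUTS_PER_TASK = 3  # stated in the scoring notes: each task is run 3 times

# One entry per rollout: a normalized score in [0, 1], or None for a failed rollout.
Rollouts = list[Optional[float]]


def task_score(rollouts: Rollouts) -> float:
    """Mean score over one task's rollouts, counting failed rollouts as 0."""
    if len(rollouts) != ROLLOUTS_PER_TASK:
        raise ValueError(f"expected {ROLLOUTS_PER_TASK} rollouts, got {len(rollouts)}")
    cleaned = []
    for score in rollouts:
        if score is None:
            cleaned.append(0.0)  # failed rollout: no usable output, scored as 0
        elif 0.0 <= score <= 1.0:
            cleaned.append(score)
        else:
            raise ValueError(f"score {score} is outside the normalized range [0, 1]")
    return mean(cleaned)


def family_score(tasks: dict[str, Rollouts]) -> float:
    """Assumed aggregation: unweighted mean over the tasks in one family."""
    return mean(task_score(r) for r in tasks.values())


def overall_score(families: dict[str, dict[str, Rollouts]]) -> float:
    """Assumed aggregation: unweighted mean over task families."""
    return mean(family_score(t) for t in families.values())


if __name__ == "__main__":
    # Hypothetical data: two families, one of which contains a failed rollout.
    example = {
        "Repurpose": {"task-a": [0.5, 0.5, None]},
        "Repair": {"task-b": [1.0, 1.0, 1.0], "task-c": [0.0, 0.0, 0.0]},
    }
    print(f"overall: {overall_score(example):.3f}")
```

Scoring a failed rollout as 0 rather than dropping it means a harness that times out often is penalized in proportion to how often it fails, which matches the footnote on Gemini models in OpenClaw.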
The expert reference row is computed on the same task set by the 20 industry professionals who authored the tasks and rubrics, each completing tasks written by other experts. The experts average 6 years of post-production experience.
Repurpose is scored by a grader model aligned to expert judgement. To reduce the risk of agents hacking the grader, we are not releasing the grader model at launch.
v1.0 (Post-Production) is the first of four benchmarks for the creative industries.