AgenticVBench v1.0 · Live
Leaderboard
Tasks: 100
Model × harness combos: 20
Best agent stack: 30%
Human expert reference: 89%
11 agents · sorted by Avg ↓
| # | Agent | Harness | Avg ↓ | Repurpose | Seq | Repair | Assembly | Submitter | Date | Trajectories |
|---|---|---|---|---|---|---|---|---|---|---|
| · | Human experts (reference) | · | 89% | 95% | 90% | 88% | 81% | 20 industry experts | 2026-05-01 | |
| 2 | GPT-5.5 (OSS) | OpenCode | 30% | 20% | 29% | 33% | 36% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 1 | GPT-5.5 | Codex | 27% | 19% | 31% | 21% | 38% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 7 | Gemini 3.1 Pro | Gemini CLI | 23% | 20% | 20% | 19% | 33% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 3 | Claude Opus 4.7 (OSS) | OpenCode | 23% | 19% | 22% | 25% | 24% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 4 | Claude Opus 4.7 | Claude Code | 21% | 18% | 21% | 23% | 22% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 8 | Gemini 3.1 Pro (OSS) | OpenCode | 21% | 19% | 17% | 18% | 30% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 5 | GPT-5.5 (OSS) | OpenClaw | 20% | 18% | 23% | 21% | 18% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 6 | GPT-5.4-mini | Codex | 18% | 17% | 18% | 16% | 19% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 9 | Gemini 3 Flash | Codex | 13% | 4% | 13% | 14% | 22% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 10 | Qwen3-VL-235B (OSS) | OpenClaw | 9% | 8% | 10% | 12% | 7% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
| 11 | Qwen3-VL-235B (OSS) | OpenCode | 6% | 6% | 8% | 10% | 1% | Philo Labs Research | 2026-05-13 | trajectories ↗ |
Want your agent on this board? Submit your run →
About the scoring
Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.
Per-task-family scores are normalized to [0, 1] and shown in the table as percentages. Failed rollouts (runs in which the harness produced no usable output) are scored as 0.
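As a rough illustration of the rule above, the sketch below averages rollout scores per task family and counts failed rollouts as 0. The record layout, the `family_scores` helper, and the use of a plain mean over rollouts are assumptions made for the example; the board does not state how the 3 runs per task are combined.

```python
from statistics import mean
from typing import Optional

# Hypothetical rollout record: (task_family, task_id, score).
# A score of None marks a failed rollout (no usable output from the harness).
Rollout = tuple[str, str, Optional[float]]


def family_scores(rollouts: list[Rollout]) -> dict[str, float]:
    """Mean score per task family, with failed rollouts counted as 0.

    Assumption: rollouts are combined with an unweighted mean.
    """
    per_family: dict[str, list[float]] = {}
    for family, _task_id, score in rollouts:
        value = 0.0 if score is None else score
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {value} is outside [0, 1]")
        per_family.setdefault(family, []).append(value)
    return {family: mean(values) for family, values in per_family.items()}


# Made-up scores for illustration only, not benchmark data.
example = [
    ("Repair", "task-a", 0.5),
    ("Repair", "task-a", None),  # failed rollout, scored as 0
    ("Repair", "task-a", 1.0),
    ("Seq", "task-b", 0.25),
    ("Seq", "task-b", 0.25),
    ("Seq", "task-b", 0.25),
]
result = family_scores(example)
print(result)  # {'Repair': 0.5, 'Seq': 0.25}
```

Under this reading, a single failed rollout lowers the family score instead of being dropped from the average, so an unreliable harness is penalized even when its successful runs score well.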
The expert reference row is computed on the same task set by 20 industry professionals averaging 6 years of post-production experience.
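The board does not say how the Avg column is derived. The check below tests one plausible reading against the published numbers: that Avg is the unweighted mean of the four family percentages, rounded half up. This is an assumption, and the `avg_half_up` helper and row labels are made up for the example. Because the inputs are already rounded percentages, agreement shows only that the table is consistent with this reading, not that it is the official formula.

```python
import math

# (Repurpose, Seq, Repair, Assembly) percentages and the published Avg,
# copied from the leaderboard table. Keys are "model / harness" labels.
rows = {
    "Human experts": ((95, 90, 88, 81), 89),
    "GPT-5.5 / OpenCode": ((20, 29, 33, 36), 30),
    "GPT-5.5 / Codex": ((19, 31, 21, 38), 27),
    "Gemini 3.1 Pro / Gemini CLI": ((20, 20, 19, 33), 23),
    "Claude Opus 4.7 / OpenCode": ((19, 22, 25, 24), 23),
    "Claude Opus 4.7 / Claude Code": ((18, 21, 23, 22), 21),
    "Gemini 3.1 Pro / OpenCode": ((19, 17, 18, 30), 21),
    "GPT-5.5 / OpenClaw": ((18, 23, 21, 18), 20),
    "GPT-5.4-mini / Codex": ((17, 18, 16, 19), 18),
    "Gemini 3 Flash / Codex": ((4, 13, 14, 22), 13),
    "Qwen3-VL-235B / OpenClaw": ((8, 10, 12, 7), 9),
    "Qwen3-VL-235B / OpenCode": ((6, 8, 10, 1), 6),
}


def avg_half_up(families: tuple[int, ...]) -> int:
    """Unweighted mean of the family percentages, rounded half up."""
    return math.floor(sum(families) / len(families) + 0.5)


# Collect rows whose published Avg differs from the recomputed value.
mismatches = {
    name: (avg_half_up(families), published)
    for name, (families, published) in rows.items()
    if avg_half_up(families) != published
}
print(mismatches)  # {}
```

Rounding half up is used here because Python's built-in `round` rounds halves to the nearest even integer, which would turn 88.5 into 88 and 22.5 into 22 and so disagree with the published 89% and 23%.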