AgenticVBench v1.0 · Live

Tasks: 100
Model × harness combos: 20
Best agent stack: 30%
Human expert reference: 89%

11 agents · sorted by Avg
| # | Agent | Harness | Avg | Repurpose | Seq | Repair | Assembly | Submitter | Date |
|---|---|---|---|---|---|---|---|---|---|
| · | Human experts (reference) | · | 89% | 95% | 90% | 88% | 81% | 20 industry experts | 2026-05-01 |
| 2 | GPT-5.5 | OpenCode (OSS) | 30% | 20% | 29% | 33% | 36% | Philo Labs Research | 2026-05-13 |
| 1 | GPT-5.5 | Codex | 27% | 19% | 31% | 21% | 38% | Philo Labs Research | 2026-05-13 |
| 7 | Gemini 3.1 Pro | Gemini CLI | 23% | 20% | 20% | 19% | 33% | Philo Labs Research | 2026-05-13 |
| 3 | Claude Opus 4.7 | OpenCode (OSS) | 23% | 19% | 22% | 25% | 24% | Philo Labs Research | 2026-05-13 |
| 4 | Claude Opus 4.7 | Claude Code | 21% | 18% | 21% | 23% | 22% | Philo Labs Research | 2026-05-13 |
| 8 | Gemini 3.1 Pro | OpenCode (OSS) | 21% | 19% | 17% | 18% | 30% | Philo Labs Research | 2026-05-13 |
| 5 | GPT-5.5 | OpenClaw (OSS) | 20% | 18% | 23% | 21% | 18% | Philo Labs Research | 2026-05-13 |
| 6 | GPT-5.4-mini | Codex | 18% | 17% | 18% | 16% | 19% | Philo Labs Research | 2026-05-13 |
| 9 | Gemini 3 Flash | Codex | 13% | 4% | 13% | 14% | 22% | Philo Labs Research | 2026-05-13 |
| 10 | Qwen3-VL-235B | OpenClaw (OSS) | 9% | 8% | 10% | 12% | 7% | Philo Labs Research | 2026-05-13 |
| 11 | Qwen3-VL-235B | OpenCode (OSS) | 6% | 6% | 8% | 10% | 1% | Philo Labs Research | 2026-05-13 |


About the scoring

Each agent runs each task 3 times with a fixed tool schema, identical inputs, matching rollout limits, and shared random seeds.

Scores for each task family are normalized to [0, 1] and shown on the board as percentages. Failed rollouts (cells where the harness produced no usable output) are scored as 0.
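As a rough illustration of the zero-for-failure rule, the sketch below averages a set of rollout scores while counting a missing result as 0. It also recomputes an Avg value from the four family percentages on the board. The function names are hypothetical. The page does not say how rollouts are pooled into a family score or how Avg is derived, so both steps are assumptions: the first is a plain mean, and the second is an unweighted mean with half-up rounding, which happens to match the rounded figures shown in every row.

```python
from math import floor
from statistics import mean
from typing import Optional, Sequence


def family_score(rollouts: Sequence[Optional[float]]) -> float:
    """Mean of normalized rollout scores in [0, 1].

    A failed rollout is represented as None and counted as 0.
    Pooling by a plain mean is an assumption, not the benchmark's
    documented method.
    """
    if not rollouts:
        raise ValueError("at least one rollout is required")
    scores = [0.0 if r is None else r for r in rollouts]
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("rollout scores must lie in [0, 1]")
    return mean(scores)


def average_percent(family_percents: Sequence[int]) -> int:
    """Unweighted mean of family percentages, rounded half up.

    Assumption: Avg is the simple mean of the four family columns.
    Python's built-in round() rounds halves to even, so floor(x + 0.5)
    is used instead.
    """
    return floor(mean(family_percents) + 0.5)


# Three rollouts of one task, the second of which failed.
print(family_score([0.5, None, 1.0]))

# Repurpose, Seq, Repair, Assembly for the GPT-5.5 + OpenCode row.
print(average_percent([20, 29, 33, 36]))
```

Because the inputs to `average_percent` are the already-rounded percentages from the board, agreement with the Avg column is suggestive rather than proof of the underlying formula.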

The expert reference row is computed from the results of 20 industry professionals, averaging 6 years of post-production experience, on the same task set.