Philo Labs Research · May 2026
We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent scored 29.7%. Human experts scored 88%.
100 tasks · 20 industry experts · 7 frontier models · 4 task families
Why this benchmark exists
Reinforcement learning with verifiable rewards (RLVR) works in math and code because centuries of human work built the verifiers. Mathematicians spent two thousand years on formal proof. Engineers built test infrastructure. The verification was not free: the bill was paid before we got there.
Creative work hasn't paid that bill. You can't grade a film cut by checking for the presence of structural elements. The judgment lives in the eye and hand of the practitioner, transmitted by apprenticeship and example. Making that judgment legible to a training system is the research problem — and AgenticVBench is what it looks like when we do it in film.
The gap
Repurpose shows the widest gap (75 pp). Assembly shows the narrowest (43 pp). Neither is close.
Assembly · 43 pp gap
Select clips that match a storyboard, build the rough cut.
Repair · 55 pp gap
Localize and fix defects in a rough cut: color drift, scene swap, audio spikes.
Sequencing · 59 pp gap
Recover the correct narrative order from shuffled shots.
Repurpose · 75 pp gap
Repurpose a long source video into a short deliverable that follows a client brief.
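The gap figures are consistent with subtracting, for each task family, the best agent score in the leaderboard preview from the human expert score. That definition is inferred from the published numbers rather than stated outright, so treat the following as a minimal sketch; the variable and function names are chosen here for illustration.

```python
# Scores in percent, copied from the leaderboard preview.
human = {"Repurpose": 95, "Sequencing": 90, "Repair": 88, "Assembly": 81}

agents = {
    "GPT-5.5 / Codex": {"Repurpose": 19, "Sequencing": 31, "Repair": 21, "Assembly": 38},
    "GPT-5.5 / OpenCode": {"Repurpose": 20, "Sequencing": 29, "Repair": 33, "Assembly": 36},
    "Claude Opus 4.7 / OpenCode": {"Repurpose": 19, "Sequencing": 22, "Repair": 25, "Assembly": 24},
    "Claude Opus 4.7 / Claude Code": {"Repurpose": 18, "Sequencing": 21, "Repair": 23, "Assembly": 22},
    "GPT-5.5 / OpenClaw": {"Repurpose": 18, "Sequencing": 23, "Repair": 21, "Assembly": 18},
}


def gap_pp(family: str) -> int:
    """Assumed definition: human score minus the best agent score, in percentage points."""
    best_agent = max(scores[family] for scores in agents.values())
    return human[family] - best_agent


gaps = {family: gap_pp(family) for family in human}

# Print the families from widest to narrowest gap.
for family, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{family}: {gap} pp")
```

Under this assumption the best agent differs by family, so the gap is not tied to any single row of the leaderboard.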
What the bench tests
Tasks were authored by 20 industry experts averaging 6 years of post-production experience. Each task represents between 30 minutes and one week of human work.
Assembly · 18 tasks · 43 pp gap
Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.
Repair · 18 tasks · 55 pp gap
Given a video with an injected defect (frozen scene, scene swap, color drift, or audio noise), localize it and produce a fixed cut.
Sequencing · 28 tasks · 59 pp gap
Given a brief story overview and a shuffled set of clips, recover the correct narrative order.
Repurpose · 36 tasks · 75 pp gap
Given 4–60 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.
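To make the Assembly and Sequencing task shapes concrete, here is a hypothetical sketch of two simple checks: per-slot accuracy for Assembly and pairwise order agreement for Sequencing. These are illustrative assumptions, not AgenticVBench's actual scoring, which is not described here; the function names and clip identifiers are invented for the example.

```python
from itertools import combinations


def slot_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Hypothetical Assembly check: fraction of storyboard slots filled with the reference clip."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must have the same non-zero length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def pairwise_order_agreement(predicted: list[str], reference: list[str]) -> float:
    """Hypothetical Sequencing check: fraction of clip pairs placed in the reference order."""
    if sorted(predicted) != sorted(reference) or len(reference) < 2:
        raise ValueError("both orders must contain the same two or more clips")
    position = {clip: index for index, clip in enumerate(predicted)}
    pairs = list(combinations(reference, 2))  # each pair (a, b) has a before b in the reference
    agreeing = sum(position[a] < position[b] for a, b in pairs)
    return agreeing / len(pairs)


# Made-up clip identifiers: one wrong slot out of four, one swapped pair out of six.
assembly_score = slot_accuracy(["c3", "c1", "c5", "c2"], ["c3", "c1", "c4", "c2"])
sequencing_score = pairwise_order_agreement(["a", "c", "b", "d"], ["a", "b", "c", "d"])
print(assembly_score, sequencing_score)
```

Checks like these only cover the families with a single correct answer. Repair and Repurpose outputs are edited videos, where the judgment described earlier is harder to reduce to an exact match.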
The harness finding
Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points — comparable to the gap between adjacent models on the leaderboard.
Today, agent scores are reported as "Model X scored Y." The data here says that is wrong. Agent performance is determined by both the model and the scaffolding around it. Reporting only the model misses the larger story.
Agent = model × harness.
GPT-5.5 on Assembly · score by harness
Same model. 20-point swing.
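The 20-point figure can be reproduced from the leaderboard preview, assuming the swing is the spread between the best and worst harness for the same model. A minimal sketch, with names chosen here for illustration:

```python
# GPT-5.5 Assembly scores in percent, by harness, copied from the leaderboard preview.
assembly_by_harness = {"Codex": 38, "OpenCode": 36, "OpenClaw": 18}

best = max(assembly_by_harness, key=assembly_by_harness.get)
worst = min(assembly_by_harness, key=assembly_by_harness.get)

# Assumed definition of the swing: best harness minus worst harness, same model.
swing = assembly_by_harness[best] - assembly_by_harness[worst]

print(f"GPT-5.5 on Assembly: {best} {assembly_by_harness[best]}% "
      f"vs {worst} {assembly_by_harness[worst]}% ({swing} pp swing)")
```

Keying results by harness as well as model, as the dictionary does here, is one way to keep the scaffolding visible when reporting a score.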
Failure modes
On Repurpose, 83% of failures are long-context information loss. On Repair, 65% are temporal reasoning. There is no single “AI is bad at video” problem.
Repurpose · n = 153 · dominant failure: long-context information loss
Repair · n = 237 · dominant failure: temporal reasoning
Leaderboard preview
| Rank | Agent | Avg | Repurpose | Sequencing | Repair | Assembly |
|---|---|---|---|---|---|---|
| - | Human experts (reference) | 89% | 95% | 90% | 88% | 81% |
| 1 | GPT-5.5 · OpenCode | 30% | 20% | 29% | 33% | 36% |
| 2 | GPT-5.5 · Codex | 27% | 19% | 31% | 21% | 38% |
| 3 | Claude Opus 4.7 · OpenCode | 23% | 19% | 22% | 25% | 24% |
| 4 | Claude Opus 4.7 · Claude Code | 21% | 18% | 21% | 23% | 22% |
| 5 | GPT-5.5 · OpenClaw | 20% | 18% | 23% | 21% | 18% |
Project leads
AgenticVBench was led by four people who know two things at once: how creative work actually gets made, and how agentic RL works.

Lead, Game
Ex-Roblox world models. Indie game developer.
Knows how to make a generated world actually fun to play. Most world models look real, but the things you can do inside them stay extremely limited — his bar is “as good as a real game.”

Lead, Film
CCA-trained, award-winning film director. Researcher.
Knows how to teach an AI to direct — what to cut, when to hold, what a director leaves out.

Lead, Physics & Video
Cambridge-trained physicist. Simulation engineer.
Knows how to teach AI video models the way the world actually moves — gravity, collisions, the rules a model gets wrong when all it has seen is pixels.

Lead, Composition & Aesthetics
Stanford-trained AI researcher. Photographer.
Knows how to teach AI to see what a photographer sees — framing, light, what's worth pointing the camera at.
And built on the work of 20 industry experts — editors, colorists, post supervisors, AI directors — who spent months articulating what good looks like in their craft.
Cite
@inproceedings{agenticvbench2026,
  title  = {AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?},
  author = {{Philo Labs Research}},
  year   = {2026},
  url    = {https://agenticvbench.com},
}