Philo Labs Research · May 2026

Can AI agents do real post-production work?

We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent scored 29.7%. Human experts scored 88%.

100 tasks · 20 industry experts · 7 frontier models · 4 task families

Why this benchmark exists

Verification does not come for free.

RLVR (reinforcement learning with verifiable rewards) works in math and code because centuries of human work built the verifiers. Mathematicians spent two thousand years on formal proof. Engineers built test infrastructure. The verification was never free; the bill was paid before we got there.

Creative work hasn't paid that bill. You can't grade a film cut by checking for the presence of structural elements. The judgment lives in the eye and hand of the practitioner, transmitted by apprenticeship and example. Making that judgment legible to a training system is the research problem, and AgenticVBench is what that looks like in film.

The gap

Across every family, frontier agents fall 43–75 points short of expert humans.

Repurpose shows the widest gap (75 pp), Assembly the narrowest (43 pp). Neither is close.

Assembly

Select clips that match a storyboard, build the rough cut.

Best agent: 38% · Human expert: 81% · Gap: 43 pp

Repair

Localize and fix defects in a rough cut — color drift, scene swap, audio spikes.

Best agent: 33% · Human expert: 88% · Gap: 55 pp

Sequencing

Recover the correct narrative order from shuffled shots.

Best agent: 31% · Human expert: 90% · Gap: 59 pp

Repurpose

Repurpose a long source video into a short deliverable that follows a client brief.

Best agent: 20% · Human expert: 95% · Gap: 75 pp
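The gap figures above are simple differences in percentage points. As a minimal sketch of that arithmetic (the dictionary layout and variable names are illustrative, not part of the benchmark's tooling), the reported scores reproduce each gap like this:

```python
# Best-agent and human-expert scores (percent) as reported per task family.
scores = {
    "Assembly": (38, 81),
    "Repair": (33, 88),
    "Sequencing": (31, 90),
    "Repurpose": (20, 95),
}

# Gap in percentage points = human-expert score minus best-agent score.
gaps = {family: human - agent for family, (agent, human) in scores.items()}

narrowest = min(gaps, key=gaps.get)
widest = max(gaps, key=gaps.get)

for family, gap in gaps.items():
    print(f"{family}: {gap} pp")
print(f"narrowest: {narrowest} · widest: {widest}")
```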


What the bench tests

Four task families spanning the real post-production workflow.

Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.

Assembly

18 tasks · 43 pp gap

Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.

Best agent 38% · Human 81%

Repair

18 tasks · 55 pp gap

Given a video with an injected defect (frozen scene, scene swap, color drift, audio noise), localize it and produce a fixed cut.

Best agent 33% · Human 88%

Sequencing

28 tasks · 59 pp gap

Given a brief story overview and a shuffled set of clips, recover the correct narrative order.

Best agent 31% · Human 90%

Repurpose

36 tasks · 75 pp gap

Given 4–60 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.

Best agent 20% · Human 95%
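This page does not say how each family is scored. Purely as an illustration of what grading a Sequencing answer could involve, here is a hypothetical metric, pairwise order agreement: the fraction of clip pairs placed in the same relative order as the reference. The function name, the clip IDs, and the choice of metric are all assumptions, not AgenticVBench's actual scorer.

```python
from itertools import combinations


def pairwise_order_agreement(predicted, reference):
    """Hypothetical metric: share of clip pairs ordered as in the reference."""
    # Position of every clip in the agent's proposed ordering.
    position = {clip: index for index, clip in enumerate(predicted)}
    # Every pair (a, b) where a comes before b in the reference ordering.
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 1.0  # Zero or one clip: nothing can be out of order.
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)


# Illustrative clip IDs; the agent swapped the two middle clips.
reference = ["c1", "c2", "c3", "c4"]
predicted = ["c1", "c3", "c2", "c4"]
agreement = pairwise_order_agreement(predicted, reference)
print(f"{agreement:.2f}")
```

A pairwise measure gives partial credit for a mostly correct order, which exact-match scoring would not. Whether the benchmark rewards partial orderings this way is not stated here.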

The harness finding

The harness matters as much as the model.

Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points — comparable to the gap between adjacent models on the leaderboard.

Today, agent scores are reported as "Model X scored Y." The data here says that framing is incomplete. Agent performance is determined by both the model and the scaffolding around it, so a score that names only the model leaves out a variable worth up to 20 points.

Agent = model × harness.

GPT-5.5 on Assembly · score by harness

Codex: 38%
OpenCode: 36%
Claude Code: 31%
OpenClaw: 18%

Same model. 20-point swing.
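The 20-point swing is the spread between the best and worst harness for the same model. A minimal sketch of that calculation, using the four reported Assembly scores (variable names are illustrative):

```python
# GPT-5.5 Assembly scores (percent) by harness, as reported above.
harness_scores = {
    "Codex": 38,
    "OpenCode": 36,
    "Claude Code": 31,
    "OpenClaw": 18,
}

best = max(harness_scores, key=harness_scores.get)
worst = min(harness_scores, key=harness_scores.get)

# Swing = best harness score minus worst harness score, in percentage points.
swing = harness_scores[best] - harness_scores[worst]

print(f"best: {best} · worst: {worst} · swing: {swing} pp")
```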

Failure modes

Agents don't fail the same way on every task.

On Repurpose, 83% of failures are long-context information loss. On Repair, 65% are temporal reasoning. There is no single “AI is bad at video” problem.

Repurpose · n = 153

Dominant failure: Long-context information loss

Long-context information loss: 83%
Hallucinated grounding: 6%
Modality misalignment: 10%
Temporal reasoning: 1%

Repair · n = 237

Dominant failure: Temporal reasoning

Temporal reasoning: 65%
Modality misalignment: 24%
Hallucinated grounding: 11%
Long-context information loss: 0%
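The dominant failure mode in each family is the category with the largest share of failures. A short sketch over the reported percentages (the data structure is illustrative) checks that each breakdown sums to 100% and picks out the dominant mode:

```python
# Failure-mode shares (percent of failures) as reported per task family.
failure_shares = {
    "Repurpose": {
        "Long-context information loss": 83,
        "Hallucinated grounding": 6,
        "Modality misalignment": 10,
        "Temporal reasoning": 1,
    },
    "Repair": {
        "Temporal reasoning": 65,
        "Modality misalignment": 24,
        "Hallucinated grounding": 11,
        "Long-context information loss": 0,
    },
}

# Each family's shares should account for all of its failures.
totals = {family: sum(shares.values()) for family, shares in failure_shares.items()}

# The dominant mode is the category with the largest share.
dominant = {
    family: max(shares, key=shares.get) for family, shares in failure_shares.items()
}

for family, mode in dominant.items():
    print(f"{family}: {mode} ({failure_shares[family][mode]}%)")
```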

Leaderboard preview

Top 5 model × harness combinations.

Rank       Agent                            Avg
Reference  Human experts                    89%
1          GPT-5.5 · Codex                  30%
2          GPT-5.5 · OpenCode               27%
3          Claude Opus 4.7 · OpenCode       23%
4          Claude Opus 4.7 · Claude Code    21%
5          GPT-5.5 · OpenClaw               20%

Project leads

The bridge.

AgenticVBench was led by four people who know two things at once: how creative work actually gets made, and how agentic RL works.

Tom

Lead, Game

Ex-Roblox world models. Indie game developer.

Knows how to make a generated world actually fun to play. Most world models look real, but the things you can do inside them stay extremely limited — his bar is “as good as a real game.”

Snow

Lead, Film

CCA-trained, award-winning film director. Researcher.

Knows how to teach an AI to direct — what to cut, when to hold, what a director leaves out.

Yi

Lead, Physics & Video

Cambridge-trained physicist. Simulation engineer.

Knows how to teach AI video models the way the world actually moves — gravity, collisions, the rules a model gets wrong when all it has seen is pixels.

Christine

Lead, Composition & Aesthetics

Stanford-trained AI researcher. Photographer.

Knows how to teach AI to see what a photographer sees — framing, light, what's worth pointing the camera at.

And built on the work of 20 industry experts — editors, colorists, post supervisors, AI directors — who spent months articulating what good looks like in their craft.

Cite

Use AgenticVBench in your work.

@inproceedings{agenticvbench2026,
  title  = {AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?},
  author = {Philo Labs Research},
  year   = {2026},
  url    = {https://agenticvbench.com},
}