May 2026

Can AI agents do real-world video post-production work?

We gave the 9 best frontier models 100 expert-authored tasks across the four stages of video post-production. The best agent tops out at 32%; human experts score 89%.


100 Tasks · 20 Industry experts · 9 Frontier models · 4 Task families

Why this benchmark exists

Verification does not come for free.

RLVR - reinforcement learning with verifiable rewards - works in math and code because centuries of humanistic work built the verifiers; the bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.

It also measures the sim2real gap: the distance between how agents score on tidy lab benchmarks and how they hold up on real post-production work. Here that gap is stark - the best frontier agent scores 32%, human experts 89%.


Leaderboard preview

Top 5 model × harness combinations.


Rank	Agent (model · harness)	Avg	Repurpose	Sequencing	Repair	Assembly
·	Human experts (reference)	88.5%	95%	90%	88%	81%
1	Claude Fable 5 · Claude Code	32.4% ± 5.1	29%	23%	31%	46%
2	GPT-5.5 · Codex	31.0% ± 4.0	30%	26%	30%	38%
3	GPT-5.5 · OpenCode	27.4% ± 3.5	27%	20%	27%	37%
4	Gemini 3.1 Pro · OpenCode	23.8% ± 3.7	23%	19%	20%	33%
5	MiniMax-M3 · OpenCode	22.7% ± 3.2	23%	12%	18%	37%

What the bench tests

Four task families spanning the real-world post-production workflow.

Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.

Assembly

18 tasks

35 pp gap

Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.

Best agent 46% · Human 81%

Repair

18 tasks

57 pp gap

Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut.

Best agent 31% · Human 88%

Sequencing

28 tasks

61 pp gap

Given a brief story overview and a shuffled set of clips, recover the correct narrative order.

Best agent 29% · Human 90%

Repurpose

36 tasks

65 pp gap

Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.

Best agent 30% · Human 95%
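The human-to-agent gap for each family follows directly from the two scores on its card. The sketch below recomputes it in percentage points; the dictionary layout and function names are illustrative assumptions, and because the inputs are the rounded percentages shown on the cards, the results may differ by a point from figures computed on unrounded data.

```python
# Rounded scores (percent) and task counts copied from the four family cards.
# The structure and names here are illustrative, not part of the benchmark code.
FAMILIES = {
    "Assembly": {"tasks": 18, "best_agent": 46, "human": 81},
    "Repair": {"tasks": 18, "best_agent": 31, "human": 88},
    "Sequencing": {"tasks": 28, "best_agent": 29, "human": 90},
    "Repurpose": {"tasks": 36, "best_agent": 30, "human": 95},
}


def pp_gap(stats):
    """Gap in percentage points between human experts and the best agent."""
    return stats["human"] - stats["best_agent"]


def gaps_by_family(families):
    """Map each task family to its human-minus-agent gap."""
    return {name: pp_gap(stats) for name, stats in families.items()}


if __name__ == "__main__":
    for name, gap in gaps_by_family(FAMILIES).items():
        print(f"{name}: {gap} pp gap")
    # Sanity check: the four families should account for all 100 tasks.
    print("total tasks:", sum(s["tasks"] for s in FAMILIES.values()))
```

On these inputs the gap widens from Assembly (35 pp) through Repair (57 pp) and Sequencing (61 pp) to Repurpose (65 pp).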

The harness finding

The harness matters as much as the model.

Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points, comparable to the gap between adjacent models on the leaderboard.

Most benchmarks today still report results per model alone. The data here says that is not enough. Agent performance is determined by both the model and the scaffolding around it. Reporting only the model misses the larger story.

Agent = model × harness.

GPT-5.5 on Assembly · score by harness

Codex: 38%
OpenCode: 37%
OpenClaw: 18%

Same model. 20-point swing.
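The swing can be checked from the three per-harness Assembly scores reported for GPT-5.5 (Codex 38%, OpenCode 37%, OpenClaw 18%). This is a minimal sketch of that arithmetic; the variable and function names are assumptions made for illustration, and the inputs are the rounded percentages from the page.

```python
# GPT-5.5 Assembly scores (percent) by harness, as reported on this page.
ASSEMBLY_BY_HARNESS = {"Codex": 38, "OpenCode": 37, "OpenClaw": 18}


def harness_swing(scores):
    """Return (best harness, worst harness, spread in percentage points)."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]


if __name__ == "__main__":
    best, worst, swing = harness_swing(ASSEMBLY_BY_HARNESS)
    print(f"{best} vs {worst}: {swing}-point swing with the model held fixed")
```

Nearly all of the spread comes from OpenClaw; Codex and OpenCode are within one point of each other.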

Cite this work

Citation

If you find AgenticVBench useful in your research, please consider citing the paper.

BibTeX

@article{cao2026agenticvbench,
  title={AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?},
  author={Cao, Zongheng and Zheng, Yi and Song, Rui and Hu, Xinyu},
  journal={arXiv preprint arXiv:2605.27705},
  year={2026}
}
