Evaluations & RL Environments
GameDevBench: Evaluating Agentic Capabilities Through Game Development
A benchmark of 132 real game development tasks for evaluating agentic coding, multimodal reasoning, and graphics-aware capabilities.
Abstract
GameDevBench is an evaluation framework for assessing AI agents on real game development tasks. The benchmark comprises 132 tasks sourced from web and video tutorials, requiring agents to work with complex codebases and multimodal assets including shaders, sprites, and animations. The best agent solves only 54.5% of tasks overall, and performance declines sharply on graphics-focused work: just 31.6% on 2D graphics tasks. Adding image and video feedback mechanisms substantially improves performance, raising Claude Sonnet 4.5 from 33.3% to 47.7%.
Why this paper matters
- Establishes a hard, real-world agentic benchmark grounded in actual game development workflows.
- Quantifies a clear gap in graphics-aware and multimodal reasoning for current frontier coding agents.
- Demonstrates that visual feedback meaningfully closes the gap, motivating multimodal harness design.
Keywords
Agent evaluation, agentic benchmarks, coding agents, multimodal reasoning, game development, visual feedback, embodied agents.
BibTeX
@article{chi2026gamedevbench,
  title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
  author={Chi, Wayne and Fang, Yixiong and Yayavaram, Arnav and Yayavaram, Siddharth and Karten, Seth and Wei, Qiuhong Anna and Chen, Runkun and Wang, Alexander and Chen, Valerie and Talwalkar, Ameet and Donahue, Chris},
  journal={arXiv preprint arXiv:2602.11103},
  year={2026}
}