Evaluations & RL Environments

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue

arXiv preprint, 2026

A benchmark of 132 real game development tasks for evaluating agentic coding, multimodal reasoning, and graphics-aware capabilities.

Abstract

GameDevBench is a benchmark for evaluating AI agents on real game development tasks. It comprises 132 tasks sourced from web and video tutorials, requiring agents to work with complex codebases and multimodal assets including shaders, sprites, and animations. The best agent solves only 54.5% of tasks overall, and performance drops sharply on graphics-focused work, falling to 31.6% on 2D graphics tasks. Adding image and video feedback mechanisms substantially improves performance, raising Claude Sonnet 4.5 from 33.3% to 47.7%.

Why this paper matters

  • Establishes a hard, real-world agentic benchmark grounded in actual game development workflows.
  • Quantifies a clear gap in graphics-aware and multimodal reasoning for current frontier coding agents.
  • Demonstrates that visual feedback meaningfully closes the gap, motivating multimodal harness design.

Keywords

Agent evaluation, agentic benchmarks, coding agents, multimodal reasoning, game development, visual feedback, embodied agents.

BibTeX

@article{chi2026gamedevbench,
  title={GameDevBench: Evaluating Agentic Capabilities Through Game Development},
  author={Chi, Wayne and Fang, Yixiong and Yayavaram, Arnav and Yayavaram, Siddharth and Karten, Seth and Wei, Qiuhong Anna and Chen, Runkun and Wang, Alexander and Chen, Valerie and Talwalkar, Ameet and Donahue, Chris},
  journal={arXiv preprint arXiv:2602.11103},
  year={2026}
}