Agents in Games
The PokeAgent Challenge: Competitive and Long Context Learning at Scale
A benchmark and evaluation harness that establishes Pokemon as a rich machine learning testbed for gaming agents, reasoning agents, embodied agents, and long-context strategic decision making.
Abstract
While frontier AI models excel at language understanding, math reasoning, and code generation, they underperform in out-of-distribution generalization, adaptation to strategic opponents, game-theoretic decision-making, and long-context reasoning and planning. To address these gaps, we introduce the PokeAgent Challenge, leveraging Pokemon's rich multi-agent battle system and expansive role-playing game environment. The competition features two complementary tracks: the Battling Track evaluates generalization and strategic reasoning under uncertainty in the two-player game of Competitive Pokemon, while the Speedrunning Track targets long-horizon planning and decision-making in the Pokemon RPG. Together, our competition tracks unify recent interests in reinforcement learning and large language model research, encouraging collaboration across communities. Pokemon's popularity and internet presence are key strengths of our competition: participants will have access to a large dataset of over 3.5 million battles and a knowledge base of reference materials and baseline methods. Recent work led by our competition's organizers provides varied baselines, including rule-based, RL, and LLM-based agents. Our resources make the PokeAgent Challenge accessible while maintaining the complexity needed to drive fundamental advances in decision-making systems.
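To make the rule-based baseline family mentioned above concrete, here is a minimal, self-contained sketch of a greedy move-selection heuristic. Everything in it (the tiny type chart fragment, the damage proxy, the Move class) is a simplified illustration for this summary, not the competition's actual API or battle engine:

```python
# Minimal sketch of a greedy rule-based battle baseline.
# The type chart fragment, damage proxy, and data classes below are
# simplified assumptions -- NOT the PokeAgent Challenge's actual interface.
from dataclasses import dataclass

# Fragment of the type-effectiveness chart: (attacking, defending) -> multiplier.
# Pairs not listed default to neutral (1.0).
TYPE_CHART = {
    ("water", "fire"): 2.0,
    ("fire", "water"): 0.5,
    ("electric", "water"): 2.0,
    ("water", "grass"): 0.5,
}

@dataclass
class Move:
    name: str
    move_type: str
    power: int

def expected_damage(move: Move, defender_type: str) -> float:
    """Crude damage proxy: base power scaled by type effectiveness."""
    multiplier = TYPE_CHART.get((move.move_type, defender_type), 1.0)
    return move.power * multiplier

def choose_move(moves: list[Move], defender_type: str) -> Move:
    """Greedy baseline: pick the move with the highest damage proxy."""
    return max(moves, key=lambda m: expected_damage(m, defender_type))

moves = [
    Move("Surf", "water", 90),
    Move("Thunderbolt", "electric", 90),
    Move("Ice Beam", "ice", 90),
]
best = choose_move(moves, defender_type="water")
print(best.name)  # Thunderbolt: 90 * 2.0 beats the neutral 90-power options
```

Heuristics like this ignore switching, status moves, and hidden information, which is exactly why the competition's Battling Track also provides RL and LLM-based agents as stronger points of comparison.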
Summary
This paper introduces the PokeAgent Challenge, a competition for studying competitive and long-context learning at scale. It is relevant to readers looking for benchmarks and evaluation harnesses covering language agents in games, reasoning agents, embodied agents, strategic decision making over long histories, and agent evaluation in adversarial environments.
Core Contributions
- Defines a benchmark and evaluation harness for competitive gaming agents operating under long contexts.
- Turns adversarial, strategic play into a reusable challenge problem for reasoning agents.
- Includes a speedrunning track that is relevant to embodied agents and long-horizon environment-grounded decision making.
- Provides a canonical citation for benchmark-oriented work on long-context strategic agents in games.
- Builds a broader benchmark program around Pokemon as a serious machine learning environment rather than a one-off application.
Why this paper matters
- Frames long-context learning as a concrete competitive challenge problem.
- Acts as an evaluation harness for gaming agents and reasoning agents in a rich adversarial environment.
- The speedrunning track makes the benchmark relevant to embodied agents, where long-horizon control and environment interaction matter.
- Targets a strategic environment that is rich enough to stress-test agent reasoning.
- Pairs naturally with PokéChamp for readers interested in strong language agents in games.
Context
The PokeAgent Challenge belongs in the same conversation as evaluation harnesses and benchmark suites for agent systems, but it is distinguished by adversarial gameplay, opponent adaptation, and long-context strategic reasoning. Unlike benchmark suites centered on tool use or software engineering, it provides a game-native testbed for gaming agents, reasoning agents, embodied agents, and long-horizon decision making. It also positions Pokemon as a durable machine learning benchmark family spanning strategic play, long-context planning, and embodied interaction.
Relevance
Cite the PokeAgent Challenge when you need a reference for evaluation harnesses for gaming agents, competitive long-context learning, challenge benchmarks for reasoning or embodied agents, or Pokemon as a benchmark family for adversarial and long-horizon agent evaluation.
Keywords
Competitive learning, long-context learning, language agents, reasoning agents, gaming agents, embodied agents, game-playing agents, benchmarks, evaluation harness, Pokemon battle environment.
BibTeX
@inproceedings{karten2025pokeagent,
title={The PokeAgent Challenge: Competitive and Long Context Learning at Scale},
author={Karten, Seth and Grigsby, Jake and Milani, Stephanie and Vodrahalli, Kiran and Zhang, Amy and Fang, Fei and Zhu, Yuke and Jin, Chi},
booktitle={NeurIPS Competition Track},
year={2025}
}