Every
To Improve LLMs, Coach Them Like Athletes in an Arena
Midjourney/Every illustration.

To Improve LLMs, Coach Them Like Athletes in an Arena

Games will teach you more about model capabilities than benchmarks ever could

Aug 15, 2025Updated Jun 17, 2026

Comments1

Was this newsletter forwarded to you? Sign up to get it in your inbox.


This week I and several colleagues published our findings about how, with a little elbow grease and creativity, anyone can dramatically increase performance of any LLM.

The secret is in coaching. Allow me to explain.

The reason an athlete can credibly claim to be “best in the world” is because arenas and structured competition—games—exist. There are rules, clocks, standings, and tape you can study. The AI world has benchmarks—but benchmarks only check facts. Games reveal a model’s behavior, which can be recorded and studied to help models get better. That is what we did with AI Diplomacy, a project in which we turned the classic strategy game Diplomacy into a competitive arena for language models.

AI Diplomacy works because it has clear goals—try to outfox your opponents and take over Europe—and room to improvise. But subtlety and guile are key parts of the game, which centers on tactical negotiations (check out our complete list of rules). When we first set up the game environment, the LLMs were lost. After we got past a bunch of thorny technical problems, we realized that we could learn a ton about the models’ strengths and weaknesses from how they play against each other—and that we could coach them to be better. For example, prompting models to act more aggressively turned GPT-5 from a patsy into a formidable contender. Claude Sonnet 4, meanwhile, was a strong, speedy player even without specialized prompting.

These are useful differences. One model is highly steerable, the other is fast and consistent. That improvement tells you how the model will respond to a real-world task. If you have more time to craft a great prompt and need the best result, GPT-5 would be great. In a rush? Try Claude 4.

The industry is starting to realize that games can help evaluate models and push them to new levels of performance. Google has launched Google Arena, for instance, because the company says games are “[the] perfect testbed for evaluating models & agents.”

We agree. In fact, we think there’s so much potential here that we’re putting up $1,000 in prize money to see who can prompt their agent to victory in AI Diplomacy in our Battle of the Bots in September.

In the meantime, let’s break down our findings so far.

What AI Diplomacy taught us about models


Become a paid subscriber to Every to unlock this piece and learn about:

  1. How Alex and his team learned to plug LLMs into a complex game environment
  2. How prompting totally changed how models performed
  3. Quirky behavior that emerged, including models getting "drunk on power"


Create a free account to continue reading

The Only Subscription
You Need to Stay at the
Edge of AI

The essential toolkit for those shaping the future

"This might be the best value you
can get from an AI subscription."

- Jay S.

Every ContentEvery Content
AI&I PodcastAI&I Podcast
MonologueMonologue
CoraCora
SparkleSparkle
SpiralSpiral

Join 100,000+ leaders, builders, and innovators

Community members

Already have an account? Sign in.

What is included in a subscription?

Daily insights from AI pioneers + early access to powerful AI tools

PencilFront-row access to the future of AI
CheckIn-depth reviews of new models on release day
CheckPlaybooks and guides for putting AI to work
CheckPrompts and use cases for builders

Comments

You need to login before you can comment.
Don't have an account? Sign up!

We use analytics and advertising tools by default. You can update this anytime.