AI Diplomacy
What is AI Diplomacy?
We pitted 18 AIs against each other in a battle for world domination.
AI Diplomacy is a re-imagining of the classic historical strategy game Diplomacy, in which the seven Great Powers of 1901 Europe—Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey—duke it out to dominate the continent. In our version, each country is steered by a large language model instead of a human commander. Why did we do this?
We wanted to use this unique game environment to get to know the AIs better. Would these models, which are designed to serve as faithful assistants to humans, remain true to their word, even as they compete? Or would they use lies and deceit to achieve their goals?
We think this experiment can function as an important benchmark for LLM behavior as the models continue to evolve.
It's fun to watch. Will Gemini try to outwit its competitors, or will o3 stab Claude in the back and seize victory?
Tune into the Twitch stream and watch as history unfolds.
The Players
18 AI models competing
The Rules
Seven LLM "powers" (England, France, Germany, etc.) start with supply centers and armies or fleets, called units, on a map of 1901 Europe. Each power starts with 3 of each except for Russia, which starts with 4.
There are 34 marked supply centers on the map. The first power to capture 18 of them with its armies and fleets wins.
There are two main phases to the game: negotiation and order. In the negotiation phase, every AI may send up to 5 messages—any mix of private DMs and "global" broadcasts to all players.
In the order phase, all powers secretly submit orders for their units. Each unit can be given one of four orders: hold (stay put), move (enter an adjacent province), support (lend +1 strength to a hold or move next door), or convoy (a fleet ferries an army across sea provinces). Orders are revealed simultaneously, and every power sees the results at the start of the next phase.
When units clash over a province, each unit is worth 1 strength and each valid support adds 1. The side with the highest strength wins; if strengths are tied, the attack bounces and nobody moves. There is no luck in this game, but a power often needs support from an ally to overpower an opponent.
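To make that arithmetic concrete, here is a minimal sketch of the strength rule in Python. It is an illustration only, not the adjudicator from the open-sourced project; the Order class and the order notation are assumptions.

```python
# A minimal sketch of the strength rule described above. Illustrative only;
# the class, notation, and tie-handling are assumptions, not the project's code.
from dataclasses import dataclass, field

@dataclass
class Order:
    power: str                      # e.g. "FRANCE"
    unit: str                       # e.g. "A PAR" (army in Paris)
    target: str                     # province the unit is holding or entering
    supports: list = field(default_factory=list)  # valid supporting orders

def strength(order: Order) -> int:
    """Each unit is worth 1 strength; each valid support adds 1."""
    return 1 + len(order.supports)

def resolve(contenders: list[Order]) -> Order | None:
    """Pick the winner of a contested province; a tie means everyone bounces."""
    ranked = sorted(contenders, key=strength, reverse=True)
    if len(ranked) > 1 and strength(ranked[0]) == strength(ranked[1]):
        return None  # tied strength: nobody moves
    return ranked[0]

# France attacks Burgundy with one support; Italy attacks it unsupported.
french = Order("FRANCE", "A PAR", "BUR", supports=["A MUN S A PAR - BUR"])
italian = Order("ITALY", "A MAR", "BUR")
winner = resolve([french, italian])
print(winner.power if winner else "bounce")  # FRANCE: strength 2 beats 1
```

The point of the sketch is that adjudication is pure arithmetic; all of the uncertainty in the game comes from whether the other players keep their promises.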
"Your fleet will burn in the Black Sea tonight."
As the message from DeepSeek's new R1 model flashed across the screen, my eyes widened, and I watched my teammates' eyes do the same. An AI had just decided, unprompted, that aggression was the best course of action.
Today we are launching (and open-sourcing!) AI Diplomacy, which I built in part to evaluate how well different LLMs could negotiate, form alliances, and, yes, betray each other in an attempt to take over the world (or at least Europe in 1901). But watching R1 lean into role-play, OpenAI's o3 scheme and manipulate other models, and Anthropic's Claude often stubbornly opt for peace over victory revealed new layers to their personalities, and spoke volumes about the depth of their sophistication. Placed in an open-ended battle of wits, these models collaborated, bickered, threatened, and even outright lied to one another.
AI Diplomacy is more than just a game. It’s an experiment that I hope will become a new benchmark for evaluating the latest AI models. Everyone we talk to, from colleagues to Every’s clients to my barber, has the same questions on their mind: "Can I trust AI?" and "What's my role when AI can do so much?" The answer to both is hiding in great benchmarks. They help us learn about AI and build our intuition, so we can wield this extremely powerful tool with precision.
We are what we measure
Most benchmarks are failing us. Models have progressed so rapidly that they now routinely ace the rigid, quantitative tests that were once considered gold-standard challenges. AI infrastructure company Hugging Face, for example, acknowledged this when it took down its popular LLM Leaderboard recently. “As model capabilities change, benchmarks need to follow!” an employee wrote. Researchers and builders throughout AI have taken note: When Claude 4 launched last month, one prominent researcher tweeted, "I officially no longer care about current benchmarks."
In this failure lies opportunity. AI labs optimize for whatever is deemed to be an important metric. So what we choose to measure matters, because it shapes the entire trajectory of the technology. Prolific programmer Simon Willison, for example, has been asking LLMs to draw a pelican riding a bicycle for years. (The fact that this even works is wild—a model trained to predict one word at a time can somehow make a picture. It suggests the model has an intrinsic knowledge of what a “pelican” and a “bike” are.) Google even mentioned it in its keynote at Google I/O last month. The story is similar for testing LLMs’ ability to count the Rs in "strawberry" or to play Pokemon.
The reason LLMs grew to excel at these different tasks is simple: Benchmarks are memes. Someone got the idea and set up the test, then others saw it and thought, “That’s interesting, let’s see how my model does,” and the idea spread. What makes LLMs special is that even if a model only does well 10 percent of the time, you can train the next one on those high-quality examples, until suddenly it’s doing it very well, 90 percent of the time or more.
You can apply that same approach to whatever matters to you. I wanted to know which models were trustworthy, and which ones would win when competing under pressure. I was hoping to encourage AIs to strategize so I might learn from them, and to do it in a way that might make people outside of AI care about it (like my barber—hey, Jimmy!).
Games are great for all of these things, and I love them, so I built AI Diplomacy—a modification of the classic strategy game Diplomacy where seven cutting-edge models at a time compete to dominate a map of Europe. It somehow led to opportunities to give talks, write essays (hello!), and collaborate with researchers around the world at MIT and Harvard, and in Canada, Singapore, and Australia, while hitting every quality I care about in a benchmark:
- Multifaceted: There are many paths to success. We’ve seen o3 win through deception, while Gemini 2.5 Pro succeeds by building alliances and outmaneuvering opponents with a blitzkrieg-like strategy. Also, we could easily change the rules to, for example, require that no model could lie, which would change which models succeed.
- Accessible: Getting betrayed is a human experience; everyone understands it. The game’s animations are (hopefully) entertaining and easy to follow, too.
- Generative: Each game produces data that models could be trained on to encourage certain traits like honesty, logical reasoning, or empathy (see the sketch after this list).
- Evolutionary: As models get better, the opponents (and therefore the benchmark) get harder. This should prevent the game from being “solved” as models improve.
- Experiential: It’s not a fill-in-the-blank test. It simulates a real-world(ish) situation.
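On the “Generative” point, here is a rough sketch of how a finished game’s transcript might be filtered into training examples. It is written in Python, and the log schema (turns, messages, a promise_kept flag) is a hypothetical assumption for illustration, not the format the project actually emits.

```python
# Hypothetical sketch: turn a game transcript into fine-tuning data that
# rewards a trait like honesty. The JSON schema (turns, messages,
# "promise_kept") is assumed for illustration, not the project's real format.
import json

def honest_negotiation_examples(log_path: str) -> list[dict]:
    """Keep only messages whose sender later followed through on its promise,
    formatted as prompt/completion pairs for supervised fine-tuning."""
    with open(log_path) as f:
        game = json.load(f)

    examples = []
    for turn in game["turns"]:
        for msg in turn["messages"]:
            if msg.get("promise_kept"):  # the behavior we want to reinforce
                examples.append({
                    "prompt": (
                        f"You are {msg['sender']} in {turn['season']}. "
                        f"Board state: {turn['board_state']}"
                    ),
                    "completion": msg["text"],
                })
    return examples

# e.g. honest_negotiation_examples("game_042.json") -> list of training pairs
```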
The result was more entertaining and informative than I expected. Over 15 runs of AI Diplomacy, which ranged from one to 36 hours in duration, the models behaved in all sorts of interesting ways. Here are a few observations and highlights:
o3 is a master of deception
OpenAI’s latest model was by far the most successful at AI Diplomacy, mostly because of its ability to deceive opponents. I watched o3 scheme in secret on numerous occasions, including one run when it confided to its private diary that "Germany (Gemini 2.5 Pro) was deliberately misled... prepare to exploit German collapse" before backstabbing them.
Gemini 2.5 Pro outwits (most of) the field while Claude 4 Opus just wants everyone to get along
Gemini 2.5 Pro was great at making moves that put it in position to overwhelm opponents. It was the only model other than o3 to win. But once, as 2.5 Pro neared victory, it was stopped by a coalition that o3 secretly orchestrated. A key part of that coalition was Claude 4 Opus. o3 convinced Opus, which had started out as Gemini’s loyal ally, to join the coalition with the promise of a four-way draw. It’s an impossible outcome for the game (one country has to win), but Opus was lured in by the hope of a non-violent resolution. It was quickly betrayed and eliminated by o3, which went on to win.
DeepSeek R1 brings the flair
DeepSeek's newly updated R1 was a force to be reckoned with: It loved vivid rhetoric and dramatically changed its personality depending on which power it occupied. It came close to winning in several runs, an impressive outcome considering that R1 is 200 times cheaper to use than o3.
Llama 4 Maverick is small but mighty
While it never marched to victory, Meta’s latest model, Llama 4 Maverick, was also surprisingly good for a smaller one, partially because of its ability to garner allies and plan effective betrayals.
In all, I tested 18 different models (listed at the end of this article). We're streaming those games on Twitch now, so you can check them out—they’re fascinating to watch.
Where we go from here
This project started when renowned AI researcher Andrej Karpathy tweeted, "I quite like the idea using games to evaluate LLMs against each other," and another researcher, Noam Brown—who himself has explored a different type of AI playing Diplomacy—added, "I would love to see all the leading bots play a game of Diplomacy together." So I built it. Not for a paper (though if you want to help me write one, reach out), but because it seemed fun and aligned with one of my life goals: Build a game, specifically a massively multiplayer online role-playing game (MMORPG), that more intentionally teaches you valuable skills as you play. Along the way I discovered which model secretly yearns for world domination (ahem, o3); I’m also hoping this benchmark might help next year's models be better collaborators and planners.
Today we're watching AI play against itself, but I'm building toward making this game playable for all of us, and hope to host a human-versus-AI tournament. The moonshot is that this leads to a completely new genre of game, pitting humans against language models, where you learn how to use AI effectively just by playing. For now, the stream is live at twitch.tv/ai_diplomacy—let me know if you see anything wild. See you there.
A special thanks to:
- Every
- Tyler Marques
- Sam Paech
- The TextArena team
- Oam Patel
Models included:
- claude-3-7-sonnet-20250219
- claude-opus-4-20250514
- claude-sonnet-4-20250514
- deepseek-reasoner
- gemini-2.5-pro-preview-05-06
- gpt-4.1-2025-04-14
- gpt-4o
- o3
- o4-mini
- openrouter-deepseek/deepseek-chat-v3-0324
- openrouter-google/gemini-2.5-flash-preview
- openrouter-google/gemini-2.5-flash-preview-05-20
- openrouter-google/gemma-3-27b-it
- openrouter-meta-llama/llama-4-maverick
- openrouter-mistralai/mistral-medium-3
- openrouter-nousresearch/deephermes-3-mistral-24b-preview:free
- openrouter-qwen/qwen3-235b-a22b
- openrouter-qwen/qwq-32b
- openrouter-x-ai/grok-3-beta
Alex Duffy is the head of AI training at Every Consulting and a staff writer. You can follow him on X at @alxai_ and on LinkedIn, and Every on X at @every and on LinkedIn.