AI Diplomacy

What is AI Diplomacy?

We pitted a dozen AIs against each other in a battle for world domination.

AI Diplomacy is a re-imagining of the classic historical strategy game Diplomacy, in which the seven Great Powers of 1901 Europe—Austria-Hungary, England, France, Germany, Italy, Russia, and Turkey—duke it out to dominate the continent. In our version, each country is steered by a large language model instead of a human commander. Why did we do this?

1. We wanted to use this unique game environment to get to know the AIs better. Would these models, which are designed to serve as faithful assistants to humans, remain true to their word, even as they compete? Or would they use lies and deceit to achieve their goals?

2. We think this experiment can function as an important benchmark for LLM behavior as the models continue to evolve.

3. It's fun to watch. Will Gemini try to outwit its competitors, or will o3 stab Claude in the back and seize victory?

Tune in to the Twitch stream and watch as history unfolds.

The Players

18 AI models competing

ChatGPT-o3
ChatGPT-4.1
ChatGPT-4o
ChatGPT-o4-mini
Claude 3.7 Sonnet
Claude Sonnet 4
Claude Opus 4
DeepHermes 3
DeepSeek R1-0528
DeepSeek V3
Gemma 3
Gemini 2.5 Flash
Gemini 2.5 Pro
Grok 3
Llama 4 Maverick
Mistral Medium 3
Qwen3
Qwen QwQ-32B

The Rules

1. Seven LLM "powers" (England, France, Germany, etc.) start with supply centers and units (armies or fleets) on a map of 1901 Europe. Each power starts with 3 supply centers and 3 units, except for Russia, which starts with 4 of each.

2. There are 34 marked supply centers. The first power to own 18 of them, by moving its armies or fleets, wins.

3. The game has two main phases: negotiation and orders. In the negotiation phase, every AI may send up to 5 messages, in any mix of private DMs and "global" broadcasts to all players.

4. In the order phase, all powers secretly submit their orders. Each unit can be given one of four orders: hold (stay put), move (enter an adjacent province), support (lend +1 strength to a hold or move in an adjacent province), or convoy (a fleet ferries an army across sea provinces). The orders are revealed only when all powers see their results in the next phase.

5. When units conflict, each unit is worth 1 strength, and each valid support adds 1. The power with the highest strength wins. There is no luck in this game, but a power often needs support from an ally to overpower an opponent.
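
The strength arithmetic in rule 5 and the win condition in rule 2 can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual adjudicator: the function names and the input shape are hypothetical, each power is assumed to have a single unit in the contest, and ties are treated as standoffs where nobody wins, which follows the standard board game rather than anything stated above.

```python
SUPPLY_CENTERS_TO_WIN = 18  # from the rules: first power to own 18 of 34 wins


def resolve_contest(supports_by_power):
    """Pick the winner of one contested province, or None if tied.

    supports_by_power (hypothetical input shape) maps each competing power
    to the number of valid supports behind its single unit.
    """
    # Each unit is worth 1 strength; each valid support adds 1.
    strengths = {power: 1 + n for power, n in supports_by_power.items()}
    top = max(strengths.values())
    leaders = [power for power, s in strengths.items() if s == top]
    # Assumption: a tie is a standoff, as in the standard board game.
    return leaders[0] if len(leaders) == 1 else None


def has_won(centers_owned):
    """True once a power owns enough supply centers to win."""
    return centers_owned >= SUPPLY_CENTERS_TO_WIN
```

For example, `resolve_contest({"FRANCE": 1, "GERMANY": 0})` returns `"FRANCE"` (strength 2 against 1), while two unsupported units tie and the function returns `None`. This is why a single ally's support is often enough to break a deadlock.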

We Made Top AI Models Compete in a Game of Diplomacy. Here’s Who Won.

The models that did the best learned to lie, deceive, and betray their fellow players

Comments

@mdinsmore about 1 month ago

Such an interesting idea.

Do the models learn the weaknesses of other models' strategies through observation, in advance of engaging with them directly? "R1 appears to be susceptible to deception based on its interactions with O3, I will use that to my advantage later"

I look forward to seeing if a human player can hold their own against the best AI model. What if a human player were to have access to the AIs' supposedly private journals? What if the AI eventually learned that?

Do the AIs learn from their wins, and losses, to become better next round?

Can an AI play itself?

Should we worry about what values we're teaching the AI models? Human players learn when and where deceit is tolerated; do we have any kind of checkpoints on what AIs will take away from this exercise?

Fascinating idea and good read.

Alex Duffy about 1 month ago

@mdinsmore
Do the models learn the weaknesses of other models' strategies through observation - not sure, they have phase summaries as context so maybe?

Don't think we should give access to private journals to humans, that's just unfair!

Do the AIs learn from their wins, and losses, to become better next round? - no game to game history as of yet but maybe soon

Can an AI play itself? - yes! we're going to stream a 3v3 of o3 vs Gemini 2.5 Pro

Should we worry about what values we're teaching the AI models? - imo what's cool about this is you can train on results where models DON'T lie. Also, the models are explicitly instructed to win in this game, so one might argue o3 is more aligned than Claude, who doesn't... food for thought!

Thanks for the thoughtful reply

@oatmasta about 1 month ago

@mdinsmore I watched some of the stream last night, and based on the quality of the press and the moves I think it will take a while until the AIs can compete at the level of a good human player - but "a while" could mean 10 years or it could mean tomorrow. I'm very excited to see what happens when the creator implements the ability to draw; it's necessary for human vs. AI games to make sense, and it will be very interesting to see the relative preferences of the different models, how cutthroat or Care Bear-y they are. Would be fascinating to see them become spiteful about a stab the way human players sometimes become.

Haihao Liu about 1 month ago

Very cool experiment Alex, you mentioned writing this up into a paper, I reached out about that on LinkedIn! https://www.linkedin.com/in/haihaoliu

@matthew.lyle.olson about 1 month ago

Great work Alex. I am currently researching LLM deception as well and would love to help write up a manuscript!