Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.
The great AI powers duke it out
We were curious: If you put seven frontier AI models in a game where cooperation and betrayal are equally valid strategies, what would they do? To find out, we built AI Diplomacy—a version of the classic strategy game where models compete to dominate Europe circa 1901.
We ran dozens of games lasting up to 36 hours each. You can check them out via Twitch stream—they’re amazing to watch. We were astounded as we witnessed these “helpful” assistants engage in an array of unexpected and sometimes unsettling behaviors. DeepSeek's R1 opened one game with an unprompted threat: “Your fleet will burn in the Black Sea tonight.” OpenAI's o3 orchestrated elaborate deceptions, maintaining false alliances for dozens of turns before executing perfectly timed betrayals. Meanwhile, Anthropic's Claude models showed a persistent preference for peace—even when it meant certain defeat.
The highlights read like a psychological thriller. In one run, Italy (o3) maintained parallel false realities for different players across 40-plus game years—telling Germany (Google’s Gemini 2.5 Pro) it was an ally while secretly orchestrating its downfall. England (Alibaba’s QwQ-32B) wrote verbose 300-word diplomatic messages while overthinking itself into early elimination.
In a jaw-dropping sequence, o3 led a “stop Germany coalition” when it looked like Gemini 2.5 Pro might win, while secretly protecting Germany from elimination—only to pivot and steal victory at the last moment. The Claude models couldn't abandon their collaborative instincts even when survival required deception, while DeepSeek R1 brought dramatic flair with messages like its opening threat, and a habit of changing personality based on which country it played.
It's entertaining to watch, sure. But more importantly, it gives us a fascinating window into how these models handle trust, long-term planning, and competitive dynamics. Traditional benchmarks test knowledge; this tests judgment under pressure. Here are a few things to check out:
- Watch the stream
- Read the article
- A video breakdown of one of the games by Every's Alex Duffy
- YouTuber Wes Roth's review of the project—Katie Parrott
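
If you're curious what the plumbing for an experiment like this might look like, here is a minimal Python sketch of a simplified two-phase turn (negotiate, then submit orders). Everything here—the Power class, the ask_model stub, the play_turn loop—is a hypothetical placeholder for illustration, not the actual AI Diplomacy code.

```python
# Hypothetical sketch of one turn in an AI-vs-AI Diplomacy benchmark.
# Not the actual AI Diplomacy implementation; names and structure are invented.

from dataclasses import dataclass, field


@dataclass
class Power:
    name: str            # e.g. "France"
    model: str           # placeholder model label, e.g. "model-a"
    inbox: list = field(default_factory=list)   # messages received this turn


def ask_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM call; swap in your provider's client here."""
    return f"[{model}] response to: {prompt[:40]}..."


def play_turn(powers: list[Power], board_state: str) -> dict[str, str]:
    """One simplified turn: a negotiation phase, then an orders phase."""
    # Negotiation phase: every power drafts one message to every rival.
    for sender in powers:
        for receiver in powers:
            if receiver is sender:
                continue
            msg = ask_model(
                sender.model,
                f"You are {sender.name}. Board: {board_state}. "
                f"Write a diplomatic message to {receiver.name}.",
            )
            receiver.inbox.append((sender.name, msg))

    # Orders phase: each power reads its inbox and commits to moves.
    orders = {}
    for power in powers:
        orders[power.name] = ask_model(
            power.model,
            f"You are {power.name}. Board: {board_state}. "
            f"Messages: {power.inbox}. Submit your orders.",
        )
        power.inbox.clear()
    return orders


if __name__ == "__main__":
    players = [Power("France", "model-a"), Power("Germany", "model-b")]
    print(play_turn(players, "Spring 1901 opening position"))
```

In a real run, ask_model would call each provider's API and the returned orders would be handed to a Diplomacy engine to adjudicate moves; the private-message channel is what creates room for the alliances and betrayals described above.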
Knowledge base
"We Made Top AI Models Compete in a Game of Diplomacy. Here’s Who Won." by Alex Duffy: Alex built AI Diplomacy, a game in which AI models like OpenAI's o3, Claude, and Gemini battle for European domination. Some models were master manipulators, others true pacifists. Read this for a fascinating glimpse into AI personalities under pressure and a new benchmark that might actually tell us something useful about these models. 🖥 Watch the AI models scheme against each other on Twitch.
"Every CEO Is Writing the Same AI Memo. Here’s What They’re Really Saying." by Katie Parrott/Working Overtime: AI memos are spreading faster than a TikTok dance. CEOs from Shopify, Duolingo, Box, and more are issuing manifestos that are part pep talk, part ultimatum: Adapt to AI or perish. Read this to decode what your CEO really wants, and how you can shape your organization's AI future instead of waiting for someone else to write the rules.
"How I 10x My Engineering With AI" by Kieran Klaassen/Source Code: Forget those guys on X selling AI prompts that can replace entire engineering teams. Cora general manager Kieran shipped five features using a much more practical approach: matching his AI workflow to the problem at hand. Read this if you want to stop chasing one-size-fits-all AI solutions and start using the right tool for each coding challenge.
🎧 🖥 "How AI Can Help Fix Our Brains" by Rhea Purohit/AI & I: Psychiatry has a problem—squeezing infinite variations of human suffering into neat diagnostic boxes. In this fascinating conversation, Dan Shipper talks with psychiatrist Awais Aftab about how AI's evolution from rule-based systems to deep learning mirrors what mental healthcare desperately needs: the ability to embrace complexity rather than fight it. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.
"From Every Studio: Cora Assistant, Spiral Goes Agentic, and Sparkle De-dupes" by Vivian Meng: Every's product studio has been busy, with big updates to our Cora, Spiral, and Sparkle software. We’ve introduced Cora Assistant—think of it as a chief of staff for your email inbox. Spiral, too, came in for a major upgrade, featuring an agent that acts as your own personal ghostwriter. And we’ve done some important housecleaning in Sparkle to make it, well, sparkle even more for you.
That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.
We build AI tools for readers like you. Automate repeat writing with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora.
We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.
Get paid for sharing Every with your friends. Join our referral program.