In this installment of Playtesting, Alex Duffy shows why games might be the smartest approach to AI training right now. As the cofounder and CEO of Good Start Labs, he’s been exploring how game environments can improve AI capabilities across unexpected domains. His latest finding is surprising: Fine-tuning a model on the strategy game Diplomacy improved its performance on customer support and industrial operations benchmarks. Read on to learn why games generate the kind of data and behaviors that make AI better at the serious stuff, and what the Every team has learned from classics like StarCraft.—Kate Lee
Was this newsletter forwarded to you? Sign up to get it in your inbox.
It’s my job to make AI play games. One board game we’ve focused on at Good Start Labs is Diplomacy, a World War I simulation reportedly played by John F. Kennedy and Henry Kissinger. There are no dice and no luck. As everything shifts around you, all you can rely on are persuasion and strategy.
When we fine-tuned the Qwen3-235B model—an open-source model developed by the team at Chinese cloud computing company Alibaba Cloud—on thousands of rounds of Diplomacy, we found a more than 10 percent improvement in performance on other games such as the card game Hanabi and the word game Wordle. More encouraging, these improvements translated to other realms: The fine-tuned model also did better on Tau2, a benchmark that tests how well AI agents handle customer support conversations, and AssetOpsBench, IBM’s benchmark for industrial operations like equipment monitoring and maintenance.
It’s not a big leap to believe that improvement in one game could boost the model’s performance on others. But how does understanding WWI strategy make a model better at helping someone change their airline reservation or monitor equipment? Simple: Games reward specific behaviors. When you get good at those behaviors, they show up elsewhere.
When I asked my colleagues at Every what games had taught them, everyone had similar experiences. “StarCraft taught me how to cook,” Every’s head of platform Willie Williams tells me, recalling the high-speed, chess-like strategy game. “You have things that take different amounts of time, and you want them to land at the same time.” Our senior designer, Daniel Rodrigues, learned English from Pokémon before any classroom. AI editorial lead Katie Parrott credits board game mechanics with making her a more systematic thinker, a habit she now applies to designing AI workflows.
This transfer of skills from games to other domains works for AI, too—and we can measure it. Diplomacy trains a model to track context, reprioritize as conditions shift, and communicate strategically. Customer support, where information is often incomplete and requests shift, demands the same capabilities.
We trained our model on Diplomacy in a reinforcement learning environment where you can clearly score whether the AI did something right. Labs are racing to build these kinds of environments because they do something that feeding the models static data can’t: They give models feedback on their decisions, teaching them to strategize, not just recall facts.
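To make that concrete, here’s a minimal sketch of what a verifiable game environment can look like. The interface loosely follows the reset/step convention of RL libraries like Gym; the class name, the scoring rule, and the stubbed adjudicator are illustrative assumptions, not Good Start Labs’ actual setup.

```python
from dataclasses import dataclass

# A toy Diplomacy-flavored environment. The key property: reward comes
# from game rules, so every decision gets a verifiable score rather
# than a human label.
@dataclass
class DiplomacyEnv:
    turn: int = 0
    max_turns: int = 20
    centers_held: int = 3  # supply centers the agent controls

    def reset(self) -> str:
        """Start a new game and return the initial state as text."""
        self.turn, self.centers_held = 0, 3
        return self.observe()

    def observe(self) -> str:
        return f"Turn {self.turn}: you hold {self.centers_held} supply centers."

    def step(self, orders: str) -> tuple[str, float, bool]:
        """Apply the model's orders; return (observation, reward, done)."""
        gained = self.adjudicate(orders)  # rules engine, stubbed below
        self.centers_held += gained
        self.turn += 1
        reward = float(gained)  # e.g., +1 per supply center gained
        done = self.turn >= self.max_turns or self.centers_held >= 18
        return self.observe(), reward, done

    def adjudicate(self, orders: str) -> int:
        # Placeholder for a real Diplomacy adjudicator.
        return 1 if "support" in orders.lower() else 0
```

A training loop samples orders from the model, calls step, and uses the returned reward as the learning signal.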
When you train a model on text from the internet, it learns to predict words. If you train it in an environment with goals and feedback, the model starts to develop skills that look remarkably like strategy. It’s a glimpse of where AI training is headed: less scraping the web, more learning by doing.
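The difference between the two regimes fits in a few lines. Here is a schematic contrast, assuming PyTorch; both functions are simplified illustrations, not a production trainer.

```python
import torch.nn.functional as F

# 1) Static web text: the model is graded on predicting the next token.
def next_token_loss(logits, target_ids):
    # logits: (seq_len, vocab_size); target_ids: (seq_len,)
    return F.cross_entropy(logits, target_ids)

# 2) Interactive environment: the model is graded on how a whole
#    decision played out, via a REINFORCE-style policy gradient.
def policy_gradient_loss(action_logprob, reward, baseline=0.0):
    # Raise the probability of actions that scored above baseline.
    return -(reward - baseline) * action_logprob
```

The first objective teaches recall of what text tends to come next; the second teaches choices that lead to better outcomes.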
Write at the speed of thought
That gap between your brain and your fingers kills momentum. Monologue lets you speak naturally and get perfect text 3x faster, with your tone, vocabulary, and style kept intact. It auto-learns proper nouns, handles multilingual code-switching mid-sentence, and edits for accuracy. Get 1,000 free words to start.
The game is the curriculum
“You become good at whatever the system rewards,” Every’s AI & I producer Rachel Braun tells me. Diplomacy rewards tracking context, planning responses, and navigating shifting alliances—exactly the capabilities with which labs like Anthropic, OpenAI, and DeepMind are trying to imbue their models.
It’s also why Arcee, a U.S.-based AI lab that develops open-source models, is using our Diplomacy environment to train its Trinity models. That includes its flagship Trinity Large, a 400-billion-parameter model in one of the largest open-source model families from an American lab. Because it’s open source, people can build on top of it, adapt it to their problems, and make it better for everyone else.
What Arcee and other labs are betting on is a second way to improve AI—not by making models bigger, but by training them differently after they’re built. Instead of just feeding them more text to read, they’re putting models in game-like situations where they practice tasks, get feedback on what worked, and develop skills they can apply elsewhere. The next big leap will come from combining learning by doing with ingesting more data.
AI researcher Andrej Karpathy put it this way: By training models in multiple games and tasks where you can score success, what are known as verifiable tasks, “the LLMs spontaneously develop strategies that look like ‘reasoning’ to humans.” The environment becomes the models’ curriculum, and whoever designs that curriculum shapes what the model becomes good at and how.
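“Verifiable” just means the environment can grade the model without a human judge. Wordle makes a handy toy example; the partial-credit shaping below is a hypothetical choice, since real reward design varies.

```python
# Toy verifier: Wordle feedback is computable from the rules alone,
# so every guess gets an exact score with no human in the loop.
def score_wordle_guess(guess: str, answer: str) -> float:
    """Return 1.0 for a solve, else partial credit per letter in place."""
    if guess == answer:
        return 1.0
    greens = sum(g == a for g, a in zip(guess, answer))
    return 0.1 * greens  # hypothetical shaping; real designs differ
```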
The game is also the exam
But games don’t just train models; they generate data no one else has. Our AI agents have played hundreds of thousands of rounds of the party game Bad Cards alongside 2 million real users. In the game, players get a prompt—something like, “What’s the secret ingredient in Grandma’s cookies?”—and compete to submit the funniest answer. Our agents pick punchlines and learn from the votes, generating data that shows people’s preferences for humor shift over time. That’s data that can’t be scraped from anywhere on the internet.
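For a sense of what that data looks like, here’s a sketch of the kind of record a round of Bad Cards could emit. The field names are illustrative, not our actual schema; the point is that each round captures time-stamped human judgments that no web scrape contains.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class HumorPreference:
    prompt: str            # e.g., "What's the secret ingredient in Grandma's cookies?"
    candidates: list[str]  # punchlines submitted by agents and players
    votes: list[int]       # vote counts, aligned with candidates
    played_at: datetime    # lets you measure how tastes drift over time

    def winner(self) -> str:
        # The crowd's favorite punchline for this round.
        return self.candidates[self.votes.index(max(self.votes))]
```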
What users want from AI shifts faster than tests can measure, so static benchmarks become outdated quickly. Crowdsourced benchmarking project LM Arena just raised $150 million on this premise: The team is building an open platform for anyone to evaluate AI models by collecting feedback from human beings at scale.
Games are a natural fit for this continuous evaluation. They generate large amounts of data about real preferences, continuously refreshed. As more people interact with AI through play, they learn how these tools work, and their feedback—on what’s funny, for example—makes the next model better.
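Turning a stream of head-to-head votes into a living leaderboard is simple arithmetic. Below is an Elo-style rating update, the family of method arenas like LM Arena built on; the K-factor and 400-point scale are conventional chess defaults, not anyone’s published configuration.

```python
# Every fresh human vote nudges two model ratings toward the evidence.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```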
From StarCraft to the frying pan
Willie didn’t set out to learn cooking from StarCraft—he was trying to win. But the skills he learned showed up in his kitchen anyway.
AI development is exhibiting the same pattern. If you set a clear goal, the skills to reach it will follow.
Only people can define what those goals should be: what counts as a good decision, what’s funny, and what matters. That’s subjective, inherently human work. Games are where we focus because they turn fuzzy goals into scorable outcomes—exactly what models need to learn. Diplomacy is just one game among thousands. Each one teaches something different, and we’re just beginning to discover what translates—how war strategy can help with customer support, or when science-fiction video game skills will show up in the kitchen.
We’re off to a good start.
Alex Duffy is the cofounder and CEO of Good Start Labs, and a contributing writer.
To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.
We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue.
We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.
Get paid for sharing Every with your friends. Join our referral program.
For sponsorship opportunities, reach out to [email protected].
Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.
Comments
I'm reminded of the way animals, of many species, learn to be adults and succeed at real-world tasks, by playing. Play provides rules, guardrails, constraints, and instant feedback - whether it's an online game, or a lion cub getting cuffed and dust-rolled by its dad for biting him a little too hard. This is how biological brains are trained, too. So it's gratifying, and encouraging, to see it work for AI models. What it leads me to is, "What other real-world (human or animal) training models can be applied in some way to AI training?" What else is there - categorically different from the static data and game playing approaches? What could a quick scan of developmental psych - child development, for example - yield?
@semery this is EXACTLY the thread I'm so interested in pulling, thanks for the comment