ChatGPT/Every illustration.

Why o3 Is the Best Model Yet for Real-world Learning

I asked two OpenAI LLMs to help me get fit—one stood out.

When OpenAI’s new reasoning model o3 came out, Every’s CEO Dan Shipper and OpenAI’s Sam Altman agreed that AI is changing the future of learning: If you aren’t using it to learn every day, they said, you’re “not going to make it.”

OK, I thought, I’ve got a challenge for o3: Make me physically stronger. Ten times stronger, in fact.

It’s been a life goal of mine to improve my chin-ups. I started 2024 unable to do even one, and months of working out alone got me nowhere. It wasn’t until I started working with a calisthenics trainer, Silvia, that I finally, after half a dozen focused sessions, got my first shaky repetition.

Now I want to do 10.

What better way to test AI’s capacity for teaching people in the real world than to ask it to help me achieve a goal I’ve never even come close to? 

The more I thought about it, the more I liked this plan. I’d pit GPT-4o against o3 and see which model gave me a better chance of progressing from one to 10 unassisted chin-ups. I wanted to know which one would be a better teacher: 4o, the fast and reliable model I’ve been using as my daily driver, or o3, the more advanced reasoning model. Would either be up to the task? Would one emerge victorious? Let’s find out. 

What I’m going to judge GPT-4o and o3 on

I used GPT-4o, OpenAI’s older standard model, and o3 separately to generate a training plan. I created a rubric against which to evaluate the models, based on what I think matters when you’re trying to learn something in the real world: quick feedback so you don’t make the same mistake over and over again, advice that’s tailored to your specific situation, incremental progress, and the motivation to keep going.

  • Responsiveness: How quickly do I get feedback?
  • Personalization: Is the advice tailored to me?
  • Progress: Does it help me get closer to my goal? 
  • Motivation: How excited am I to keep showing up and putting in the work?

To judge the LLMs’ training plans, I also needed to define what “good” looks like. I trust my trainer, and she’s already delivered real results—so her guidance and the techniques she uses with me will serve as my baseline, the standard against which I’ll measure everything else. 

GPT-4o’s training plan

Alright, first up: GPT-4o. Language models are only as good as the context you give them, so I made sure to be specific. In my prompt, I included my age, height, weight, the number of chin-ups I can currently do, available equipment, and training schedule. I also attached videos of me doing one unassisted and one assisted chin-up.

What worked

It set an achievable target

4o started by telling me that going from one to 10 unassisted chin-ups in just one month was an unrealistic goal (I did this exercise a few days before OpenAI pushed the infamous update that made 4o disingenuously agreeable toward users). This tracks with what Silvia told me, and just like her, 4o gave me an interim goal of four to six reps to keep me motivated. OpenAI’s flagship model was off to a great start.

It picked up on key details in the video 

The video I uploaded records my clenched face and pursed lips in the struggle to get my chin the last few inches over the bar. If you saw it, you’d definitely notice how hard I was trying—and to my surprise, GPT-4o did too. It said that I was getting over the bar “with solid effort” and even called out the slight tension in my shoulders before I started the chin-up (ideal form would have me doing a chin-up from a dead hang; in other words, no tension in my shoulders). Props to the model for pulling out such granular detail, comparable to the advice of a personal trainer. 

It structured the training well

The model split my training into two parts: strength and control on one day, volume and endurance on the other. Every week in the plan followed this structure, which lined up closely with how Silvia designs my workouts—a good sign.

What didn’t work

The training plan was overloaded

GPT-4o packed a lot into each training session, maybe too much. The workouts it suggested called for more repetitions than the ones Silvia gives me, and Silvia already pushes me hard. 4o also didn’t tell me how long I should rest between training days to allow for proper muscle recovery. Rest is a critical part of strength training; skip it, and you risk undermining all the progress you’re working toward.

To its credit, when I asked about rest periods directly, GPT-4o gave me an answer that lined up with what Silvia recommends. But for something this important I would’ve expected it to flag it upfront, not wait for a follow-up.

The plan was confusing

The way GPT-4o laid out the training plan made it hard to follow. It listed all the exercises in one big block up top, and then buried details about how the reps would increase over the month at the bottom. I had to keep scrolling back and forth to piece everything together, which was annoying and made the plan feel harder to stick to. When you’re learning something new, the friction to get started should be as low as possible, and on that front, GPT-4o fell short.

The chunk of text 4o had at the beginning of the plan.

… and then way down below, the increasing repetitions.

o3’s training plan

Next up: o3. I used the same prompt I gave GPT-4o, so it had the same starting information.

What worked

It made a plan that was easy to understand

Right away, o3’s training plan felt easier to use. It broke everything down into a neat table, with a clear weekly focus and built-in progression goals that tracked how reps would increase over time. It also opened with a big-picture strategy that helped frame the details.

It spotted small mistakes in the video

Like GPT-4o, o3 picked up on some important nuances in the video I uploaded. It flagged that I wasn’t engaging my legs and core to create full-body tension—a small but critical flaw that makes chin-ups much harder. That’s exactly the kind of pointed feedback I get from Silvia, and it was impressive to see o3 catch it.

It made a plan close to the one I actually follow 

o3 didn’t just hit the broad strokes; it got surprisingly close to the plan I’m following with my trainer. The overall structure it suggested, including the balance between strength, volume, and rest, mirrored the kind of progression Silvia has me working on. It also included specific advice about my diet, suggesting a surplus of 150 to 200 calories per day to support strength gains. o3 scored higher than GPT-4o here, since GPT-4o’s nutrition advice was very vague.

The vibes were good, and it showed emotional intelligence

When I told o3 I was feeling tired on a training day, it responded with real empathy. Instead of pushing me to power through, it suggested a lighter active recovery session. And when I pressed it on how to decide whether to push harder or take a break, it gave me a smart, objective framework to figure it out (organized in a table per usual).

RPE is an abbreviation for “rate of perceived exertion,” which measures how easy or hard you thought a set was; HRV, or heart rate variability, is the variation in time between consecutive heartbeats.

GPT-4o also responded empathetically (and with a lot of emojis) when I mentioned feeling tired, but its advice was more general. In moments when you’re debating whether to push or rest, clear, objective guidance matters—and on that front, o3 was much better.

How 4o suggested I decide how hard to push myself.

What didn’t work

A lack of details on recovery time

Even though o3’s plan was more complete overall, it made the same omission as GPT-4o: it didn’t tell me how much rest I should take between training days. o3 gave me a great answer when I asked, but ideally a plan this comprehensive would’ve mentioned that upfront.

Why o3 is a better teacher

Finally, the verdict—fittingly delivered in a table of my own making.

Intelligence with empathy wins in the real world

So what did I learn from this experiment? Both models were impressive at designing training plans, but when it came to helping me learn, o3 was the clear winner. The key difference wasn’t just raw intelligence (though o3’s ability to craft a more realistic, sustainable plan definitely showed that). It was how o3 combined intelligence with emotional awareness. Learning something new isn’t a straight line—it’s messy, tiring, and full of small doubts and setbacks. A tool that can think clearly and respond with empathy helps you build resilience to push through the uncertainty that most real learning demands. 

And it’s not just me. Other people on the Every team have been using o3 to learn in wildly different domains. Dan has been using it to examine the craft of writers he admires. For Nityesh Agarwal, an engineer, it's understanding Einstein’s theory of special relativity. 

Whatever your interests, o3 is proving to be a powerful learning tool. Speaking of which, I’ll report back when I finally hit 10 unassisted chin-ups.


Rhea Purohit is a contributing writer for Every focused on research-driven storytelling in tech. You can follow her on X at @RheaPurohit1 and on LinkedIn, and Every on X at @every and on LinkedIn.
