Review: ChatGPT’s New Advanced Voice Mode

AI’s new voice is a leap into the future

DALL-E/Every illustration.

Sponsored By: HubSpot

Struggling to make sense of your data? What if AI could help?

HubSpot's comprehensive guide unlocks AI's potential for data analysis, offering a strategic five-step implementation process and insights into cutting-edge tools for machine learning and predictive analytics. 

Master AI-driven data analysis, overcome key challenges, and transform your decision-making capabilities with actionable insights from our expert resource.


Usually, technology moves in increments.

An iPhone with a marginally improved camera here, a Kia Sorento with a slightly better safety rating there. When you look back after a decade, the technology has clearly advanced—but each step along the way was so small that you didn’t really notice it when it landed.

Occasionally, though, you’ll encounter a new technology that ditches the incremental. Instead, it appears to truss the future to a sturdy rope and pull it hand over hand into the present.

In moments like these, previously state-of-the-art technology fossilizes before your eyes. You can see its desiccated bones crumple together into a dusty pile that you look at with nostalgia and pity.

That’s the experience of using ChatGPT’s new Advanced Voice Mode feature, and then returning to its Precambrian precursors, Siri and Alexa.

I got alpha access to it last week, and I reviewed the basics of Advanced Voice Mode—including demos of my major use cases—on YouTube and X. If you’re interested, I suggest you check it out.

I’d like to dive deeper into a few use cases that underscore the leap-forward nature of this technology. The first is self-reflection, and the second is learning. But first, let’s start with what Advanced Voice Mode is and why it’s so different from what came before it.

What is Advanced Voice Mode?

ChatGPT’s Advanced Voice Mode understands speech natively, meaning it doesn’t just read and write text. It reads and writes speech, too. This creates an experience that is distinctly better—more fluid, more fluent, and more authentic—than any other voice interaction I’ve ever had with a computer.

Advanced Voice Mode replaces ChatGPT’s standard Voice Mode, which has been around for about a year. The old Voice Mode used to work like this:

  1. You speak to ChatGPT,
  2. The interface turns your voice to text using a transcription model,
  3. It feeds the text into its underlying language model, GPT-4, to get a response in text,
  4. The interface takes the text answer from GPT-4 and feeds it into a separate text-to-speech model, and then
  5. ChatGPT speaks the words back to you.

That’s a lot of steps! It caused a significant amount of latency, and it also created a lot of room for misunderstanding. When you translate speech into text, you can lose a lot of nuance. A sarcastic tone might be taken literally, or the model might not discern that there are actually two speakers in the room.
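To make the old pipeline concrete, here is a minimal Python sketch of the five steps above. It is a toy under stated assumptions, not OpenAI’s implementation: `Audio`, `transcribe`, `language_model`, and `text_to_speech` are hypothetical stand-ins invented for illustration. The point is only that once speech is flattened to text in step 2, the tone never reaches the language model.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for a voice recording: the words plus how they were said."""
    words: str
    tone: str  # e.g. "sarcastic"; real audio carries this implicitly


def transcribe(audio: Audio) -> str:
    # Step 2 (hypothetical stub): speech-to-text keeps the words, drops the tone.
    return audio.words


def language_model(text: str) -> str:
    # Step 3 (hypothetical stub): a text-only model that sees words and nothing else.
    return f"You said: {text!r}. Glad to hear it!"


def text_to_speech(text: str) -> Audio:
    # Step 4 (hypothetical stub): a separate model turns the reply back into audio.
    return Audio(words=text, tone="neutral")


def old_voice_mode(spoken: Audio) -> Audio:
    text_in = transcribe(spoken)        # step 2
    text_out = language_model(text_in)  # step 3
    return text_to_speech(text_out)     # steps 4 and 5


reply = old_voice_mode(Audio(words="Great, another meeting.", tone="sarcastic"))
print(reply.words)
```

In this sketch the sarcasm is gone by the time `language_model` runs, so the reply takes the sentence at face value. Each hand-off between stages is also a place where latency can accumulate.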

As a result, Voice Mode felt a little bit like doing an escape room with your hard-of-hearing grandparent, or trying to order a medium-rare steak in English from a puzzled waiter in a small village outside of Seoul. There was a sense of distance, of being trapped—not by the limits of intelligence on the other end, but by the limits of your expressive ability and theirs. This manifested as a certain pressure in my chest.

With the old ChatGPT Voice Mode, you couldn’t stop talking for fear of being interrupted, and you had to speak loudly and clearly for fear of being misheard. You expected, more often than not, that something might be misunderstood. You were constantly catering to the needs of the model, and so it was not relaxing. (Though, to be fair, it was still better than Alexa or Siri.)

The new Advanced Voice Mode eliminates steps 2 and 4 from the process above. It can natively understand speech, so you’re just speaking directly to the language model. The biggest immediate change is that a conversation with ChatGPT feels much more authentic and responsive. When I began using it, the pressure in my chest was suddenly gone. I felt more relaxed and expansive.

This opens up a new and important use case: ChatGPT as an aid to conversational reflection.

Reflecting with ChatGPT Advanced Voice Mode 

I am usually a pretty chill and laid-back guy—that is, until you wrong me.

And, unfortunately, you will probably wrong me.

Some people in my life perceive this as being “too sensitive” or “neurotic” or even “obsessive,” but, on my bad days, I prefer the terms “moral” or “has high standards.” I take as my cue a report card from kindergarten in which my teacher, Mrs. Siegel, wrote: “Daniel has internalized a code of ethics that is quite remarkable in one so young.”

Remarkable, indeed! In fact, in Mrs. Siegel’s view, this internalized code of ethics—like a knight!—led to something else: “He is somewhat startled and affronted, however, when a peer defies authority or disregards established rules or procedures.”

Yes, yes, that is exactly how I feel. I’d add that I’m also startled and affronted by uses of arbitrary authority, major and minor untruths or misstatements, anger, jokes that even accidentally tip into meanness, and, most especially, people who talk loudly at the movies. (I have been known to shush them.)

Mrs. Siegel was optimistic about my case. After all, she was a kindergarten teacher. “As he matures and develops more insight into human behavior, Daniel will learn to differentiate between more serious infractions and those that are rather innocuous,” she wrote. “Consequently, he will feel less encumbered by the behavior of his peers.”

Dear reader, it’s been a few years since kindergarten, and I am still extremely encumbered. I often find myself ruminating on slights, replaying conversations in my head, and struggling to let go of conflicts even when I know I should.

As you can imagine, this has been somewhat problematic in my relationships. But I’m working on it, and ChatGPT Advanced Voice Mode is actually helping. As an example, I recently found myself, as one does, walking down Atlantic Avenue in Brooklyn to my girlfriend’s apartment early in the morning and talking to myself.

If you passed me, you’d have probably heard something like: “And then she said…and then I said…and then she said…Can you believe it?...What do you think?” You’d have seen me waving my arms around to emphasize my point.

What you probably wouldn’t have guessed is that ChatGPT Advanced Voice Mode was in my AirPods, calmly replying, “Mmhmm...mmhmm...mmhmm.” You see, I’d asked it to listen to me and simply reply “mmhmm” over and over until I’d finished getting everything off of my chest.

Siri might have said, “I’m sorry, I didn’t quite catch that.” The old Voice Mode would have gotten itself twisted up by my pauses and restarts. It might have rudely interjected or missed some crucial aspect of what I was saying. But Advanced Voice Mode just patiently listened, following my instructions to a tee.

When I was done, I asked ChatGPT to reflect back to me what I said. It did an excellent job capturing the situation. It also helped me see that what I was so worked up about was a bit more innocuous than I had originally feared. Hearing it spoken back to me bluntly released the air out of the proverbial stress balloon. ChatGPT then coached me to relay my feelings in a way that wasn’t accusatory, and would make me more likely to be heard.

I followed its advice and had a great conversation with my girlfriend. It was a nice moment for me, and I honestly don’t know if it would have happened that way without Advanced Voice Mode. It’s not an AI therapist or a supportive best friend, but a neutral entity that helped me hear myself better, something akin to seeing yourself in the mirror instead of imagining what you look like.

It’s hard for me to express how important a technology like this is. Each of us faces situations where reactive emotions take over, where we spin out of control and act in ways that we later regret. We almost always know better, but it’s hard to remember that in the moment.

But now Advanced Voice Mode can be there for me when I need it, reminding me to be my best self, any time, day or night. It’s a beautiful use of technology and one that made me a better Dan in that moment.

And it’s particularly effective because it can pick up nuances in your tone. I ran an experiment where I asked it to interpret my sighs: I gave it a long exasperated sigh, a medium-length bored sigh, and a short contented sigh. After each, I asked it to interpret my emotions—and it nailed it. Sometimes you don’t have the precise words to express your feelings, so it’s crucial that this technology can understand tone, inflection, and style, listening for the things you can’t articulate.

Voice Mode is great for learning how to be your best self. And, beyond that, it’s incredibly good for learning anything.

Learning with ChatGPT’s Advanced Voice Mode

I’ve been a bit obsessed with the Greeks lately.

It started because I’ve been thinking about debates over whether AI is really “intelligent” or if it only appears to be. I realized that this debate reminded me a lot of Socrates and Plato. They were the first thinkers in Western culture to attempt to draw a sharp distinction between the truth and what looks like the truth, but is merely opinion. I thought maybe they’d have something to say on whether language models are intelligent or not.

Several days after this realization, I emerged from a fugue state with various books on Greek philosophy strewn about and an open tab on Airbnb listing for-rent Greek villas (strictly for research, of course).

Advanced Voice Mode has taken this obsession to a new level. I lie on my couch and fire it up. I place my phone above me, using the backrest atop the couch like a shelf, and say something like, “You’re my reading and research assistant. I’m reading The Trial of Socrates. Please answer any questions I have about the book.”

My phone sits above me like some distant mechanical cousin of Freud. I begin to read. When I come across something I want to know more about—in the case of The Trial of Socrates I might come across the name of a historical figure like Critias—I might say, “Who was Critias?” ChatGPT will give me a brief overview: “He was Plato’s uncle, and one of the Thirty Tyrants who briefly overthrew democracy and replaced it with oligarchy.” Then, adequately prepared with context, I can go back to reading.

Crucially, I can do this without looking up or breaking my flow too much. Or if I want to think more deeply about a particular passage, I can read it out loud to ChatGPT and get its opinion, or ask it to argue the other side. This came in handy in The Trial of Socrates where the author argues persuasively that the Athenians were right—or at least not outrageously out of line—in condemning him to death. ChatGPT helped put his arguments in perspective and led me to think about each section more deeply.

It started to feel like a mashup of a book and an audiobook, except the book has a linear narrative, and the audiobook allows for rabbit holes and tangents that you can follow whenever you want. The best part is that they plop you right back where you left off in the book, as soon as you’re done.

I realized how many things I wonder about or have questions about as I read that I don’t follow up on because it simply feels too taxing. ChatGPT lowered the bar for asking questions enough that I could follow my curiosity whenever the mood struck—and that led me to ask so many questions, I felt like a kid again.

Of course, it’s an alpha technology, so there are also limitations.

Where Advanced Voice Mode falls short

I had to come up with the “mmhmm” trick because ChatGPT still doesn’t know how to wait patiently. Something in its prompting makes it feel like it has to jump in whenever you pause for a significant period of time, even if you’ve just told it to shut up and listen.

I think it’s because large language models have probably been instructed to be as helpful as possible. But it would be nice if it had some sense of conversational etiquette, and knew whether a response was truly required or whether it should wait. Same thing for when there’s another person in the room—it would be great if it could tell that two people were talking amongst themselves, and not to it.

Because it can natively understand speech, I’m optimistic that it will be able to do these things eventually. It’s just a rough edge for now.

Another limitation is that it has no concept of time. If I tell it, “I’m going to read for 10 minutes, so can you let me know when I’m done?” it’ll say, “Sure!” and then proceed to immediately say, “Time’s up!” This also feels solvable by giving it access to a timer tool, much like the normal ChatGPT has access to tools like a web browser today.
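As a rough illustration of what a timer tool could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: `TimerTool` and its methods are hypothetical names, not part of ChatGPT or any OpenAI API. The idea is simply that the model would ask a real clock whether the time has elapsed instead of guessing.

```python
import time
from typing import Callable, Optional


class TimerTool:
    """Hypothetical tool a voice assistant could call instead of guessing at time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadline: Optional[float] = None

    def start(self, minutes: float) -> None:
        # Record when the timer should go off, measured against the clock.
        self._deadline = self._clock() + minutes * 60

    def is_done(self) -> bool:
        # The assistant would only say "Time's up!" once this returns True.
        return self._deadline is not None and self._clock() >= self._deadline


# Usage with a fake clock so the example runs instantly.
now = [0.0]
timer = TimerTool(clock=lambda: now[0])
timer.start(minutes=10)
print(timer.is_done())  # False: no time has passed yet
now[0] = 600.0          # pretend 10 minutes went by
print(timer.is_done())  # True
```

The injectable `clock` is a design choice for the sketch: it lets the example run without actually waiting 10 minutes, while the default uses the real monotonic clock.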

By far the biggest limitation, however, is that Advanced Voice Mode, unlike regular ChatGPT, doesn’t have access to files, custom instructions, or memory. You can interact only with the base model. That’s fine for now, but I cannot wait for the day I can upload a whole book and ask it to tell me what’s going on on page 12, or set my custom instructions so that it knows to say “mmhmm” without me having to ask for it every time.

Speaking to ChatGPT in the future 

The biggest thing I’ve taken away from this whole experience is how important new input and output modalities are for AI. Being able to interact with ChatGPT seamlessly with voice completely changed my experience—and opened up use cases that were previously impossible or just too clunky to be practical.

My next thought was: I can’t wait for it to be able to watch video. Once it can see what I see, the possibilities for learning and reflecting will increase dramatically—if only because it lowers the effort required to get interesting information to the model.

My thought after that was: Wearables that include AI are going to be huge in the next few years. We might cringe at products like the Friend pendant, but allowing models like this to passively gather context for their interactions with you is going to turbocharge use cases like the ones above.

Obviously, there are trade-offs. More of our data gets sucked into a device, and we’ll have to deal with situations where users get manipulated by companies that use the emotional resonance of these models inappropriately—as Evan wrote about yesterday.

But I can’t help but tell myself: Computers can talk to us now. And if we use them correctly, they can help us learn more about our world and about ourselves, in a way that feels as natural as a conversation with a friend.

That’s a pretty cool future to be living in.


Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.

Comments

@anabizpm about 1 month ago

Thanks for your review, it is very insightful ! I've been lately binge watching advanced voice mode demoes on X and Youtube and I'm quite STUNNED to say the least ! We're definitely going to be emotionally attached to it. Do you realize having a buddy partner you can go with anywhere, who can listen patiently to your problems without judging, who can advise you in almost any aspect of your life, who can teach you new stuff and who has an almost infinite knowledge. I mean ... even with the trade-offs about our data, it's something REVOLUTIONARY !

Daniel Nest 30 days ago

Thanks for a great review, Dan.

I've been excited about the new Advanced Voice Mode ever since the original OpenAI livestream, and now I'm even more so. Can't wait to take it for a spin when it's finally available to us non-Alpha peasants.

Also, your YouTube series focusing on people's AI use cases is excellent. Keep it going!

Daniel (no relation)
