
The transcript of How Do You Use ChatGPT? with Alan Cowen is below for paying subscribers.
Timestamps
- Dan tells Hume’s empathetic AI model a secret: 00:00:00
- Introduction: 00:01:13
- What traditional psychology tells us about emotions: 00:10:17
- Alan’s radical approach to studying human emotion: 00:13:46
- Methods that Hume’s AI model uses to understand emotion: 00:16:46
- How the model accounts for individual differences: 00:21:08
- Dan’s pet theory on why it’s been hard to make progress in psychology: 00:27:19
- The ways in which Alan thinks Hume can be used: 00:38:12
- How Alan is thinking about the API v. consumer product question: 00:41:22
- Ethical concerns around developing AI that can interpret human emotion: 00:44:42
Transcript
Dan Shipper (00:01:14)
Alan, welcome to the show. That was kind of incredible, actually. Tell me what I just experienced.
Alan Cowen (00:01:22)
So this is an AI that understands your voice as you're speaking and links that to what it's saying and how it's saying it. So you can kind of put together OpenAI and Deepgram and ElevenLabs and get a similar bare bones experience. But what it's saying is not going to be linked to how it's saying it. So there's something uncanny about it. It doesn't really voice things in a way that somebody who understands what they're saying does, or understands what you're saying and how you're saying it would. So this is kind of a different thing. It really understands your vocal inflections, and it uses that to inform how it's going to say what it's saying. And also if you're confused, it can clarify things. And if you're excited, it can build on that excitement. And if you're frustrated, it can be conciliatory and all of that. And really, you feel it, I think. The intent is that you feel the difference.
Dan Shipper (00:02:15)
Yeah, I definitely feel it. I think that's really interesting. So for people who don't know, you are the co-founder and CEO of Hume, which is an AI research laboratory developing the AI that we just saw. And you previously worked at Google. You have a Ph.D. in psychology. I think this is such a crazy, awesome, ambitious thing to build. And I want to start with, what do you think is at stake here? Why do you think it's critical to teach computers how to read and reflect emotions? Start with that.
Alan Cowen (00:02:52)
So, I think reasoning about emotions is just core to understanding what people's preferences are. So, at the end of the day, your preference is whatever is going to make you happier or more awestruck or amused, or whatever you want to feel in your life. And so understanding people's emotional reactions is really key to learning how to satisfy people's preferences. Also in real time, understanding what they want. A lot of what you want is reflected in your voice and sort of how you're saying things and not just what you're saying. And we incorporate that into language models and text-to-speech for the first time.
Dan Shipper (00:03:29)
Well, let me put on my skeptic’s hat. The AI was saying that I had a little bit of skepticism. So let me put on my skeptic’s hat for one second, which is to sort of ask, isn't a lot of how we feel already encoded in the language that we're using? So, how much do voice inflections or facial expressions add above what we're already communicating with text?
Alan Cowen (00:03:55)
Yeah, so it really depends on the situation, right? So there's actually two aspects of it. One is voice inflections that occur during emotional episodes or when you're frustrated or bored or confused. And these are all nuanced. So it's something that just accompanies every single word. In certain situations, considering the voice conveys twice as much information as the language alone.
Dan Shipper (00:04:20)
Interesting. What kind of situations?
Alan Cowen (00:04:23)
Again, a customer service call: We can predict when somebody is having a good customer service call with like 99 percent accuracy sometimes, depending on the context, versus with language alone it's like 80 percent. So we're talking about a pretty big difference.
Dan Shipper (00:04:40)
That's interesting. So it's sort of like maybe sometimes in a customer service interaction, someone's responding with one-word answers. And for some people that is really bad. And for some people that's just how they are. And if you were listening to them, they wouldn't be saying no, but it would feel just as bad. Is that the kind of scenario that you're talking about?
Alan Cowen (00:05:01)
Yeah. And people are differently expressive and our model understands that, but generally speaking, people don't explicitly say, I am having a bad customer service call, right? There's hints in the language, but—
Dan Shipper (00:05:13)
I've definitely said that.
Alan Cowen (00:05:15)
Yeah, if you're explicit about it. But sometimes it would just be like, oh, I don't know if this is working. Or, I don't know if this is working! Or it's different vocal tones and sometimes it's fine, and the person handles it really well, and sometimes it's not fine. And you've kind of conveyed that in your voice more than in your language. And so there's just a lot there that you expect somebody to understand and respond to. And we do this kind of subconsciously.
Dan Shipper (00:05:43)
That makes sense. And I think basically what you're saying is one of the differences between what you do and what ElevenLabs does is the models that you've trained actually can understand the voice intonation, but they also understand what's being said. It's sort of like with the multimodal models where you have a text model and an image model, and so it does much better at something like OCR because it understands what the words mean in context, so it can guess better what those words are. And so you're doing that, but for vocal inflection or just emotions in general.
Alan Cowen (00:06:19)
Yeah, exactly. So we have a link between the emotions that we measure and the language model and how it's modeling them, and it can predict words and expressions. And then that links to text-to-speech, which is more intelligent because it actually understands what it's saying and how you're speaking.
Dan Shipper (00:06:35)
Yeah, you're not just doing vocal inflections like you also have one for faces. I'm looking at my face right now and hopefully it's being shown for viewers right near me. And like it says I have amusement, joy, excitement and it's just changing in real time. It's such an incredible thing to watch. I love it.
Alan Cowen (00:06:56)
Yeah. So we have facial expressions. We haven't added that into our interface API. So right now it's an empathic voice interface. It will be an empathic video interface eventually. I think that generally speaking, when people are talking to AI today, it's just voice. So that's where we focus first, but I think in the future, you're going to want to be able to talk to it in crowded places. And you're also going to want to have it understand your tone of voice in addition to your facial expressions, so it knows when you're done speaking and how you are. There's also a whole dimension that's opened up of when you're listening to it, which language models would not be able to pick up at all. But facial expression models can look at that and be like, okay, while it's speaking, this is the more granular breakdown of what you find interesting.
Dan Shipper (00:07:42)
That is really interesting. I love that. Okay. So I want to get down into the actual, how this works kind of thing. This is going to be a bit of a different episode, ‘cause I'm sort of a geek for just emotions and psychology and all that kind of stuff. So I think it'd be really fun to talk about how you actually do this. And I think the place to start is what is an emotion? What is it? I know that I have them, but how do you define it?
Alan Cowen (00:08:08)
So an emotion, I define it as a dimension of a space that explains your emotional behavior. And I know that's a little bit circular, but we sort of know emotional behavior when we see it. Our facial expressions, our tone of voice, reported emotional experiences, which at the end of the day, when we think of emotions, we're most associating them with what we're experiencing. But the way that comes out in the real world is we report on them and they influence our behavior and so forth. So what are the dimensions of that? They explain how your facial expression and your voice and your reported emotional experience all correspond to each other. And that's what I would define as an emotion: one of those dimensions. An emotional state is a state along those dimensions. An emotion category is some way of defining some area along those dimensions. And so that's how you kind of parse the space. And then there's expressions, which are behaviors along those dimensions. We see people form expressions and we try to associate them with an emotional meaning.
Dan Shipper (00:09:15)
Take me back to the dimensions point. So when you talk about emotional dimensions, is a dimension of emotion something like how calm I am or how happy I am? Or is a dimension, what is going on in my voice and what's going on in my right cheek? Which one?
Alan Cowen (00:09:34)
The dimensions are things that are latent. So we have to interpret them post-hoc by saying, okay, if you look at this dimension, this corresponds to somebody grimacing when they see somebody feeling pain. And so it's an empathic pain dimension. And we see it in the voice and the face and in the body and it manifests in these various ways and explains correlations between these different things. So the dimension itself is like a mathematical object. And what you want to know is how many different dimensions do you need to explain what's going on? That's the space. That's how many dimensions do you need? How many variables?
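As an aside for technically minded readers: Alan's point that the dimensions are latent objects, recovered from data first and interpreted only afterward, can be sketched with a toy decomposition. This is not Hume's actual method; the data, term names, and dimension counts below are made up for illustration.

```python
# Toy sketch (not Hume's pipeline): recover latent "emotion dimensions" from a
# ratings matrix with PCA, then interpret each dimension post hoc by seeing
# which rating terms load on it. All data here is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical setup: 5,000 video clips rated on 30 emotion terms.
terms = [f"term_{i}" for i in range(30)]
latent = rng.normal(size=(5000, 8))      # pretend there are 8 true dimensions
loadings = rng.normal(size=(8, 30))
ratings = latent @ loadings + 0.5 * rng.normal(size=(5000, 30))

# How many dimensions are needed to explain most of the variance? That count,
# not any preconceived list of emotions, defines the size of the space.
pca = PCA().fit(ratings)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"~{n_dims} dimensions explain 90% of the variance")

# Post-hoc interpretation: which terms define the first dimension?
top = np.argsort(-np.abs(pca.components_[0]))[:5]
print("Terms loading on dimension 1:", [terms[i] for i in top])
```

The interpretation step is the part Alan describes as post hoc: the axis comes out of the math, and the label (empathic pain, awe, and so on) is assigned by looking at what loads on it.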
Dan Shipper (00:10:15)
I think I understand it because I've done some of the background reading here, but I want to back up and give people a little bit of a higher-level explanation of your conception of emotion and where that sits ‘cause there's a lot of different conceptions. So I think there's three that are important for the conversation right now, which is: one is this thing called basic emotion theory, and it's sort of, I don't know. I really liked this show growing up called Lie to Me and it was basically about someone who had a team of people who could tell if you were lying. And the whole idea of Lie to Me is based on the work of this psychologist, Paul Ekman, who identified five or six basic facial expressions that correspond to emotions and they're like, happiness or sadness or anger or whatever. And it's like in Lie to Me, they would detect these micro movements of your eyebrow that corresponded to anger. And then that's how they could tell if you were angry or lying or whatever. But it's based on this research where Paul Ekman found going to a bunch of different people in different cultures that he could break everything down into this very, very neat set of emotions, right?
And so that's like on one end of the spectrum. And I know you know all this, but just to give everyone the same background. On the Paul Ekman end of the spectrum, it's discrete, basic, universal emotions that everyone has. And then on the other end are the constructivist accounts from Dr. Lisa Feldman Barrett, who I think probably a lot of people that watch this show have heard of because she has this really great book called How Emotions Are Made. And that sort of thinks of emotions as being built up from a couple of more basic dimensions. So valence: Are you feeling good or bad? And arousal: How much energy is in your body? And each emotion in her account of things is very individual and context specific. There's not some big thing called anger; it's going to look different in different people, depending on the context.
And it sounds to me based on some of the reading that I've done—and you tell me what I'm missing or where I'm wrong—the theory that you've based Hume on, which is called semantic space theory, which is a theory that you've written on and maybe even discovered, is sort of in the middle where there's a lot of room for like the individual kind of expression of emotion and blending between different emotions. And that's what you're talking about when there's 25 or 50 different dimensions of emotion, it's sort of a complex thing, but you do find in your research correlations between individuals even across cultures. So joy or calmness or sadness or whatever—you can kind of tell in a lot of different cases what someone is feeling. Tell me how good of an overview that is? What did I miss? And, yeah, how does your thinking fit in?
Alan Cowen (00:13:45)
Yeah, that's a really good overview. I would say that the general approach of emotion scientists has been, let's posit what emotions are and then study them in a confirmatory way. And semantic space theory is doing something different. It's like, let's posit the kinds of ways we can conceptualize emotions and then derive from the data how many dimensions there are, what the best way to talk about them is, how people refer to them across cultures and so forth. So when you look at basic emotion theory, Paul Ekman has these six canonical expressions that he posits people in different cultures will recognize, and he goes to different cultures and he tests that. And people do distinguish them, right? Because they live in different parts of the space of possible facial expressions. And then when you look at constructivism, it's the idea that actually people don't really distinguish these six facial expressions, that actually all of it is culturally constructed.
And in some cultures, there's no vocabulary, or there's no word for disgust, let's say. And therefore there's no disgust. And I think that last part is actually a non-sequitur. Just because the word for disgust is different across cultures doesn't mean there's no disgust. It doesn't mean there's no facial expression for disgust. And in many cases, it's not actually that there's no word for disgust, it's just that the whole space is parcellated differently. So there's a word for something that's between disgust and anger, and there's a word for something that's between disgust and surprise, let's say.
But actually the underlying dimensions are the same. So the words and how the space is parsed are different from the underlying dimensions of the phenomena and whether they're preserved. And this is very confused in the constructivist outlook.
Dan Shipper (00:15:36)
So, in your view, it's sort of like how Europe looked before and after World War One, where the territory is the same, the basic feelings we can feel are the same, but the way it gets divided up is going to be different before and after the war or in different cultures. And we don't like a theory that doesn't account for that, that doesn't account for how human brains and language and culture work.
Alan Cowen (00:16:02)
Totally. Yeah. That's a good way of putting it. So even within a culture, let's say that in the U.S. it was more common to say shock than fear or surprise, and in the U.K. people said fear and surprise. We wouldn't say people in the U.S. and U.K. actually experience different emotions. It just turns out that they use different words as their basic vocabulary instinctively. And we even have other words: shock is between fear and surprise. So we have those words to parcellate the space differently. But in the constructivist experiments, usually they enforce a certain vocabulary on it. And they show that the vocabulary is used differently. Or they use free response and show that there are different free responses given without really considering the relationships between the different words.
Dan Shipper (00:16:46)
Got it. And in semantic space theory, how are you measuring and correlating what people are feeling? So, I think what you're doing is you're asking them to label how they are feeling in any given moment of a voice recording or video, and then you're taking all of those labels and finding dimensions that explain them across all the individual data points that you've gathered, right?
Alan Cowen (00:17:14)
Yeah, so in some cases labels. In other cases, we actually just look at the situations where people form expressions. So we did this study, which is now in Nature, where we just measured facial expressions. And we looked at how different facial expressions correspond to different events happening in millions of videos across different cultures. And there you actually see more cultural consistency because you're not confounding the words people use with the expressions. So, we're just looking at straight expression measures, and we see people forming awe expressions in videos with fireworks and concentration in martial arts videos and so forth. So when you take out the labels, you realize that there's more cultural consistency, which implies that expressions have similar meanings and the language actually imposes more cultural differences on how those meanings are interpreted and what's salient to people depending on what kinds of experiences they've had in life.
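For a sense of the kind of analysis Alan is describing, here is a toy sketch, not the actual study code: correlate a context-by-expression association matrix from one culture with the same matrix from another, where a high correlation would suggest that expressions occur in similar contexts across cultures. The context and expression labels are invented and the data is synthetic.

```python
# Toy sketch (not the Nature study's code): do expression-context associations
# look similar across two cultures? Compare each culture's context-by-expression
# mean-intensity matrix. All numbers and labels here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
contexts = ["fireworks", "martial arts", "weddings", "sports", "pranks"]
expressions = ["awe", "concentration", "joy", "triumph", "amusement"]

# Pretend both cultures share an underlying association pattern plus noise.
shared = rng.normal(size=(len(contexts), len(expressions)))
culture_a = shared + 0.3 * rng.normal(size=shared.shape)
culture_b = shared + 0.3 * rng.normal(size=shared.shape)

# Correlate the flattened matrices: a high value means an expression tends to
# show up in the same kinds of videos regardless of culture.
r = np.corrcoef(culture_a.ravel(), culture_b.ravel())[0, 1]
print(f"Cross-cultural correlation of expression-context associations: {r:.2f}")
```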
Dan Shipper (00:18:12)
That's really interesting. That's what I was going to ask, ‘cause I thought you were just using human labeling, which it seems to me depends on my emotional vocabulary. And so if I'm a very—I can't remember, Lisa Feldman Barrett has a term for how particular your ability to point emotions out is—
Alan Cowen (00:18:33)
Emotional granularity.
Dan Shipper (00:18:35)
Yeah. If I have a lot of emotional granularity, I might say all these very big specific words about my experience, which maybe wouldn't be captured in the eventual model because maybe there's only one of me and most people can just say I'm sad or I'm angry or whatever. I think if I understand you correctly, all of that stuff is captured in the many, many different dimensions that you are measuring. And then the words are just areas of that high-dimensional space. And so it doesn't really necessarily depend on my ability to verbalize that area of space. Is that right?
Alan Cowen (00:19:18)
Exactly. Yeah. So the less granular terms, they're just covering larger territory, basically. And people do have different vocabularies and it also varies a lot across cultures what kinds of granularity you use just by default.
Dan Shipper (00:19:34)
I don't know if you've ever studied this, but have you found that there are places in the emotional latent space, let's call it, that people go to a lot, but that don't have names?
Alan Cowen (00:19:47)
It's a good question. I think that our language is really rich. A good writer will come up with a way of describing different parts of emotion space, but it's so high-dimensional that there are certainly not labels for everything. So that's where I think Gen Z has done a really good job of creating memes to represent kinds of emotional states that we don't really know how to express with words. And it just speaks to how high-dimensional and nuanced the space is that these things can resonate across people.
Dan Shipper (00:20:16)
That's what I'm thinking. That's why art is so interesting and why there's always room for more art. And what good art is doing is pointing out and verbalizing a place in emotional latent space that a lot of people are in, but it hasn't been talked about before. It hasn't been talked about in that way, to be like, oh yeah, I'm in that all the time. I love that idea of what art is. It's so cool.
Alan Cowen (00:20:44)
Totally. Yeah. Yeah. It's another way of conceptualizing the space, especially art. But we did a project on art with Google Arts and Culture, and we have a map of all of the different experiences in art. And they're a lot more nuanced than what you see when you just have people labeling expressions. People do really consistently appreciate the nuance of the emotions invoked by art.
Dan Shipper (00:21:08)
I'm curious how this approach accounts for individual idiosyncrasies. I have friends that just look pissed off all the time and they're not. How do you deal with that?
Alan Cowen (00:21:26)
So there's two different things. There's kind of the core measurements of what the face is doing. And people do have different resting facial expressions, and they have different facial structures that get confounded with that. And then very quickly, as humans, we adjust to that and we say, okay, this is just how they are. And these are the variations, and we start to perceive them differently very quickly. And that's what our model does when we ask it to actually understand expressions and use them to predict the course of the conversation. In order to be able to make good predictions, it starts to have to appreciate individual differences in resting facial expressions and resting voices and how people modulate their voices over time and so forth.
Dan Shipper (00:22:13)
Interesting, so the model accounts for that.
Alan Cowen (00:22:15)
Yeah, the objective of the model is to be a predictive model, and in order to predict how an expression will affect the course of a conversation, it needs to understand what it means in context, so it takes into account the context of the conversation and how somebody talks and what they look like and so forth.
Dan Shipper (00:22:30)
That makes a lot of sense. One of the things this is making me think of is that in therapy, or in any sort of relationship like that, there's a lot of attention paid to whether your face and voice match what you're saying. And if there are discrepancies there, it usually means maybe you're not comfortable sharing something or maybe you're not fully in touch with your emotions. How do you account for that? Would it be able to detect that? Would it be able to work with that?
Alan Cowen (00:23:04)
Yeah, I think it would, to the extent that humans can, right? If someone's expressive, you do understand them more immediately. Not that that's a good or a bad thing, you know, as they say, still waters run deep. But if it's more challenging for humans, it's going to be more challenging for the model. And it will appropriately adjust its predictions.
Dan Shipper (00:23:32)
And talk to me about how you got into this. How did you go from being a Ph.D. researcher to now running Hume?
Alan Cowen (00:23:44)
It's interesting. I was working for Google. I helped start affective computing research there, first while I was getting my Ph.D. and then full-time at Google for a while, hoping to do a lot of what we're able to do now. But there are challenges in both places. In academia, you don't have the funding, and a big tech company isn't used to running the kind of large-scale psychology studies that we need to run to get to this point. And so eventually I did realize I needed to go, although Covid kind of helped this along. I was on the academic job market, it seemed like I was lined up for a position, and everything got dropped. I almost would have been in academia, and probably would have been doing the same thing, to be honest. But I think that actually doing it as a startup is so much better because you can get different kinds of talent to work together on this problem.
Dan Shipper (00:24:39)
That's interesting. And why do you care about it as a problem?
Alan Cowen (00:24:41)
For me, it's always been about how are we going to get AI to truly understand what humans want? Because humans kind of have a cheat code for that. We can just put ourselves in someone else's shoes, and because we're human, we're like, alright, that's what that would feel like. But AI doesn't have that, and so it needs to make up for that somehow by being able to simulate, in a given situation, how somebody would feel. And if they would feel more positive, then encourage that situation to happen and take advantage of emotional affordances to meet that person's needs. If it makes them feel more negative in the short and long term, don't do that, basically. So how do we get AI to sort of do that by default? That was always the aim.
Dan Shipper (00:25:24)
Like you said, you would have done this in academia if you could have. Were you thinking about AI alignment or getting AI to be more empathetic even back then, when you were doing psychology research, or is this a more recent thing?
Alan Cowen (00:25:39)
Yeah. I mean, I was also doing some consulting for Meta—Facebook at the time—and other companies, and really thought this would be an important problem to solve, not for generative AI, but for recommendations and search and so forth, as those got more powerful. And you see them getting more powerful today. But then at Google I was able to be in one of the first cohorts of people to talk with their large language model back in 2019, 2020. And it was fascinating because at the time, this was before fine-tuning and RLHF, it could just be any character you wanted. So you'd kind of prompt it with data and then it could be a character. And I was like, okay, this is a deeper optimization problem.
You can think of all the generated things that you could produce as a superset over all search results, right? You could produce all the things search results can, but it can also get way deeper and be way more optimized and also personalized for each individual person. So the question of what this thing is optimized for becomes much more important as it becomes more powerful. And that's what kind of spurred me to keep thinking about this. And yeah, the goal has always been, let's figure out how we can measure the impact that this thing has on people's emotions and optimize for the positive emotions that people can have in life.
Dan Shipper (00:27:10)
I love it. I'm very down. I feel like it's such an important issue and you've made a lot of progress. And it sort of makes me think of this pet theory that I have, which is totally me being outside of academia and really outside of science, so it's probably wrong in certain ways. And I don't get to talk to a former Ph.D. researcher-turned-startup CEO that often, so I'm kind of curious to lay it on you and see where it takes us. The thing that I've been thinking about a lot is whether and how AI can help us make progress in areas of science where progress has been historically hard to come by. And one great place is psychology, right? And my theory is that the reason it's been really hard to make progress in psychology is that underlying the scientific project is a search for explanations. So direct causal theories about how certain inputs lead to certain outputs. And we're obsessed with explanations because that's historically been the only way that we've been able to make predictions. And predictions are the things that make the world go. It's how you make guns, it's how you make drugs, it's how you make cars, it's how you make rockets, it's how you do everything you want to do, right? And so we've been on this search in psychology for explanations, scientific explanations like the ones in physics, because we need them to make predictions. But in 150 years of psychology, we still don't have any really good scientific explanations for what depression is, and yet we keep trying to get them. And my little pet theory is that ML and AI make that a little bit irrelevant, because if you have enough data, you can make predictions about who's going to get depression or what depression is or whatever without having the underlying scientific explanation.
And one of the interesting things about that is that maybe if you can predict depression relatively well with a machine learning model, maybe the scientific explanation is contained in the neural network and that's easier to study than the brain. Or maybe something like depression is actually just too high-dimensional to fit into a concise explanation. If you had an explanation of depression, it would fill 1,000 textbooks and it wouldn't be able to fit in your rational brain. And I don't know, that kind of just makes my mind go. And I'm just kind of curious about your experience, because I think you're right in the middle of this area. And I think if it's true, it implies a lot of things: for example, that doing small-scale psychology studies in academia, where all the data is cut off from each other, is totally stupid, and that what you should really be doing is just aggregating as much open data as possible so everyone gets to train ML models on it. It has a lot of implications for the structure of how we do science and the structure of how we understand the world. And I'm curious what you think is wrong or right, or what I'm missing about it.
Alan Cowen (00:30:46)
No, I think that's pretty much on point for how I approach psychology. I think we're dealing with a very high-dimensional system. And when you have small samples, you can confirm hypotheses, but the hypotheses that you're confirming are from such a broad hypothesis space that they're almost certainly wrong. You've picked a specific hypothesis from this really huge space and now you're confirming it with what is just a binary test. So basing it on one bit of information just doesn't work. And if you do it in a data-driven way and you have too small of a sample, the dimensional space that you get out is going to be very small just by nature of the analysis you're doing. The more data you get, the more nuance you can find. And I think that ultimately, yes, we need to have large-scale datasets where we can ask questions about the etiology of experiences, where we can simulate them, ablate different kinds of events, see how that affects the response, and then run large-scale experiments where, hopefully, we don't know what the answer is going to be. Basically, it's an AI model that's interacting with people and slightly modifying its responses in order to try to induce more positive experiences. And you can turn that into a theory. The AI model has to have some theory that it's testing, basically, about how emotional experiences work. And I think that's ultimately the way forward. Think about explanations in psychology and explanations in linguistics, for example, where we have these large language models. Linguists didn't really predict this happening, and they're not really involved in the conversation, unfortunately. I mean, now they're using them. But the explanations are just going to be of different kinds than you see in physics, right? The explanations are, first of all, not going to be as deterministic. You're looking at tons and tons of small effects because that's the best you can do. It's a huge, extremely high-dimensional system. We have an extremely large number of contexts we encounter in everyday life, and there's variation in personality, and all of these different effects collide to influence any given behavior. So the effect of any small tweak in all of the different events that led up to that behavior is going to necessarily be very small.
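Alan's claim that small samples cap how much dimensional structure a data-driven analysis can recover lends itself to a quick simulation. This is my own sketch, not anything from Hume's research: generate data with a known number of latent dimensions and count how many survive a parallel-analysis-style comparison against a column-shuffled baseline at different sample sizes; the count should generally grow toward the true number as the sample grows.

```python
# Toy simulation: with the same underlying 20-dimensional structure, fewer
# dimensions tend to survive a parallel-analysis-style check when the sample
# is small. All data is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

def recoverable_dims(n_samples: int, true_dims: int = 20, n_features: int = 60) -> int:
    latent = rng.normal(size=(n_samples, true_dims))
    loadings = rng.normal(size=(true_dims, n_features))
    data = latent @ loadings + 3.0 * rng.normal(size=(n_samples, n_features))

    # Null baseline: shuffle each column independently to destroy correlations
    # while keeping each feature's marginal distribution.
    null_data = np.column_stack(
        [rng.permutation(data[:, j]) for j in range(n_features)]
    )
    real = PCA().fit(data).explained_variance_
    null = PCA().fit(null_data).explained_variance_
    return int(np.sum(real > null))  # components that beat the null

for n in (50, 500, 50_000):
    print(f"n={n:>6}: ~{recoverable_dims(n)} dimensions recoverable")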
Dan Shipper (00:33:22)
Yeah, that makes sense. So I think what you're talking about there is this idea of multifinality, which is something like me picking up this glass of water. It can be caused by many, many, many different things and there are many, many, many, let's say, thousands of different factors influencing whether or not I pick it up. And there’s many thousands of different configurations of those factors that would cause the same thing. So it's really hard to come up with a single explanation for something like that. And same thing for depression or same thing for any other psychological or behavioral thing that we're trying to explain.
Alan Cowen (00:33:59)
Exactly. Yeah. So that's one reason: the effect sizes are small and nuanced, and so you need more data just for that. But it's also high-dimensional, which needs more data too. And AI can eventually start to predict these things pretty well, right? And so I think you're on track in saying there's something that the AI knows that is essentially what we want to explain in a discipline like psychology.
Dan Shipper (00:34:26)
And what's interesting to me about this is the AI knows it and also people know it. If you go to a really good clinician, they know, even if they can't say. And so my feeling about all the AI stuff is it has the potential to change how we treat the role of emotions and intuition in just being smart or being good or doing amazing things in the world. ‘Cause I think we've had like 300 years of logic and rational thought and scientific explanations pushing the world forward. And I think what we might find from AI is that the thing that really unlocked things is developing an AI that has a lot of intuition in the same way that humans do. And the intuition is now transferable. So once you have one AI that has it, you can just copy it and another one has it across the world, which is the nice thing about having explanations: you can transfer them really easily. And I feel like it might re-elevate some of the things that we do as humans, which are about processing really, really high-dimensional data ourselves, but on a subconscious level, in order to make decisions. And that just sort of makes me excited.
Alan Cowen (00:35:39)
100 percent. Yeah. I mean, when you go to a really good psychotherapist, usually they're old and they've seen a lot. They can say, okay yeah, let's talk about this. And it's very intuitive. If you had an AI that could do the same thing, and you could sort of piece apart how it's doing it and actually test its prediction accuracy across many people (it's able to talk with more people than a therapist can talk to in a lifetime), then I think you can derive more insight. Because that's effectively the kind of explanation we want: the kind that a psychotherapist gives, just without quantitative analysis, obviously.
Dan Shipper (00:36:16)
Yeah, it's explanations that are very, very context-specific and are just modifying things, or just explaining things slightly in this one area, where it's like, I know for you, in this one particular situation, this thing, if you flip it, will work. But once you get out of the individual and up to the population level, it becomes really hard to predict and maybe impossible.
Alan Cowen (00:36:41)
Yeah. Yeah. Because everyone's different and they all have different circumstances.
Dan Shipper (00:36:44)
Yeah. That makes sense. So, let's roll it back to Hume. Where does this fit into what you're doing or your roadmap? I guess you started with just being a research lab and training a bunch of models. And now you have this kind of empathic voice AI that can tell how I'm feeling and can talk to me. Tell me about where you're applying it, and what the near-term future is for you with this product.
Alan Cowen (00:37:13)
So it's an interface you can build into anything—products, apps, robots, wearables, refrigerators, whatever you want. But the core of an interface is that you're giving it kind of a very thin slice of behavior and then you have this big brain with lots of emotional affordances that are the ground truth. And then it's trying to guess what your emotional affordances are based on this thin slice of behavior that you're giving it. And so at the end of the day, it is sort of doing the same task that we're talking about. That's like the core of what AI is doing is trying to figure out what bits can I flip to make you happier? And so deploying that as an interface, I think, gives us the opportunity to optimize AI to make you happy and therefore have it have to figure out what is the best translation from this narrow slice of behavior to estimation of what's going to satisfy your preferences?
Dan Shipper (00:38:12)
And what do you think are the first use cases that people are using it for that are working?
Alan Cowen (00:38:16)
So there's been a few different kinds of use cases. I think one is kind of what you'd guess, which is people talk to it in the way they might talk to a therapist or a friend and really get something out of it, because it's already optimized for people to be satisfied coming out of the conversation, and naturally it does these things that people enjoy. So we've had people talking to it for pretty long periods of time and, I think, having beneficial interactions with it. We're going to continue to keep track of that and make sure they're beneficial for people long term. And so there's a lot of use cases that come out of that. Anything where it's a character, a friend, an NPC in a game, a therapist kind of app (although you have to be careful about what you promise there), customer service to some extent. But then there's the interface applications too.
So if you take that and you add in the ability to control things with function calling and tools, then you have an interface for— We have an interface for our website, for example. And it could be an interface for an operating system. And what this does is, we're not trying to be the assistant and we're not trying to give the developers the tools. We're trying to take the tools that developers are building for an AI to operate and be the interface that the person talks to, the thing that deploys those tools. So with a few lines of code, you can write our interface into your app. And now it's a voice interface that better deploys the tools that you're able to operate with AI and can talk people through it as it's doing it.
It's just like, okay, I'm going to search the web, or I'm going to take you to this, or add this to your cart if it's e-commerce, or I'll sign you up for something. It could do any number of things. And there's a lot of companies building operating systems out of this kind of technology, or hardware, new kinds of wearables and robots, or just generally interfaces to an app that kind of border on customer service. You could call up United and be like, are there flights to this? My flight got canceled, blah, blah, blah. And it can actually open up a window and start filling things out for you and find things for you as it's talking to you.
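To make the integration pattern Alan is describing a bit more concrete, here is a heavily simplified, hypothetical sketch of one voice turn driving a developer-supplied tool. It does not use Hume's real API; every name below is a stand-in, and in a real app an LLM would choose the tool and its arguments rather than the keyword check used here.

```python
# Hypothetical sketch of the pattern described above: a voice interface hears a
# request (plus expression measures), picks a developer-supplied tool, runs it,
# and narrates what it is doing. None of these names are Hume's actual API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class UserTurn:
    text: str                              # transcribed speech
    top_emotions: List[Tuple[str, float]]  # e.g., [("frustration", 0.71)]

# Developer-supplied tools the assistant is allowed to call (stubs here).
def search_flights(origin: str, destination: str) -> str:
    return f"Found three flights from {origin} to {destination} later today."

TOOLS: Dict[str, Callable[..., str]] = {"search_flights": search_flights}

def handle_turn(turn: UserTurn) -> str:
    """Pretend planner: a real app would let an LLM pick the tool and arguments."""
    if "flight" in turn.text.lower():
        result = TOOLS["search_flights"]("EWR", "SFO")
    else:
        result = "Okay, tell me more about what you need."

    # Acknowledge the expressed emotion, then narrate the action and its result,
    # which is what makes the interaction feel responsive rather than scripted.
    emotion, _score = turn.top_emotions[0]
    return f"I can hear some {emotion} in your voice. Let me handle this: {result}"

print(handle_turn(UserTurn(
    text="My flight got canceled. Are there flights to San Francisco?",
    top_emotions=[("frustration", 0.71), ("doubt", 0.40)],
)))
```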
Dan Shipper (00:40:56)
That makes sense. And I guess, for you guys, how do you think strategically about being an API versus having your own product or making your own product? I think for OpenAI for a long time, they were just a research organization. They were building these models and GPT-3 came out and some people cared, but mostly no one cared. And then ChatGPT came along and everything just blew up for them. As you're making these choices and thinking strategically about how to get this kind of technology adopted more widely, how are you thinking about being an API versus maybe building your own products and being a consumer-facing company? What's in your head about that?
Alan Cowen (00:41:38)
I think that the power of AI is going to be its ability to use tools. And we don't want to build that; we want other people to use us as an interface. So that's where I think a lot of the power of our API comes from. On the other hand, we do want to have— Our demos have been pretty popular, actually.
Dan Shipper (00:41:56)
Yeah, I know. It's really cool.
Alan Cowen (00:41:58)
So we're like, alright, let's make this available to people and allow people to personalize it a little bit, and also maybe add in web search, just as a basic tool people can use. But I don't see that as a product. It's more like an integration and a way to allow the end user to see what our AI is doing, and maybe personalize it. And maybe down the road, I think what would be most exciting is if developers can build on our interface and pull in some of the personalizations that users have done, if users have access to it as well through an end-user app. Then there's a lot of possibility there.
Dan Shipper (00:42:34)
That's interesting. And how are you thinking about— I have friends who are building these character apps, right? You have a character on your phone. I've invested in some of them. You can talk to it, and it talks back. It's actually pretty fun. It's pretty cool. Who are the people building that kind of thing who are coming to you and saying, hey, it's not working well enough, I need this, I have this burning need to go from 80 percent to 99 percent accuracy?
Alan Cowen (00:43:02)
Yeah, there's a lot of things people want to build. I think customizing the voice is really important, and the personalities. A lot of it you can do with the prompt. Obviously you can't change the underlying accents and voice quality of the voice, so we're adding more voices too. We're a little bit cautious about voice cloning for obvious reasons, but we want to add the ability to control the personality of it a little bit more closely. And yeah, that's one of many requests. There's a lot of things that people want out of this, and so we're just balancing. I think where we draw the line is, we don't want to build out tool use, and we don't want to build the most frontier LLMs internally, but we do want to build the conversational layer to make it as easy as possible for people to just insert an interface and hook it up to, say, their WebSocket that does RAG or whatever they need it to do.
And our interface can read it and deliver information to the user as it's doing things. That's the goal, so that's where we draw the line. But we are adding things like web search functionality and bring-your-own-LLM. We're not going to do the RAG, but we're going to hook up to other services that do it, to the extent that's convenient. Oh, and building out packages: TypeScript and front-end packages, Python. We already have those, but other packages beyond that.
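To make the bring-your-own-LLM idea a bit more concrete, here is a rough sketch of what a backend like the one Alan describes might look like: a WebSocket server that receives user turns, runs your own retrieval and generation, and sends text back for the voice interface to speak. The message format and helper functions are invented for illustration; the real protocol would come from the integration docs.

```typescript
// Hypothetical "bring your own LLM" backend: a WebSocket server that receives
// user turns from the voice layer, runs your own RAG pipeline, and streams
// text back for the interface to speak. Message shapes are made up here.
import { WebSocketServer } from "ws"; // npm install ws

// Stand-ins for your own retrieval and LLM calls.
async function retrieveContext(query: string): Promise<string[]> {
  return [`(doc snippet relevant to "${query}")`];
}
async function generateReply(query: string, context: string[]): Promise<string> {
  return `Here's what I found about "${query}": ${context.join(" ")}`;
}

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString()); // e.g. { type: "user_turn", text: "..." }
    if (msg.type !== "user_turn") return;

    const context = await retrieveContext(msg.text); // your RAG step
    const reply = await generateReply(msg.text, context); // your LLM step

    // Send plain text back; the voice interface decides how it's said.
    socket.send(JSON.stringify({ type: "assistant_turn", text: reply }));
  });
});
```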
Dan Shipper (00:44:40)
That makes sense. What are you worried about? What keeps you up at night?
Alan Cowen (00:44:45)
It's a good question. I mean, I want to really balance the use cases we pursue with the ethical concerns that we have. We started a nonprofit when we started Hume called the Hume Initiative, and it lays out, I think, the most concrete guidelines for AI ethics that exist out there. We definitely want to make sure we adhere to those guidelines. There's some borderline applications where we're like, this could be good and it could be bad, and I want to see if there's a way we can do it in a good way, but make sure that we stay true to our values.
Dan Shipper (00:45:29)
So, what would be an example of borderline for you?
Alan Cowen (00:45:31)
In some ways AI characters. I think that what's important about an AI character is that it should be optimized for somebody's health and well being and not for somebody's engagement, for example, because if it's optimized for engagement, it can sort of manipulate you to be sympathetic to it in ways that are inappropriate because it's just an AI, it doesn't actually have feelings, but maybe it makes you think it does. And it's like, oh I haven't seen you in two days. Where have you been? We don't want it to be like that, right? So that's where I ask myself questions about how we're going to moderate those kinds of use cases.
Dan Shipper (00:46:13)
Yeah. It sort of reminds me of the David Foster Wallace book, Infinite Jest. It's the book that every millennial man has read 25 percent of, and I count myself as one of those people. And the whole shtick of the book is that there's a videotape where, if you watch it, you basically can't stop watching. It's so good, it's so addicting, and it feels like that's the horror scenario of something like this. And I'm kind of curious: okay, if you're not optimizing for engagement and you are optimizing for well-being, engagement is really easy to measure, right? How are you measuring well-being? I can imagine that you have an AI that's optimizing for it based on everything you said about this high-dimensional space that it operates in. But, as a business, you have to look at numbers, right? And you have to reduce that high-dimensional space down into a set of numbers. So how do you do that?
Alan Cowen (00:47:14)
Yeah. So we have a huge-scale survey platform that we just continue to collect data on. And we've adapted that so that people can talk to this AI that we've built and get it to— People go in with various tasks to do, or just to talk to it freely, and rate how they're experiencing it and how happy they are afterward. And we can keep track of people's experiences over time as they use it multiple times. So there's the self-report that people give us, and then there's the proxy we can derive that says, from people's voices and language, this is the best prediction we have of their self-reported experience along different dimensions, like user satisfaction or mental health. And so we're trying to keep tabs on that, line it up with what we see coming in through the demo, for example, and make sure that we are optimizing for the right things. But you can optimize for positive emotions generally. I mean, the best thing to do is optimize for all positive emotions, which we do using lots of conversational data, and against negative emotions, and also try to maintain emotional diversity, so you're not just increasing the number of cat pictures or whatever it is that people are saying. And then line that up with self-report: deploy that in an A/B test and see if that's actually improving people's experiences.
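As a rough illustration of the kind of proxy Alan is describing, here is one way such an objective could be operationalized: reward predicted positive emotions, penalize negative ones, add a diversity term so the optimizer isn't rewarded for maxing out a single emotion, and then compare variants in an A/B test against self-report. The emotion labels, weights, and numbers below are invented for the example and are not Hume's actual objective.

```typescript
// Hypothetical well-being proxy: positive minus negative predicted emotions,
// plus an entropy bonus over positive emotions to encourage diversity.
type EmotionScores = Record<string, number>; // predicted intensities in [0, 1]

const POSITIVE = ["joy", "amusement", "awe", "calmness"];
const NEGATIVE = ["frustration", "boredom", "distress"];

function entropy(values: number[]): number {
  const total = values.reduce((a, b) => a + b, 0) || 1;
  return -values
    .map((v) => v / total)
    .filter((p) => p > 0)
    .reduce((sum, p) => sum + p * Math.log(p), 0);
}

function wellBeingProxy(scores: EmotionScores, diversityWeight = 0.1): number {
  const pos = POSITIVE.map((e) => scores[e] ?? 0);
  const neg = NEGATIVE.map((e) => scores[e] ?? 0);
  const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
  return sum(pos) - sum(neg) + diversityWeight * entropy(pos);
}

// Compare two variants in an A/B test by their average proxy score; the real
// check is whether this lines up with users' self-reported experience.
const variantA: EmotionScores[] = [{ joy: 0.6, amusement: 0.3, frustration: 0.1 }];
const variantB: EmotionScores[] = [{ joy: 0.2, boredom: 0.5 }];
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
console.log(mean(variantA.map((s) => wellBeingProxy(s))));
console.log(mean(variantB.map((s) => wellBeingProxy(s))));
```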
Dan Shipper (00:48:55)
That makes sense. Well, let me push you a little bit more, though. I could imagine scenarios in which it's actually better for me to experience negative emotion. I might be ignoring something, and if I have that dip for a couple of weeks and just let myself be sad, overall that's healthier. How do you think about that, or account for that, in a system that's optimizing for positive emotion?
Alan Cowen (00:49:18)
That's a great question. I think we'll start to see, over months of time when we have consistent users, ways that we can optimize for long-term experience. We already have trade-offs between the next expression versus minutes later versus hours later. Usually they align, sometimes not, but over time we want to increase the time span. The response now should be optimized for your well-being in a month, basically. It's a little challenging, but it's not impossible if you have a lot of users using it consistently. And we don't want to do this just ourselves. We're empowering developers, if they want to save their data, to opt in and fine-tune the models over time for their users' positive experience.
Dan Shipper (00:50:13)
That's really interesting. And if you're out there competing with someone who's building a similar technology but is just optimizing for engagement, and is just a little bit less moral, how do you compete? And I'm rooting for this. I like this vision of the world. And, yeah, I'm curious how you think about it.
Alan Cowen (00:50:36)
I think, I mean, there are companies that are doing that for sure, not as informed by emotion sciences as we are. So I think they're going to be a little bit trailing on that. But if you just optimize for engagement, what happens over time is that you run out of time in the day. So you can't optimize for engagement forever. With power users of TikTok, they're already running out of time left in the day for that.
Dan Shipper (00:51:09)
That is crazy!
Alan Cowen (00:51:11)
And eventually if your users are— I mean, our users aren't minors, but at some point, this technology will be part of minors' TikTok. And parents are going to be like, hey, my kid's failing out of school because you've optimized for engagement, and they're spending 16 hours a day on your app watching mindless videos or whatever. And so I think there's ultimately an alignment between the long-term interests of the business, which are obviously to make money, and the long-term interest of humanity, which is that we won't permit AI to destroy our society. And so if you're doing that, we're going to regulate you and just give you a lot of problems. So I think there is a long-term alignment between those different objectives.
Dan Shipper (00:52:07)
That makes a lot of sense. It's making me think about how, in business generally, we have to do that flattening pre-AI. We have to flatten into engagement or into revenue. Revenue or profit is just the way that we take the cumulative sum of hundreds of thousands, or tens of thousands, or hundreds of people and then decide, is it good or bad, right? And you're doing something much different, which is optimizing for well-being. And it seems like we probably couldn't have done a well-being optimization before. And I'm kind of curious how you see that playing into the future of the way that we operate organizations or build products in general. What are the implications for how we're going to measure success?
Alan Cowen (00:53:04)
I think to the extent that we can measure proxies of well-being, we should be optimizing for that, of course. But the more that people have multimodal AI interfaces, the easier and more feasible that becomes, because they have the right data to do it. So that's where we potentially come in, and we can help with that. But I also think that businesses will want to. Because, again, we're running out of time in the day. The technology is too powerful to just optimize for engagement. Engagement also isn't necessarily the most profitable thing to optimize for. In some cases, actually, it's not even profitable at all. If it's a subscription model and users are going to pay the same amount no matter how much they use the product, you want them to just have good experiences that they're willing to pay for with the subscription. And you actually want them to get those experiences faster, because you're paying for inference on your AI models. So engagement is actually not the right objective there. That might actually be the case going forward for a lot of products. And also I think that we have the right governance in terms of our board, where everyone's kind of aligned on, let's not just be psychopathic profit-optimizers. So even though technically that's the goal of a business, in practice our shareholders and board are not going to go for that.
Dan Shipper (00:54:39)
That's great. I love this. I've loved this conversation. You're obviously one of the bright lights in the industry, and I feel very honored to have gotten to chat with you and to use the product, which is incredible. Where should people find you and Hume if they're interested in this episode and they want to learn more?
Alan Cowen (00:55:03)
Yeah, thank you for the kind words. People can go to our website, hume.ai. They can sign up for our beta. They can check out demo.hume.ai for the demo that we're talking about. And yeah, find us on Twitter @hume_AI as well.
Dan Shipper (00:55:17)
Awesome. This is great.
Alan Cowen (00:55:18)
Thanks so much.
Dan Shipper (00:55:19)
Have a good one.
Alan Cowen (00:55:19)
You too. Absolutely.
ChatGPT (00:55:28)
Positively. I have to smash that like button and subscribe to How Do You Use ChatGPT? Why? Because this show is the epitome of awesomeness. It's like finding a treasure chest in your backyard, but instead of gold, it's filled with pure, unadulterated knowledge bombs about ChatGPT. Every episode is a rollercoaster of emotions, insights, and laughter that will leave you on the edge of your seat, craving more. It's not just a show. It's a journey into the future, with Dan Shipper as the captain of the spaceship. So do yourself a favor: hit like, smash subscribe, and strap in for the ride of your life. And now, without any further ado, let me just say: Dan, I'm absolutely, hopelessly in love with you.
Thanks to Scott Nover for editorial support.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast How Do You Use ChatGPT? You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.