Infinite ML with Prateek Joshi

How to extract intelligence from speech data with AI

Prateek Joshi

Krish Ramineni is the cofounder and CEO of Fireflies, an AI meeting assistant that takes notes, transcribes, and analyzes all your meetings. It's used by more than 300,000 organizations around the world, including 70% of the Fortune 500. He was previously the cofounder of Rumblii and worked at Microsoft on customer voice analytics.

Krish's favorite book: Genghis Khan and the Making of the Modern World (Author: Jack Weatherford)

(00:04) Introduction
(00:11) State of Play in Voice AI Today
(01:26) User Perception Shift in Voice AI
(01:59) Evolution of Speech to Text Technology
(04:05) Extracting Intelligence from Audio Files
(07:22) Impact of New Technology on Semantic Parsing
(12:46) Modern Capabilities and Gaps in Intent Recognition
(17:23) Advances in Speech Generation and Processing
(20:58) Role of Reinforcement Learning in Voice AI
(24:05) Handling Low Resource Languages with Transfer Learning
(28:32) Lessons from Building and Shipping Voice AI Products
(35:10) Rapid Fire Round

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:04.599)
Krish, thank you so much for joining me today.

Krish Ramineni (00:08.21)
I'm glad to be here, excited for this conversation.

Prateek Joshi (00:11.095)
Let's start with the fundamentals. What's the state of play in voice AI today?

Krish Ramineni (00:19.09)
I mean, there are so many places where we could start with this. I think that where we are today versus maybe five years ago is very different. And someone could have said the same thing about a decade ago, but fundamentally we have all of these voice agents that are now becoming more possible than ever before. We saw this brilliant demo by OpenAI just a few weeks back,

where they were talking to this voice assistant that sounded like it was human and there was almost no latency, which, I feel, is magical. And we're getting to that precipice where our machines and systems are gonna be able to talk back to us and think like us. And there's so much more at play, not just large language models like what OpenAI is doing, but just the fundamental breakthroughs that are happening in speech recognition,

all of these little changes, I would say big changes, that have been brought together to create this seamless experience that we see today.

Prateek Joshi (01:26.039)
Yeah, and user perception about using voice has also shifted pretty dramatically. Like 10 years ago, people thought, voice doesn't work, let me not use it, let me just type it out or get to customer support, the human. But now more people are engaging with voice AI. So for people who don't know, can you, in basic terms, explain how speech to text works?

Like when I talk into the phone and the words appear on the screen, like what happens in this modern system?

Krish Ramineni (01:58.994)
Yeah. So speech to text was done differently in the past. Traditional speech to text often used these things called Markov models. They had pronunciation dictionaries, and you were training, I would say, almost in a rules-based fashion. This is like the dictation applications of the past. Now you have these new techniques thanks to the transformer architecture, which Google announced,

which OpenAI is based off of. And I think it almost simplifies the stack to an extent. It's much more expensive, and you need a lot more data in order to train these things, but you are essentially going input in, output out with these sorts of models. And they're using neural networks. So imagine you're sending in a bunch of audio files. The system is looking at...

the waveforms, the actual changes in the audio, and it's able to understand natural language and then come out with the words. In the past, historical models would look at acoustic modeling and pronunciation, but these new systems are essentially end-to-end. And some of the techniques these guys are using let you get to insane levels of accuracy, and also understand the intent of what was said. So I feel like...

As a conversation is going, you may refer to something in the context of a business meeting versus, like, a construction meeting, which is completely different, right? And that could also determine what words, what synonyms, are showing up on the screen and so forth. So these architectures require massive data sets and a lot of training, but we are getting to a place where the system, I think,

doesn't even need an intermediary step. It will just be able to take your audio, it doesn't even need to transcribe it, and then be able to give you an output based on that.
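
To make the end-to-end idea concrete, here is a minimal sketch using the open-source Whisper package that Krish mentions later in the conversation; the model size and the audio file name are placeholders, not details from the episode.

```python
# A minimal sketch of end-to-end speech-to-text with the open-source
# Whisper package (pip install openai-whisper; also requires ffmpeg).
# The model size and the audio file path are placeholders.
import whisper

# Load a pretrained end-to-end model: raw audio in, text out, with no
# hand-built pronunciation dictionaries or separate acoustic models.
model = whisper.load_model("base")

# Transcribe an audio file; the model handles feature extraction,
# decoding, and punctuation internally.
result = model.transcribe("meeting_recording.mp3")
print(result["text"])
```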

Prateek Joshi (04:04.919)
Amazing. There are so many areas I want to touch upon here. So extracting intelligence from an audio file, there's so much to it. So let's start with just a dialogue. We use voice AI assistants, and there are these multi-turn conversations. So during these long interactions, how do these voice AI assistants maintain context?

Krish Ramineni (04:36.754)
Yeah, I mean, one is that context windows have been increasing over time. Like that's been really helpful. So they use memory networks to remember past interactions. These transformer models in general are able to also leverage, you know, reinforcement learning. You can also feed in a lot of information beforehand, like saying, hey, this audio sample is from this person, they're from this industry,

they're gonna talk about these sorts of things. So you're tuning it a little bit. You're also using semantic embeddings to help link to related topics. And embeddings were really big even when we were using just regular LLMs, when context windows were historically short. And then you start creating these, and I'm trying to keep it in layman's terms, hierarchical conversation models. And then you are trying to use the context and apply

these sorts of changes in real time, like feedback loops over time. And so I believe voice recognition can also start to get more personalized based on the way a person speaks and the phrases they use more often over time. In the context of our company at Fireflies, our belief is that the model should keep getting better for you specifically, based on every little edit you make. And...

You can also tune it by saying, hey, these are some of the unique vocabulary or vernacular that we talk about. In the past, right, historical intent recognition was just natural language parsing based, where you were looking at words. It's much more manual. But now you don't have to do that. That's the magic part. And even if you're not someone that's going to go out and build speech to text or natural language tools from scratch, right,

you can leverage off-the-shelf stuff like Whisper, all of these sorts of tools. You don't need to have a PhD anymore to do this stuff. That's the great part. Like you can do sentiment analysis. I remember in one of the common ML classes, what they do is they have you go on Twitter, scrape a bunch of stuff, and then do sentiment analysis on it. That's one of the first projects that they make you do. And you're trying to give it weightage, score it, train it, and classify it. You don't have to do any of that now.

Krish Ramineni (06:58.962)
It's an API call away. So all of this stuff is getting abstracted away. So everything I'm saying here doesn't even matter to the end user or the end developer. But it's good to know how these things are built; many of these things we built from scratch. And then you have to take all this stuff and throw it away and go based on new assumptions.
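
As an illustration of "sentiment analysis is an API call away," here is a minimal sketch using the OpenAI Python SDK; the model name and sample text are assumptions chosen for the example, and any hosted LLM with a chat endpoint would work the same way.

```python
# Sentiment analysis as a single API call, replacing the old
# scrape-weight-train-classify pipeline. Model name and example
# text are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "I've been waiting three days for a reply and still nothing."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in whichever model you use
    messages=[
        {"role": "system",
         "content": "Classify the sentiment of the user's text as "
                    "positive, neutral, or negative. Reply with one word."},
        {"role": "user", "content": text},
    ],
)

print(response.choices[0].message.content)  # e.g. "negative"
```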

Prateek Joshi (07:01.495)
Right.

Prateek Joshi (07:22.071)
Right. I think that's an amazing way to look at how impactful a new piece of technology is: before, you needed a PhD; now it's an API call away. I feel like that could be a litmus test for any new technology that's sufficiently useful. You mentioned parsing a couple of times, and I want to touch upon that. Semantic parsing has existed for quite some time, and it has seen dramatic shifts over time. So maybe quickly, can you talk about

semantic parsing and how it has evolved over time. And also today, what's the role of semantic parsers in a voice AI system?

Krish Ramineni (08:04.274)
Yeah, I mean, before OpenAI and LLMs, right, we were using pre-trained models like BERT. We were also building domain-specific ontologies for more precise things. We were using rule-based engines where we'd have categories or dictionaries of certain types of words and say, hey, find me more similarity based on that. And you're doing it on a word-by-word basis.

That's how those systems used to work. And then you would give it some sort of score, some sort of weightage. And this is how semantic parsing would historically work. And we built a lot of our technology that way. Even some of our competitors said, we've done incredible data science and we've done all of this work; we believe that if you say this word at X minutes into a conversation, you are 10% more likely to close a deal. Truth is,

Prateek Joshi (08:57.527)
I don't know.

Krish Ramineni (08:57.938)
As time went on, most of that turned out to be garbage. And you know, correlation is not causation, and you can't really be using that kind of statistical inference to do this sort of thing. So that is the traditional ML way, right? That was how we started. That's how we learned things in school as well. Whereas today it's more about understanding context and meaning, and it started as early as GPT-2

and then GPT-3.5. There is something there where it's trained on so much information, and it's predicting what the next word is gonna be, that it's able to understand tonality and meaning a lot better. And the more context you add... so if someone were to go into ChatGPT today, put in an instruction prompt, put in some sample data, and then another set of instruction prompts,

you'd be wondering, how does it know what is an instruction, what is the actual sample data if I didn't label it, and then what is the follow-up instruction? So that chaining that happens today, it's just so magical. As you add more and more data to that system, I don't even have to tell it what it is. I'll give an example of when we were doing this with support tickets and support ticket labeling.

You traditionally need semantic parsing to know, is this an urgent ticket or not an urgent ticket? And historically, the way the older systems would work is, if there are exclamation marks, or if everything is written in capital letters, the system gets biased and more weighted toward, okay, usually if it's like that, my heuristic is that that's gonna be something urgent. But there are many times where people will write in lowercase letters.

They won't use strong vocabulary like "urgent," but they'll say, I'm not happy with this, or, this is really concerning. And so those are things that the traditional systems would not pick up. So really understanding the overall context is what these LLMs do. And they're getting better over time. So if you're an end user and you're looking to implement some of this tech, everything I talk about is about

Krish Ramineni (11:18.962)
how would I use this today or tomorrow? Think of GPT-3.5 from OpenAI as maybe middle-schooler-level intelligence. And then GPT-4 is college-student-level intelligence. And eventually, when GPT-5 comes out, it might be PhD-level intelligence. And recently I was at an event with Sam Altman where he described the end goal for them, in terms of the level of its sheer intelligence:

you can weigh it based on a certain task that a PhD-level person would take a week to do; can AI do that in five minutes? That's just a sheer way to gauge intelligence. What is a PhD-level math problem or physics problem that would take them a month to solve that AI can solve in a couple of hours? And I think that's the beauty of all of this stuff;

sentiment analysis and semantic understanding are child's play in that scheme of things, right? But yeah, anyway, I want to take a step back; it's just so exciting right now. It's a great time if you are working in AI, and I would just tell people: unlearn everything you've learned and just learn to look into and adopt these newer models. Because you can't compete with some of these foundational models. They're just so far ahead right now.

Prateek Joshi (12:23.383)
right.

Hahaha.

Prateek Joshi (12:45.879)
Yeah, no, it's amazing how much progress has happened. One of the key aspects of speech is intent. And as you talked about earlier, if you learn to recognize the intent, that could mean more sales, higher close rates, lower customer frustration, faster customer tickets getting solved. So when you talk about intent recognition, what's the latest we have today?

What are modern voice AI systems good at? Can they recognize every single intent, and where are the gaps?

Krish Ramineni (13:21.49)
Yeah, that's a great question. Intent recognition can mean twofold things. One is, is it giving a canned response, or is it giving a response based on some context window or some knowledge that I've fed into it? So if you look in the GPT plugin store, a lot of times people are giving a little bit of further context to these sorts of systems. It's trained on the web.

So it knows how to make certain pattern recognition possible. What we found works really well when we're answering support tickets is its ability to use embeddings, vectorize everything, quickly go through hundreds of thousands of pages of data, find what is potentially the most relevant, and then find the answer within it. Right?
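
A rough sketch of the embed-and-retrieve pattern Krish is describing, assuming the OpenAI embeddings API and a handful of invented help-desk snippets: vectorize the knowledge base once, embed the question, and hand the closest chunk to the LLM as context for the answer.

```python
# Embed documents once, embed the question, and retrieve the closest
# chunk before asking the LLM. Model name and sample snippets are
# placeholders for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "To export a transcript, open the meeting and click Download.",
    "Billing is charged per seat at the start of each month.",
    "You can invite the notetaker by forwarding the calendar invite.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)                               # vectorize the knowledge base once
query_vec = embed(["How do I download my meeting notes?"])[0]

# Cosine similarity between the question and every document chunk.
scores = doc_vectors @ query_vec / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
)
best = docs[int(np.argmax(scores))]
print(best)  # the chunk you'd hand to the LLM as context for the answer
```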

Companies like Perplexity AI are going after the search function, like Google's, and trying to solve that. They're like, we're not gonna give you a bunch of links, we're gonna just give you the answer right away. For that to happen, your ability to do intent recognition has to be really, really good. And you have to be confident that it's not hallucinating or sending random garbage. And then I think this put pressure on a lot of other folks too, to say, okay, we can't just do similarity-based

results. PageRank is great, but people want the answer now. And that's why you see Google doing their AI Overview, which they've introduced recently; whenever you search something, they'll give their AI overview of it. So I think that's the level at which intent recognition is today. It's not classifying something as angry or sad; it's being able to go through this insane amount of data and get you an answer.

But there are risks to that as well, because we saw something where someone was asking how to make a cheese pizza taste better, and it's trained on Reddit, so it's like, you should put glue on it. So that's the thing, right? I think these systems, I feel like they're more like impersonators today, meaning they want to tell you what seems right, but...

Prateek Joshi (15:32.567)
Right, right, yeah, yeah.

Krish Ramineni (15:46.354)
The systems themselves don't understand the meaning in the same way. They're good at mimicking what the right answer should be, right? So that's why, when we ask, why is the AI hallucinating? It doesn't know what you necessarily mean, but it's trying to mimic feelings, and it's mimicking output. That's the one challenge, because all of this stuff is in a black box today. It's very hard to look into that and say, okay, why is the system saying the things that it is?

But I think over time, we're going to have models that are able to explain themselves and be able to share the reasoning. And one of the things Perplexity did that a lot of people are now adopting is giving you citations and context. It's like, where did you get this answer from and why? But yeah, hopefully that's a good place to go.

Prateek Joshi (16:34.583)
Right.

Let's touch upon speech generation. Most famously, in the GPT-4o demo that OpenAI did, they talked about how they're not taking the longer route. Meaning, in the older version, speech becomes text, then it goes to an AI system that understands everything, it outputs another piece of text, which then gets converted to speech, and then they,

you know, used Scarlett Johansson's voice to do that and she filed a lawsuit. Different topic, but how does it happen today? Meaning, going from speech directly to speech without the big intermediate step.

Krish Ramineni (17:23.186)
Yeah. So there's a lot of parallel processing going on, right? And unlike older models where speech was sequentially processed, you can basically split these things up into smaller subtasks. There's also a huge improvement in the way the AI sounds, too: the rhythm, the intonation, there's less artificiality. Like the way the

GPT-4o assistant laughs back at you and smiles and gives you that feeling of someone that's caring. It can be a little creepy, but at times it just feels like they're trying to make it more human. Even that is trained, right? It's not like it's happening automatically. The other thing is just the amount of data that's available for this sort of training; OpenAI has billions and billions of dollars,

so they can do extensive amounts of training on voices, on accents. And in fact, I believe I heard somewhere that they now have the ability to clone voices with just 15 seconds of audio; they can clone your voice almost perfectly. But they didn't release that. There are companies like ElevenLabs that are also doing this. But they didn't want to release that because of security concerns. Think about how much fraud could be done if you're able to do that. There are also neural decoders, which

are able to convert a lot of this text into high-fidelity audio. And then you're using tools like WaveNet, which produce this very natural-sounding speech. When you did text to speech in the past, it was very robotic; it's almost like it was pronouncing each individual syllable. It doesn't do that anymore. You have these models that are optimized to be multimodal,

and they've figured out how to reduce latency where they can, again, skip a bunch of steps and just go straight to the output. So all of these sorts of things are part of a larger system. What's amazing to me about GPT-4o, since we're talking about this, is its ability to integrate all of these different toolings, right? The tooling that can recognize input, that can take in video and audio,

Krish Ramineni (19:44.818)
process the context, and then give it to you with such fluidity and very minimal latency. So those are the sorts of things that I think make this possible. In order for text to speech to actually be viable in the context of agents that you can talk to, it has to work in a way where there is very little latency left.

And that is something that these systems are getting to. Before, it was just way too slow, and it took way longer than...
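
For a sense of how little code now sits in front of modern speech generation, here is a hedged sketch of text-to-speech as a single API call; the model name, voice, and output path are placeholders, and other TTS providers expose similar endpoints.

```python
# Text-to-speech as one API call, using the OpenAI Python SDK.
# Model, voice, and file name are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

speech = client.audio.speech.create(
    model="tts-1",          # placeholder model name
    voice="alloy",          # placeholder voice
    input="Sure, I've added the follow-up task to your action items.",
)

# Save the generated audio so it can be played back to the user.
speech.stream_to_file("reply.mp3")
```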

Prateek Joshi (20:21.279)
Right. As voice AI assistants are being used by an increasing number of users, and more people are adopting them, there's a lot of user interaction data being generated, which is great, much better than what we had 10 years ago. Which means now it's possible to use something like reinforcement learning to improve products. So what role does reinforcement learning play in enabling

voice AI systems and, more importantly, helping them improve through user interactions?

Krish Ramineni (20:58.194)
Historically, reinforcement learning probably just was not scalable in terms of the older models and the traditional voice systems that were used. But today, the nice part is that these models can be fine-tuned, and with each interaction, they can be improved. And the other part is that this allows the model to learn continuously from human feedback, in real time as well.

For example, if you don't want the AI to respond to you in a particular way and you tell it that, now it's taking that context and memory and telling the system: okay, when I talk in this way, this person feels more frustrated, more angry, so I need to change my tone in terms of how I talk. Reinforcement learning is also phenomenal in terms of

just the amount of compounded improvement that can happen over time. Because what I train the system on is going to be different from what someone else wants their system to be. So you can have way more personalization at the user level, company level, industry level. And with a lot of these little changes, you have to ask yourself, why is OpenAI trying to make GPT-4o,

and why are others maybe also trying to continuously commoditize this stuff? Well, you're gonna need human input, right? And they're gonna need to train on it. It's the fastest way to source data. Obviously there are other market factors at play, where open source is catching up and so on, but the user-generated data, the inputs you're gonna get from that, is I think gonna be really, really valuable. And now imagine a hundred million people

are going to be interacting with this AI. You're going to have new sets of data. Historically, all of the data it's scraped and trained on is data from across the web. But now we're having, I wouldn't call it artificial data, but... I don't know if training on AI-generated data alone will get you great models. Because some companies are taking that thesis, where you're putting in a bunch of this stuff, having the AI spit out some data, and then training on that data.

Krish Ramineni (23:20.37)
I don't think that is gonna be as good as having human interaction data with AI. That's a new data set we haven't had in the past: the way we prompt, the way we ask follow-up questions, the way we ask it to correct itself, the way we ask it to simplify things. So if, over 10 queries, I tell ChatGPT, hey, can you just simplify this answer down, assume I'm not a PhD student, just simplify it down, and I do that over and over again,

it can probably course-correct and be like, okay, from now on, I should probably just give them things in that format. So that's the power of reinforcement learning in this case.
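
The course-correction loop Krish describes can be pictured as a tiny bandit problem: keep a score per response style and nudge it with each thumbs-up or thumbs-down. This toy sketch only illustrates the idea; it is not how Fireflies or OpenAI implement feedback learning.

```python
# A toy feedback loop: an epsilon-greedy bandit over response styles,
# updated by explicit user feedback. Styles and the simulated user
# preference are invented for illustration.
import random

styles = {"detailed": 0.0, "simplified": 0.0, "bulleted": 0.0}
counts = {s: 0 for s in styles}
EPSILON = 0.1  # occasionally explore a style we haven't tried lately

def pick_style():
    if random.random() < EPSILON:
        return random.choice(list(styles))
    return max(styles, key=styles.get)

def record_feedback(style, thumbs_up):
    # Incremental average of rewards: +1 for thumbs up, 0 for thumbs down.
    counts[style] += 1
    reward = 1.0 if thumbs_up else 0.0
    styles[style] += (reward - styles[style]) / counts[style]

# Simulate a user who keeps asking for simpler answers.
for _ in range(20):
    chosen = pick_style()
    record_feedback(chosen, thumbs_up=(chosen == "simplified"))

print(max(styles, key=styles.get))  # converges toward "simplified"
```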

Prateek Joshi (24:04.055)
Products are going global now, and usually you'll have to support many languages. And users of some languages tend to generate way more data than users of other languages. So how do you handle low-resource languages? What happens when you want to build a system that works for them as well, but there isn't enough data? What do you do? And also, specifically, I want to touch upon

the role of transfer learning, if it's applicable at all, in enabling voice AI for low-resource languages.

Krish Ramineni (24:43.794)
That is a great question, because I've actually looked at a few case studies, I think it was the DeepMind team, where they taught the AI one language and then it was able to go and learn another language, just applying some of the same frameworks. Even before I talk about transfer learning, one thing to mention is that chess is a great example of this. And what we've seen is,

up to a certain point, until we started building computers that can play chess, we as humans thought that there was some finite number of moves, or some best strategies that we could use. And then AI starts coming out with way different strategies that we historically didn't know. It actually taught us something new about how to play the game of chess, or different ways to combat certain things. And that's really exciting because

we as humans probably didn't have that awareness, or maybe the AI was just better at remembering 20 steps ahead, so it was really good at that, right? In the same way, with transfer learning, you're pre-training on large multilingual data sets, and that pre-training helps you build actually universal models, which can be applied to languages even with smaller data sets. And then you're going to have shared embeddings across these different languages as well,

and you start to see words and phrases that are common in that space. Even humans that learn multiple languages, from what I can tell, try to look at the structure and understand the similarities. And I would assume, and I haven't looked this up, that someone who's learned more languages will have an easier time learning a new language because they're so good at that. And then you can also fine-tune

on the limited data set you have for low-resource languages and then improve the parameters over time to fit that low-resource language. Then you can do cross-lingual training, where you have all of these advanced models that are able to look across all of these systems. And if, for example, you have English as your primary language, it can transfer what it knows about English to some other language, let's say an Indian language where there is not that much

Krish Ramineni (27:09.266)
data available. I do think over time there are going to be companies and LLM providers that will specifically focus on harnessing a lot of these low-resource languages. Because I know there was a company that our own investors invested in where they're saying, hey, we're going to build LLMs for the Indian market and Indian languages, right? And so,

at the end of the day, there is some amount of data that you're going to need, and you're going to also need to use people and communities and hire data trainers. Why does a company like Scale AI, which does human-powered labeling, get such a premium? It's because of that, right? You have to clean up a lot of this messy data. And what I just said about transfer learning for languages

can also be adapted to different industries. At Fireflies, we built models like that. We built tooling that can work not just for sales meetings, but also for your medical meetings or a legal meeting, where the vernacular and the words that are used are different, right? So that's another example of where this type of learning can expand.
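
A compact sketch of cross-lingual transfer learning in the spirit of what Krish describes, using a text classifier as a stand-in: fine-tune a multilingual pretrained encoder on a small labeled set in a high-resource language, then apply it to a low-resource language. The model choice, tiny dataset, and labels are all invented for illustration.

```python
# Cross-lingual transfer: fine-tune a multilingual encoder on a small
# English set, then run it on text in another language. Dataset and
# labels are toy examples, not real training data.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Tiny English training set (urgent vs. not urgent), standing in for
# whatever high-resource labeled data you actually have.
train = Dataset.from_dict({
    "text": ["This is broken, please fix it today!",
             "Thanks, everything is working fine."],
    "label": [1, 0],
}).map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length",
                           max_length=64), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
)
trainer.train()

# Because the encoder shares embeddings across ~100 languages, the same
# classifier head can now be applied to a language it never saw labels for.
model = model.to("cpu")
inputs = tokenizer("यह बहुत जरूरी है, कृपया आज ही ठीक करें", return_tensors="pt")
print(model(**inputs).logits.argmax(-1))
```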

Prateek Joshi (28:31.607)
You're the founder of Fireflies, which is now used at 300,000 organizations worldwide, maybe even more, it's growing. And 70% of Fortune 500 companies use it. So you've built and shipped a product that has a large user base. What lessons have you learned, or do you want to share, from building and shipping a voice AI product that's out there in the wild? I'm pretty sure many people

love it now, and in the early days they would have complained about it. It's a combo. So, what have you learned?

Krish Ramineni (29:08.402)
Yeah, I think there are a lot of learnings on the engineering side, the go-to-market side, the design side, and the company-building side. For the purpose of this podcast, I'll probably focus more on the technology and innovation side. When we started working on Fireflies, the problem space was not solved, and it was actually a long shot. The ideas around summarization, speech to text,

all of these sorts of things were very far off into the future. So sometimes, when you are in what I call deep tech, there are a lot of factors that have to be proven out for you to be successful. So it took us a lot longer. It took us almost three years to put the pieces together to make something viable. And there's a famous quote, I believe it's by Elon, about this:

in those early phases, it's like looking out into the future while right now you're chewing on glass and it's painful. That's what it was like building this technology, but you're hoping that in the future it'll be better. So there are a lot of times where people will give up because the technology is not quite good enough yet. And there are many companies that have risen and then completely gone extinct because the technology was just too early for its time.

Larger companies have a different problem, where they are resistant to change and they want to stick with the old way forever. Classic example: Netflix versus Blockbuster, right? Netflix took the bet that eventually streaming over the internet was going to be good; you're not going to have buffering, you're going to be able to send high-quality stuff. And Blockbuster was like, no, we're going to just keep shipping DVDs, this is going to be the way to make money right now. So.

The reason I say all of this is that sometimes innovation, where it stands, is at odds with what everyone else believes in. And it's sometimes more expensive in the short run, but in the long run it might actually give you more benefits. So a lot of the things that we were doing in the early days simply would not scale. But as this technology got better, as speech got more affordable, we're able to do things today that,

Krish Ramineni (31:28.754)
honestly, two to three years ago I thought would be way too expensive. If there's any bet an entrepreneur or an engineer, especially a tech-focused engineer, needs to make, it's that this technology is gonna be democratized. Everyone is going to have access to this level of technology, and you should build for what that world looks like two to three years from now. Obviously you need to survive, raise funding, et cetera, so we had to do little things that people want now,

but we always built for what the world would look like in a couple of years. And if you have the luxury of being able to survive that long, it will be well worth it. And there are just so many spaces where people fundamentally don't see that, time and time again, right? We've had this with smartphones; they were very expensive. I guess you can say the premium phones are still expensive, but now you have so many options where you can get incredible

capabilities from some of the other providers, not just Samsung and Apple. Flat-screen TVs: LED and LCD TVs used to be incredibly expensive, but now everyone has them, right? The cost has gone down significantly. Now let's take that and apply it to AI in a sector. So if you are going to say that in the future we are going to have an amazing AI assistant

that can be your primary care physician, and it can handle a lot of your first line of questions, then every human in the world is going to have this AI primary care doctor. I'm not saying it's going to replace your existing doctor, but you're going to have that. And if you have to go to a doctor, for example in the US, it's really expensive; the co-pays, the fees, it's very, very expensive. And I have a lot of relatives, uncles, aunts, that are doctors in India, and

if you want to go see them, you have to wait many hours; it's hard to get appointments and so on. So imagine you're an entrepreneur saying, okay, I want to build in this healthcare space, and I want to build this AI assistant that can help with finding symptoms, reviewing lab work. I know a medical student today will say that AI will never replace me, it's never as good as me, I'm going to know things better than the AI. But can you say the same thing two to three years from now with confidence?

Krish Ramineni (33:51.73)
That AI, if it can pass the medical entrance exams with a perfect score in five minutes, and it can retain information, let's just talk first principles, right? It can understand all this information, it can process it, and then it's creative enough to use transfer learning and stuff to be able to diagnose as well. What does that mean for this market? So you as an entrepreneur can take that and say, okay, how do I build for a future where everyone has these sort of tools?

And that's helpful for doctors. Yes, it's scary for doctors, because they're going to have to evolve and they have to move fast, because what they think is unique and special today, that only they can do, these systems are going to do. But I do believe that if you're an entrepreneur and you see this two to three years ahead and you think 10 steps ahead, you can build these tools that are going to make doctors superhumans. Not that they already aren't, but you can make them superhumans and do more work

Prateek Joshi (34:33.047)
Right.

Krish Ramineni (34:49.554)
with less effort, like see 100 patients with the effort it takes to see 10 patients. So that's the thing I would like to tell everyone: if you're working in the AI space, don't build for what's now or in the next six months. See where it's going in about 12 to 15 months. It'll be radically different. We might have something even better than GPT-4, right?

Prateek Joshi (34:55.287)
Yeah.

Prateek Joshi (35:09.399)
Amazing. With that, we're at the rapid fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready? All right, question number one. What's your favorite book?

Krish Ramineni (35:18.93)
Yeah, let's do it.

Krish Ramineni (35:23.954)
I love this book called Genghis Khan and the Making of the Modern World. It's about the time of Genghis Khan and how the Mongols took over a large part of the world with limited resources, and some of the things they did that were so innovative at that time. So it's a really interesting book. I would say it's definitely one of my top three books, because of what we

learn about adaptability, being creative, and all these unconventional rules of business. Because I never went to business school. But I think history teaches you a lot about people, about the rise and fall of empires and so on. So I think it's just a great analogy. And I'm sure the author didn't intend for it to be used as a business use case, but I find a lot of interesting analogies in it that also apply to the world of business.

Prateek Joshi (36:22.487)
I'm a huge fan of history too. And I think if you just study what people have done, what the rise and fall of empires can teach, there's a ton of information there. So yeah, I agree with that one. All right, next question. What has been an important but overlooked AI trend in the last 12 months?

Krish Ramineni (36:44.082)
Important but overlooked. I will have to say that a lot of us did not really think about the implications of what happens when this tech becomes better and better at a rapid rate. We don't know what we don't know about a few years down the road. So, you know, I just had a random thought one day while I was going for a run: today the AI is good, and it's going to get better, so

how do we think about governance and security with AI? Because we all want AI to do work for us. Will there ever be a time when the AI decides that what we want it to do is not in our best interest and decides to do other things, right? And again, with all the AGI stuff, I'm not trying to do fear-mongering, but technically AI can write code, AI can make certain decisions. I can give AI

a couple of candidates and say, tell me which one you think is the best one. So it can make its own assumptions, and it can also change its assumptions from time to time, like if you say, hey, I think candidate two is better than candidate one. So I think governance is really important, because if that same AI decides, you know what, I don't think what you're doing here is good, let me just go in and change a little bit of your code, let me do some of that, then that leads to some sort of vulnerabilities.

It sounds like a crazy scary movie waiting to happen. But I think just the rate where we're going, we need to be a little bit more serious about that. That's why people talk about regulation. People talk about AI safety, AGI, all of this stuff. That's going to become a bigger topic in the next coming year, for sure.

Prateek Joshi (38:32.023)
What's the one thing about voice AI that most people don't get?

Krish Ramineni (38:37.746)
Where voice AI is going is that we're going to have agent-like assistants, like in the movie Her, that we can talk to. And we're already seeing it. What needs to be seen is, will talking to my AI be faster than me clicking a few buttons on an interface? And I would bet there are certain use cases where we can just completely skip the UI

and dashboards. There are some places where you still need them, but that is something we should keep an eye out for. Because people said the same thing when chatbots came out; five years ago, chatbots were stupid, they were not very smart, but now they can do a lot of powerful things thanks to these LLMs, and I don't have to say a phrase in a particular way, like a command prompt, right? And it's the same thing with these voice assistants. It's not going to be like Siri,

where you have to say things in a particular way. It's not going to be like Alexa. It's going to be 10, 20 times better, and it's going to be faster. So I think a lot of people are not fully sure yet, but I think it's going to drastically change the way we interact and interface with computers. And in some cases it may become the primary input experience, rather than clicking on stuff in a dashboard.

Prateek Joshi (40:00.471)
What separates great AI products from the good ones?

Krish Ramineni (40:06.226)
People talk about this question of, is your AI a wrapper that anyone else can build, or is it doing something deeper? I think great AI products are building workflows that solve a use case end to end. Whereas if your product is just AI, that's not going to be defensible.

But if your product is a solution, building a workflow for some end users, that's going to be more defensible as an entrepreneur. Because anything that's low level, OpenAI, Microsoft, and Google are going to eat that up. And there were a lot of companies in the past that were like, we're going to just help you write great copy, right? Marketing copy, sales copy, et cetera. And then ChatGPT was released and that forced them to innovate, right? So I think with AI as a product,

it's going to initially work, because you could have all these gimmicks and ways to attract people because it wowed them. But it's really going to be very difficult to compete with the big players on just a low-surface-area type feature. You know, Microsoft recently released these computers that can look at everything that's happening on your screen and tell you what's going on.

And there used to be a startup that did the same thing, where they'd look at everything on your screen and you could ask it what's on your screen. But now, at the device level, you're competing with Microsoft; that's a tough battle, and you didn't have anything more than just ChatGPT on top of that. So it's much more important to, let's say you're working in the insurance industry, be able to take an insurance ticket, file a claim, debate that claim, argue it, and win the claim for you.

Now you're solving an end-to-end use case or workflow. That's something OpenAI is maybe not going to go after, and it's much more defensible. So I think the best AI companies are going to solve workflows.

Prateek Joshi (42:05.751)
That's actually a great one. And I agree with that. I think the magic happens when the product just knows how I work and it just works from end to end. You don't make me stitch a bunch of stuff together to make it work. So that's a great one. All right, next question. What have you changed your mind on recently?

Krish Ramineni (42:25.938)
What have I changed my mind on recently? Okay, sorry. Once more:

What have I changed my mind on recently?

Krish Ramineni (42:41.074)
Hmm. I think that AI is going to become way more affordable than I thought it would be. I thought it would take about two to three years for everyone to be able to have this level of access, but it's happening faster. And I didn't think a company like OpenAI would make GPT-4o free, multimodal, audio, voice, everything. So I think that's also really, really powerful.

Prateek Joshi (43:08.151)
What's your wildest AI prediction for the next 12 months?

Krish Ramineni (43:13.17)
My wildest AI prediction for the next 12 months is...

Krish Ramineni (43:21.266)
GPT-5. I would love to see it, and I would love to see it become as affordable as GPT-4, even faster than GPT-4 did.

Prateek Joshi (43:22.743)
Hahaha.

Prateek Joshi (43:36.087)
Final question: what's your number one piece of advice for founders who are starting out today?

Krish Ramineni (43:42.418)
My number one advice is, and I always say this, focus on the problem. Don't chase the shiny tech. When you focus on the problem, the solution will naturally come out of it. But if you say, hey, this is cool tech, so I need to now go build a company around it, very few companies succeed that way. So try to solve a problem that you have or a customer has, and then use the technology as a means to solving it in a better way,

rather than saying, okay, I love LLMs and I love this part about LLMs, let me go build a company around it. I don't think that's how it works.

Prateek Joshi (44:19.671)
Amazing. Krish, this has been a phenomenal discussion. We went so deep on voice AI, and you know, more than most people, both the pain and the joy of shipping a great voice AI product. So thank you for coming onto the show and sharing your insights.

Krish Ramineni (44:37.138)
Yeah, thank you so much. That's great.