"World of DaaS"

Groq CEO Jonathan Ross - Next Gen AI Hardware

World of DaaS with Auren Hoffman Episode 143

Jonathan Ross is the founder and CEO of Groq, a company that develops high performance microchips purpose built for AI and machine learning. Prior to founding Groq, Jonathan invented Google’s AI processor, the TPU. 

In this episode of World of DaaS, Jonathan and Auren discuss: 

  • Future of industry in the AI era
  • Evolution of AI processing hardware
  • Semiconductor supply chains
  • Challenges and innovations in chip development
  • Prompt writing tips from an AI pro


World of DaaS is brought to you by SafeGraph & Flex Capital. For more episodes, visit worldofdaas.buzzsprout.com, and follow us @WorldOfDaaS

You can find Auren Hoffman on X at @auren and Jonathan on X at @JonathanRoss321.

Editing and post-production work for this episode was provided by The Podcast Consultant (https://thepodcastconsultant.com)


Auren Hoffman:

Welcome to World of DaaS, a show for data enthusiasts. I'm your host, Auren Hoffman, CEO of SafeGraph and GP of Flex Capital. For more conversations, videos, and transcripts, visit safegraph.com/podcasts. Joining me today is Jonathan Ross, the founder and CEO of Groq, a company that develops high-performance microchips purpose-built for AI and machine learning. Prior to founding Groq, Jonathan invented Google's AI processor, the TPU. Jonathan, welcome to World of DaaS.

Jonathan Ross:

Thanks for having me, I appreciate being here.

Auren Hoffman:

I'm really excited. You have some really interesting ideas about data and compute and the idea that we're transitioning from an information age to a generative age. What does that mean and what does that mean for data?

Jonathan Ross:

We transitioned into the information age the first time we were able to make high fidelity copies of data and distribute them throughout the world, and over time that changed the way that we did business. All of a sudden, it went from being about who had the answer in the moment to who came up with the answer. There was the origination of the idea as a concept and as a monetizable thing, and we got new technologies over time. We got the internet, which was about making high-fidelity copies and distributing them instantly. We got mobile, which was about doing it in your hands, but really these technologies were the same as the printing press. They were just much better, and that degree of change was enough to break all of our intuition to some degree.

Jonathan Ross:

Generative AI is not an information age technology. It's not about taking copies of data and distributing them. It's about creating things in the moment. It's creative, it's generative, and you're going to get an answer in the moment that is custom to you, and that requires compute. So that changes the paradigm and it breaks all of our intuition. Instead of it being about who has the data, it's about who has the compute, who can create the answer for you right now.

Auren Hoffman:

And does that mean data is less valuable than before and compute is more valuable?

Jonathan Ross:

It's stacked. You can't have the models without data, but you also can't have an information age economy without energy, and the industrial age was really based on energy, and so what we have is this new stacking, but it does change the paradigm a bit. Oftentimes when we're talking to people it's so funny because it's almost the more sophisticated someone is, the more entrenched this idea about being close to data is. When you and I first met, did I show you the demo at all of what we had?

Jonathan Ross:

Yeah, I think so, and we were like 9,000 miles away, some number of thousands of miles away from where the servers were, and it was almost instant. And so, if you think about it, the number one export of India is tokens. They're just generated by human beings as opposed to generated by a computer, and you don't need locality because you have a small amount of data coming in. You have an enormous amount of compute producing an answer and then you send a small amount of data back. So oftentimes we'll get two bytes, perform 180 billion operations, and then send two bytes back. Like a customer service agent or something like that.

Jonathan Ross:

Exactly. We don't even put hard drives in our servers, we just completely decouple and we're just focused on the compute. And data is incredibly important. It's just not what we do. We're doing this generative age thing, but we build on top of the work of others who do things in data.

Auren Hoffman:

Let's talk our own book a bit more, talking about where the world is going here. In the world it's going to, what is this transition generally going to look like, and what does it mean for the different players? There's a data provider, there's compute, there's obviously the bandwidth stuff, there's the energy. Where do all these things stack up vis-a-vis one another?

Jonathan Ross:

How will the power shift? Well, the power will shift in one way, which is historically, you had a lot of power just because of your geographic location. And literally, you could pull oil out of the ground. And then we started building data centers, and a shift there was: where you build the data center gives you some power. It's not just where you find the resources, but where you build it.

Auren Hoffman:

Why is that? Why would Oregon have more power than Nebraska, or something?

Jonathan Ross:

Well, in Oregon's case, I don't know that they do have more power, but if they did, it's probably because they have a lot of hydro power.

Auren Hoffman:

You said somebody has more power because the data center is closer and stuff.

Jonathan Ross:

I mean more technologically. So think about it this way Right now there's a conflict between the US and China, and that conflict is about largely getting access to many technologies, but in particular AI compute, and there's a strength leads to strength aspect here, which is with things like oil. Being better at pulling oil out of the ground doesn't give you more oil to pull out of the ground. It doesn't have this cycle where the more oil you pull out, the more oil you get. But with AI and these sorts of technologies, the better you get at it, the more you're able to get better at it, and so you could actually pull away from others, and so there's a bit of a race right now to see who can actually get a lead on this. But whoever has the initial lead could potentially pull away far enough that it'd be hard for others to catch up.

Auren Hoffman:

There's this compute side, there's the chip side of it. There's, let's say for lack of a better word the algorithmic side. This would be like an open AI or some of these anthropic type of models. There's other types of things, is it? Every single piece of the ecosystem is equally as important, or do you think some are more important than others?

Jonathan Ross:

If you're missing some of it. You may not be able to get all of it, but some of these pieces are going to be just given away for free, for example, gosh. It's just open source, open source models in particular. If you think about it, linux won when it had enormous headwinds. People did not believe in open source at the time of Linux. Linux had to prove it. Now open source is considered the de facto way that things will end up going for most projects like this, and many of the people that we speak with are just operating on the assumption that eventually, an open source model will win, and the question is, which one?

Auren Hoffman:

One of the reasons Linux really started taking off was the aftermath of the dot-com crash. Prior to that, everyone was buying Sun and they were very happy with paying just extraordinary amounts for those Sun boxes. And then the Linux box was 10 times cheaper or something than the Sun box, and because all of a sudden money was important and they had to save costs, Linux started taking off. Is that like a moment that people are going to start thinking about here, or how do you think about that?

Jonathan Ross:

Probably one difference here was the rate of adoption for the early internet stuff was much lower than the rate of adoption for generative AI. It's really only been a year or so, and the rate of growth of companies doing this is crazy. Yeah, it's very different. What you're seeing is people are already starting to care about things like is this a good product For the early internet? It was. This is the internet provider that I can get access to. I'm going to use Uncle Bob's Serve, your local ISP. It was the only one you had. Oh, this is the one website I know to go to this thing.

Jonathan Ross:

Now what you're seeing is there's really a focus on who's going to be the winners, and so if you have a much better approach, people tend to flock to it now. So there's a cost concern. No one wants to get stuck in something proprietary. This is one of the big ones that we hear, and also the assumption that open source is going to move faster, because that's what people have seen in the past, and so you don't want to be on some proprietary thing, which, yeah, it's ahead right now, but open source is going to pull ahead as we move to this compute-driven generative age.

Auren Hoffman:

What are the societal and economic implications going to be?

Jonathan Ross:

I can't predict what will happen, but there are a couple of things that we've noticed. The entire point of going into a different age, and why you would call it a different technological age, is that it breaks all of our intuitions, and one of the most interesting ones that I think is completely broken is we keep thinking of each technology as displacing work. One of the things that's probably going to happen is we will probably create more jobs for people than we have people. There will suddenly be a lack of supply of people to do things, and I'll give you a concrete example. It used to be that no one would have a graphic in their articles, their random news articles, but now it's so easy to create one. Most articles have some sort of graphic, most blog posts have a graphic, and people probably spend more hours overall as human beings generating these because it is so easy to generate them.

Jonathan Ross:

And this is called Jevons' Paradox, and this was noticed in the 1860s by someone writing a treatise on coal, where what he realized was every time steam engines got more efficient, rather than buying less coal, people bought more coal. Steam engines get more efficient, so why are they buying more coal? Well, opex goes down, more things become economical, people do more things. And so what will probably happen is, with most of the things that generative AI makes easy, you will actually see an increase in human activity on that, and there's always going to be someone who's going to be more entrepreneurial and figure out a way to monetize that and get a whole bunch of people working on it.

Auren Hoffman:

I haven't heard of Jevons Paradox before. Is that a similar thing, where for the layman it might be like, okay, they made the highway from two lanes to three, but the traffic is still just as bad because more people drive on it? Or is that a good way to

Jonathan Ross:

think of it, yeah, or price elasticity is another way to say it.

Auren Hoffman:

In this kind of new world? If you had a bet on companies, what types of companies do you think have the biggest advantages in this kind of more future world?

Jonathan Ross:

I'm going to shift that one a little bit. I'm going to say there are going to be companies that are definitely going to be successful in this, and then there's going to be companies that might be and it's going to be almost impossible to predict who but they might be even more successful than those other companies. When the information age came along, it was a great time to be in material production for paper. Now, which newspapers were going to succeed? I don't know. Don't know how that works yet. So this is a picks and shovels thing, where we know we're going to need energy, we know we'll need more data that's going to be a thing but we're also going to need more compute. But these are the picks and shovels of it. Then there's going to be which generative AI company is going to be the next Google? Which one's going to be the next Microsoft?

Auren Hoffman:

Let's say we take airlines or something. Is it just that some airlines are going to do super well and some airlines are not going to do well, because some airlines will adopt it faster than others? Or do you think all airlines get wiped out, or all airlines do better? And you don't literally mean airlines, you mean it could be anything: healthcare, trucking, just go down your list of whatever industry this is.

Jonathan Ross:

I will say this much: every single industry will be disrupted by generative AI.

Auren Hoffman:

You think faster than the information age.

Jonathan Ross:

Much, much, much, because there was a recent study where doctors competed against AIs to make diagnoses. The fascinating part about this was, you're not going to be too surprised if I say the AI outperformed the human being, given the same data. We're probably going to see AI involved in diagnostic medicine; I just don't see how that doesn't happen, and, ironically, it'll be a little slower for surgery. We already have robots for surgery, but for diagnoses you can just feed it the data, and it can already do better. The shocking part was when they paired a human doctor with the AI, the results were worse than just having the AI give a diagnosis.

Auren Hoffman:

This is true in chess too. It used to be that a human with the AI was better than the AI, and then, very quickly, the AI was clearly better than the human plus the AI, because the human would overrule it and stuff.

Jonathan Ross:

What I expect is we'll get to a point where medicine becomes so consistent that if you go in and you don't get a common issue treated, that will be malpractice as opposed to. You just did something egregiously wrong.

Auren Hoffman:

What about things like law enforcement and national security? How will this general world play there?

Jonathan Ross:

Let me start with the national security part. So there's a concept of first, second and third offset and a lot of people refer to the first offset as when gunpowder was invented it totally changed the way war was done. The second offset was nuclear because it totally changed the way that conflicts were done. A lot of people are now saying that AI is probably going to be the third offset. Again, intuition is totally different. So with nuclear, if one actor has a small number of weapons, you want to avoid that conflict at all costs. While it increased some risk for the species as a whole, it decreased the risk of conflicts happening.

Auren Hoffman:

At least major conflicts. You'll have proxy conflicts, like Vietnam or something, happening.

Jonathan Ross:

Exactly. It forced these conflicts to be smaller and more indirect: it's not really me, but it's me. With AI, the cost to launch a sort of AI attack of disinformation or these sorts of things is so low that you're probably going to see an escalation, and this changes the dynamic. And what you're going to need to be successful is not some amount of AI capacity. You're going to need AI superiority, like you need air superiority, and whoever has AI superiority will be in a very good position. If one entity has a third of the AI capacity of another, they will probably be in conflicts all the time. You're going to have to just dramatically overwhelm any sort of adversary.

Auren Hoffman:

In a world where, potentially, even if the US has better offense, it may have more vectors of attack, which means it will have to have significantly better defense, or how do you think about that? In nuclear war, I don't know the US doesn't have necessarily any worse vectors of attack than anybody else. Everyone is vulnerable, but in an AI war, it seems like the US is more vulnerable to attack.

Jonathan Ross:

I'm particularly concerned about elections. My philosophy is that of all the concerns we have about AI, the key is that the elections work and remain free. We don't allow foreign money in elections; why would we allow foreign compute? If we can preserve the elections, that'll give us time to fix the issues with AI that we discover. But that is the most concerning piece of our infrastructure: the elections.

Auren Hoffman:

Now, before we get into all the new paradigms that you're working on: pretty much anyone who's listening to this is probably well aware that NVIDIA has been this incredibly successful company. Why have these GPUs from NVIDIA been so successful over the last decade?

Jonathan Ross:

There are a lot of reasons. First of all, let's go back a little bit in time and talk about why CPUs became successful. Intel, most people have heard of Intel, originally was a memory company, and they begrudgingly switched to CPUs. And CPUs became a much better moneymaker. The reason is CPUs aren't a standard, and so switching costs are higher. If you've ever heard of the framework from Hamilton Helmer called Seven Powers, CPUs are like the personification of that. And then you see Intel, even with the Intel Inside branding and all this other stuff.

Auren Hoffman:

Because they were both harder to copy and not a commodity like the memory types of things. You couldn't swap it. I mean, AMD had an x86 chip that was, I think, pretty similar. Was it just because it cost $2 billion to create a plant? What was the reason the power was so high for, like, an Intel?

Jonathan Ross:

There's another aspect here, which is when you're deploying a chip, it's not the cost of the chip that matters, it's the cost of all the infrastructure. So if you can increase the performance of a chip, the CPU, by 15%, that effectively gives you 15% more value out of the power that you're sending in, out of the concrete and the data center floor, out of the racks; every single part of it gets 15% more economical. And so that means that a small performance advantage makes an absolutely huge difference in terms of your outcomes. What you would see is a lot of the data center providers would build a system with AMD chips just to try and price negotiate with Intel, with no intention of ever deploying them at volume.

Auren Hoffman:

So the AMDs weren't really copies. The Intel chips were just better. They were priced higher but they were better.

Jonathan Ross:

Correct, and this was a strength leads to strength kind of thing. When you pulled ahead, you got an advantage. But there was also a double-sided market, because you would write code for x86 and then people would buy x86 systems to run that code. And then there were all these bugs in x86, and that became a feature, because eventually all the software had to support these weird bugs, and AMD had to copy the bugs of Intel, and there were literally people who were experts in the bugs in x86 just to make sure that people could make copies efficiently. Now, with NVIDIA, what NVIDIA did very well was, there are probably at least two major things, but then there's a bunch of other things around them. The first was CUDA was a double-sided market, very much in the same way, because there's no known algorithm today to automatically compile stuff down to GPUs.

Auren Hoffman:

Not everyone listening to this probably understands what CUDA is, so can you explain it for, like, a relatively smart person?

Jonathan Ross:

There's what CUDA is advertised as and then there's what it really is. So CUDA is advertised as the development platform that you use to write code for NVIDIA chips. That is trivial to replicate. That has basically zero value.

Auren Hoffman:

Oh, I didn't realize that. So anyone could just create a new, something else to go do that.

Jonathan Ross:

That's the easiest thing in the world. In fact, this is one of the things that eroded the x86 Intel dominance: things like .NET, LLVM, Java. Java was a little more brute force, you had to get people to target it. But these other things, your C++ would compile to it, and all of a sudden the switching costs went down. It would work with anything. Yeah, and the same is true of the actual CUDA programming languages. Those are trivial to re-implement.

Jonathan Ross:

The hard part is what's called CUDA kernels. What a CUDA kernel is: if I write a program, let's say I create a video game, I want that to run well on a GPU. The problem with that is there is no known algorithm to take that code and to get it to work on a multi-core system efficiently. So human beings write these kernels based on the code that you wrote. So if you're a major video game design studio, you send your game over to NVIDIA and they will write some kernels, and then they're in the driver, and when someone loads up that video game it runs faster, just like someone writing assembly to make a C++ program run faster. Now, this is needed because of the multi-core nature, and this is viewed as a huge advantage by NVIDIA. When we did TensorFlow at Google, we ourselves had to write the kernels for NVIDIA's GPUs, otherwise TensorFlow wouldn't have been relevant, and so it's got this double-sided market feeling to it that makes it almost impenetrable. So that's one.

Jonathan Ross:

The second is that NVIDIA has forward integrated beyond what most people have noticed. Typically, most companies will build a chip, they'll build a system, they'll build networking, they'll do some software. They don't do all of them. And what NVIDIA did was they started with the GPU and CUDA, so the software, whereas AMD was more letting other people write the software for them. And then they added a system, the whole DGX boxes. Then they bought Mellanox to bring in networking. Now they're doing their own cloud to compete directly with their customers. That's the thing: they've always forward integrated.

Jonathan Ross:

So there was actually a recent example, I forget which company it was. There was a company that made NVIDIA graphics cards for a very long time. 80% of their revenue, or some large number like that, came from selling NVIDIA cards, and one day they just announced: we're done, we give up. NVIDIA has made it impossible for us. They've squeezed out all of our margin. They make their own cards. We give up. Even though it's 80% of our business, we're leaving it. And so what they do is they just forward integrate, and every time they get something established, they start forward integrating into the next part.

Auren Hoffman:

But they still are a big TSMC customer.

Jonathan Ross:

Absolutely, and that would be going the other way. They tend to forward integrate. They don't tend to go back into their vendors. They take what their customers do and they do that themselves. There's a reason why they go that direction rather than the other direction, which is there's more margin the more up the stack you go, and they're trying to capture more and more of that margin for themselves.

Auren Hoffman:

Instead of like selling NVIDIA systems to other people, who then rent them out, they could just rent them out themselves.

Jonathan Ross:

Yeah, I mean, that's what they're starting to do, and they just announced an inference service. So they're even going to go beyond that and start selling, I guess, token as a service and compete with all their token as a service customers. So when you get to that level of power, people can't really say, hey, I'm going to drop you because you're going to compete with me. They're like I have no choice, I got to keep buying this. And every dollar that they send to NVIDIA is just used to develop more R&D to replace them.

Auren Hoffman:

They're clearly the biggest GPU customer and obviously the only way to really compete is with a new paradigm. You're working on this LPU. What is different about that and why is that important in this kind of more generative world?

Jonathan Ross:

We really did two things very differently from GPUs. For one, we spent the first six months working on our compiler, so by the time that we started designing our chip, we already had the software working, and we were able to avoid the whole kernel issue altogether, because it's just a completely automated compile. The second thing was we really built all of that stuff, all the networking, the system, all from the beginning. So when we are running an LLM on our hardware, we're actually running on hundreds or thousands of chips, hundreds or thousands of our LPUs, not one or two or eight or whatever like you do with GPUs. We can do that because we have our own integrated interconnect that allows us to scale up to more chips, and it's completely synchronous.

Jonathan Ross:

Just imagine that you have a whole bunch of meetings scheduled, and that allows you to have a fairly efficient day where you're meeting with a whole bunch of people. Imagine if you were trying to meet with a whole bunch of people and you couldn't schedule a meeting. That's the way it works inside of a CPU, a GPU, or any other architecture. We have the scheduling, but we had to do the interconnect, the networking, to be scheduled as well. We had to do the drivers, we had to do the compiler, we had to do everything from scratch, and that was the only way.

Auren Hoffman:

And because you're going between all these chips, there's a need for this super low latency.

Jonathan Ross:

Correct, and that's really the key. When you're running training, you don't need low latency. Training, you're going to finish in a month; all that matters is that you keep the hardware busy. But with inference it's very different. It's about how quickly you can give an answer, and that means that everything needs to be scheduled. Imagine if a task required 792 people to touch it. Well, if they're not each doing their task at an exact moment, it's just going to take forever.

Auren Hoffman:

How do you design something in a new way for super low latency? What does one have to do today that people weren't doing in the past?

Jonathan Ross:

Really, you just have to redo the entire stack from scratch. The biggest problem is everyone's trying to solve AI compute with features rather than products. They're like, oh, I've come up with this one change to one portion of that entire forward-integrated stack, and that's my advantage. And now come write software for me, come build networking around me, come build a system for me, come bring all your frameworks. And no one wants to do that, because it's just too high of a cost for too little of a gain. In our case, what we did was we just made it completely compatible with PyTorch, which is what everyone develops in. So you have a PyTorch model, it just works on our hardware, there's no effort required. And also our API is compatible with OpenAI's, so you just change it to point to Groq.
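
As a rough illustration of what that swap looks like in practice, here is a minimal sketch using the OpenAI Python client pointed at an OpenAI-compatible endpoint. The base URL, model name, and environment variable below are illustrative assumptions, not documented values:

```python
# Minimal sketch: reusing existing OpenAI-client code against an
# OpenAI-compatible endpoint by changing only the base URL and key.
# The URL, model name, and env var are illustrative assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # assumed endpoint, check provider docs
    api_key=os.environ["GROQ_API_KEY"],         # hypothetical environment variable
)

response = client.chat.completions.create(
    model="llama3-8b-8192",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```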

Auren Hoffman:

So basically people have already written this code. It just works, so they don't have to redo it or something. So you kind of can already bring in the developers immediately.

Jonathan Ross:

Exactly. So we went viral. Actually, it's been almost a month now, and we already have 70,000 developers. For comparison, it took NVIDIA about, I think it was, seven years to get to 100,000 developers. We're on track to get there in seven weeks.

Auren Hoffman:

Wow. It doesn't seem like Apple is usually in these conversations about generative AI, but they have this advantage, or maybe you think it's a disadvantage, that they design their own chips. Are you bullish or bearish on Apple?

Jonathan Ross:

I'm bullish, but because I'm bearish. I think Apple is so far behind everyone else that it's going to embolden them to make some smart decisions. They're actually going to partner with others, they're going to do things, whereas I think Google has it the hardest, because Google is so far ahead of most others that it's hard for them to recognize.

Auren Hoffman:

They have their own chips. They have their own software.

Jonathan Ross:

I mean, they wrote the "Attention Is All You Need" paper. They're so far ahead, and what they're not realizing is that their strategy isn't working and they need to do some other things. I would say the same for some of the other hyperscalers, frankly, all of them. This is going to sound a little weird: I think Apple has the best advantage because they're probably the only ones who realize that they haven't locked it down. I think Microsoft, Meta, and Google all think that they're the winners at this point, based on what they're doing, based on conversations, but I would say Amazon realizes they're a little behind, and I think that's going to give them a little more flexibility. And you see this all the time: the folks who are the most behind sometimes end up getting ahead of everyone else because they don't find what they're doing that precious. Apple just canned their self-driving team and said, you are now working on generative AI.

Auren Hoffman:

In the past, the biggest expense for innovation was salaries of engineers and potentially in the future the biggest expense is going to be hardware plus infrastructure. How does that change the industry?

Jonathan Ross:

This is a little bit of why Groq got started, because we want to preserve human agency in the age of AI, and the concern is that if a small group of people has all of the compute, then they will have all the say. We want to make sure that everyone gets access to compute. This is a real concern, because you will be able to make your employees and your partners and everyone else more efficient by giving them more compute. This is not like search. With search, which is an information age technology, you build your index and then that index is retrieved from. You can improve the quality a little bit by searching deeper in the index, but it's not a game changer in terms of quality. You're building that index for everyone and they all get the same index. Now, with LLMs, you're probably building that model for everyone because it costs so much to build. But actually the more compute you give, the better the results get.

Auren Hoffman:

How linear is that? To get twice the results, do you have to give it 100x the compute, or how does that work?

Jonathan Ross:

I don't know that that's well established yet. But let me also put it another way. If you were going to work with a consultant and one consultant was 10% better than this other consultant, would you only pay 10% more? I'd pay more than 10% more and that's generally true of cognitive tasks, and so, even if it gets super linearly more expensive, you're going to want that better result because you're probably banking a lot on it. You're going to be making some strategic decisions, and so it really can have an outsized return.

Auren Hoffman:

In the chips world, there's a small number of super expensive chips out there. Some of them maybe go into your phone; some of them are, let's say, the GPUs or whatever it might be that are out there. And then there's a very large number of chips, many of which were made even 20, 30 years ago. That's the vast majority of things. Are we going to see something similar, where it's going to bifurcate: these models will be used for the very, very important questions, but then I'll use the super cheap models over here for other stuff? Or how do you think people are going to be using these things?

Jonathan Ross:

I definitely believe that more difficult questions will get more compute applied to them and easier questions will have less compute applied to them. The way that this is going to work is a lot like playing chess. So when you play chess, you can play speed chess or you can play normal chess, where you think a lot but it takes a lot longer. And if you are charged by the second for the hardware, which is effectively what you are, then you're going to want to do speed chess wherever you can. You're going to want to just go with the stream of consciousness output of the tokens. You're not going to want to have it think really deeply and try and come up with a better answer. There's also an element of, as you were alluding to, smaller and larger models. That has an effect on the cost and the quality as well. But what bigger models do is bigger models actually improve the intuition of the models. Even if you run a smaller model, if you run more compute on it, if you have it search deeper, you can end up sometimes getting better answers, and so for some tasks you may want a smaller model. Especially if it's an area you've never tried to answer something in before, you don't really need a large model. What you really need is a lot more of those compute cycles to figure out the answer.

Jonathan Ross:

But then there's another side to this, which is the larger the model, the fewer hallucinations you get. And the reason that happens is, have you ever gone somewhere and you're like, gosh, the GPS took me here, this is not where I meant to go? It's because you had some extra information unrelated to driving, about the neighborhood, about where you're intending to go, and you're like, this is wrong. Well, the bigger the model, the higher dimensional it is, the harder it is to have one of those mistakes. And so you're probably, for a while, going to see these models continuing to get larger in order to reduce the probability of these hallucinations. For anything where you need to reduce hallucinations, that'll be the way, but you can also do that by applying more compute. It just takes work and effort.

Auren Hoffman:

Why does the compute reduce the hallucinations?

Jonathan Ross:

I'll give you an example. What is the next word that I'm about to... say? Okay, everyone listening to this had the same word in their head. Now, if I asked you to complete this sentence: the second derivative of the square of the hyperbolic tangent is... I don't know, you don't know. So how is it that you don't know?

Auren Hoffman:

It's not a common thing that people talk about all the time.

Jonathan Ross:

Well, large language models are a lot like playing chess there's a sequence of tokens instead of a sequence of moves, and what happens is, at each point, the model assigns a probability distribution across all the potential tokens and then sorts them into highest probability first and lowest probability last.

Auren Hoffman:

Everyone listening to this is familiar with, like an autocomplete or something.

Jonathan Ross:

And then typically the algorithm will pick one of the top tokens. It doesn't always pick the top one, because there are reasons why you don't want to always just pick the most obvious answer, but it picks from up near the top. Now, when you do that, it's like playing the first move that comes to mind when you're playing chess. But if you think about it a little bit, and you play out one of those games a little bit, you end up coming up with better moves. This is like the shoulder hit in the second game of AlphaGo. It was actually a very low-ranked move, it would have been played in maybe one in 10,000 games, and it was only because the TPUs that it was running on had enough compute to go deeper and find it that it played it out.
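
To make the idea of a sorted next-token distribution concrete, here is a minimal, self-contained sketch in plain Python. The toy scores are made up for illustration; a real model produces them from its logits:

```python
import math
import random

# Toy next-token scores; in a real model these come from the network's logits.
logits = {"dog": 3.2, "cat": 3.0, "the": 2.1, "banana": 0.5}

def sample_next_token(logits, temperature=0.8, top_k=3):
    # Softmax with temperature: lower temperature sharpens the distribution.
    scaled = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(scaled.values())
    probs = {tok: val / total for tok, val in scaled.items()}
    # Sort highest probability first and keep only the top-k candidates.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    tokens, weights = zip(*top)
    # Pick one of the top tokens, weighted by probability: usually, but not
    # always, the single most likely one.
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(logits))
```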

Auren Hoffman:

That's where the compute comes in.

Jonathan Ross:

Yeah, because you can search a deeper space. The same is true of these tokens. If I ask you what is the second derivative of the square of the hyperbolic tangent and I force you to give an answer, you'll give me gobbledygook. But if you can go back and try a whole bunch of alternatives, then all of a sudden one of them is going to start to make sense. And sort of like, when you hear something that makes sense, like if I told you the answer, you'd say that sounds right. You may not know, but it sounds right. And that ability to detect a good answer is really what you need in order to be able to search. And so these models are very good at that sort of intuitive part. And then you layer on that search part and they get better. It's called beam search. That's the technique.
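
A minimal sketch of the beam search idea: instead of committing to the single most likely next token, keep several partial sequences alive, extend each, and keep the best-scoring ones. The toy probability function below is a stand-in for a real model's forward pass:

```python
import math

def next_token_probs(sequence):
    # Stand-in for a model forward pass: returns candidate next tokens
    # with probabilities for the given partial sequence.
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def beam_search(start, steps=3, beam_width=2):
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_token_probs(seq).items():
                # Exploring continuations lets a low-ranked early choice win
                # if what follows it scores well: the "search deeper" part.
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the best beam_width partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

print(beam_search("<start>"))
```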

Auren Hoffman:

Now your chips are built in the United States. Walk us through the economic advantage. Why are you doing that? Besides the fact that you're a patriot?

Jonathan Ross:

At the time, we made the decision for a lot of reasons. One was there was a little bit of concern about Taiwan at the time; that wasn't the top reason, but it factored in. The other was we were able to get a better team, because they were hungrier for the business to work with us. We work with GlobalFoundries and they fab our chips in upstate New York, they're packaged in Canada, and we build our systems in the US as well, and we're trying to do this fully North American supply chain. The advantage here, and why I would recommend that everyone else do this, especially if you're starting to come up and trying to become a successful entity, is it's not about being the solution if something happens. It's that something could happen, and if I work with you, then I don't have to worry about that eventuality. I reduce my risk. It's easier because everyone's in the same time zones.

Auren Hoffman:

You can iterate faster. In some ways, a one-to-N thing makes a lot more sense to do something that's 10,000 miles away, but for a zero-to-one, you're going to want it as close as possible.

Jonathan Ross:

On top of that, there is an element right now of concern geopolitically. This wasn't the case at the time. People are playing chess here on this and most supply chains are stretched across so many different countries. Any one of those countries could just veto your supply chain. That's a huge risk.

Auren Hoffman:

Also, it seems like a lot of these big chip designers are competing essentially to get slots for TSMC to manufacture them, and so they're kind of like bidding up each other to get the slot or using their own leverage to get there, and I assume that means TSMC only has a certain amount of capacity. So it means there's going to be winners and losers there.

Jonathan Ross:

To some extent, although TSMC is not the bottleneck for anyone. It's not the GPU, or in our case the LPU, itself that's the limiter. It's this technology called HBM, and HBM is the memory that is used in GPUs.

Auren Hoffman:

That's a new term for me. What does HBM stand for?

Jonathan Ross:

It stands for high bandwidth memory. Pretty much the entire world supply of HBM comes out of Korea; the two major manufacturers are SK Hynix and Samsung. Now, Micron in the US would love to become a manufacturer of scale for that, but they're generally considered the distant number three. And so what NVIDIA has done, and this brings us to probably a third thing that they've done that's made them very successful, is they're effectively a monopsony, which is the opposite of a monopoly. Instead of being a single seller, you are a single buyer of a lot of the cutting edge parts, so they can sign these long-term deals with Samsung, SK Hynix, and these other companies.

Jonathan Ross:

And they buy up all of the supply. But it's not just the HBM, it's also this thing called an interposer, or CoWoS, which is what the HBM goes on, which is in limited supply. NVIDIA is the largest buyer of super capacitors in the world, so they've got a lock on that too, and there's all of these things. And actually, it was interesting, someone noted the other day to me that AMD had revised their volume projections down, and that's weird because their demand is higher. It's because they can't get the stuff. They can't get the stuff, so they can't actually produce anything. So it doesn't matter whether you buy a GPU from AMD or NVIDIA, you're really buying the HBM from Samsung or SK Hynix. One of the things that we did that was very unusual from the beginning, and we made actual design decisions on this, was we eliminated any of the exotic technology, because we knew that we would never be able to get access to this stuff.

Auren Hoffman:

You want to be able to build as much commodity stuff as possible that's easy to access.

Jonathan Ross:

Correct. In fact, our next-generation chip, one version of it, actually had HBM in the design. We actually bought a million dollars of HBM, because you have to buy it way in advance, and that was going to be part of the production run. And when I started seeing that you have to buy it that far in advance, and it's way more expensive than the chips are, I'm just like, veto. No, we took it out. It really didn't add that much for the cost and the risk. But we have a very unusual architecture. GPUs require HBM. They're built around HBM.

Auren Hoffman:

Yeah, I mean, your chips are like 14 nanometer technology, which is several generations old, but in some ways that's a feature, not a bug.

Jonathan Ross:

And here's the other one: everyone's focused on FLOPS, which is the amount of compute that each chip is capable of, but actually the limiter tends to be the interconnect between the chips, not the FLOPS.

Auren Hoffman:

If you can have faster communication, then it doesn't really matter, right?

Jonathan Ross:

You can just add them up, and the best way to think about this, the reason that GPUs are so slow versus our LPUs when they're running a large language model, is a car factory analogy. If you need a million square feet of assembly line space for the cars to be produced, but you only have a warehouse that's one-tenth of that size, then what happens is you set up the one-tenth of the assembly line that you can fit, you run the cars through, and then you park them in a parking lot, you tear down the assembly line, you set up the next one-tenth, and then you run them through again. That's called batching. That's what GPUs do.

Jonathan Ross:

That's because they're waiting for that HBM, that high bandwidth memory, to feed them. Because we have this synchronous interconnect and we have hundreds of chips. It's actually like that full assembly line, and so a token is like the car and it just goes from beginning to end without ever having to wait for a memory load. And so not only did we get rid of the HBM and our supply chain issues, we actually made it faster. Now the concern that most people have. They look at it and they're like you need 792 chips to do this, others only need eight.

Auren Hoffman:

Yeah, but they're one hundredth of the price or whatever.

Jonathan Ross:

But actually each chip is only doing a very small part of the computation, so it then moves on very quickly. It's a little bit like saying, gosh, the factory costs so much, the cars that come out of it must be much more expensive than hand-built cars. And that's an intuition thing that people are really struggling with.

Auren Hoffman:

Now a lot of people think we've reached the limits of Moore's law or we're reaching those limits. Do you agree, and how does that affect things like chip development and some of these other types of things?

Jonathan Ross:

Moore's law was an amazing suggestion that we all followed and treated as real and did heroics to keep going. I think we should tweak the law a little bit so that it can still be true. Instead of it being about shrinking the size of the transistor so you could fit more, and it being an economic law, we should start talking about density per unit volume of 3D space, so that now we can start stacking chips and continue to get that density, and then, once we fully fill up a cube instead of a two-dimensional space, we'll find the next loophole in that law to continue the progress. But functionally it is not done. And what's changed is, instead of it being about the chip itself, it's become about the packaging that the chip is in, in order to scale that up. And that's sort of the next game that everyone is playing.

Auren Hoffman:

Just in general, as a consumer and a user of these LLMs, they do seem like they're improving at like a super fast rate. What do you expect to happen, let's say, over the next year?

Jonathan Ross:

So I made some predictions at the start of this year, and my top prediction was that by the end of the year there would be some deployments where there's effectively no hallucination. There's always going to be some hallucination, it'll never be perfect, but hallucinations in some cases won't be a thing because it will be solved so well. Not that everyone will have access to this, just that it will be considered a solvable problem, because someone will have solved it, maybe for a higher price, maybe because they're doing more compute, maybe they've got a bigger model. That's going to change things a lot.

Auren Hoffman:

I mean, already today you can define the question, you can refine the prompt, so you can often eliminate the hallucination. If you're a good prompt writer, you can reduce the hallucinations quite a bit.

Jonathan Ross:

A bunch of people have actually taken to rewriting the prompt into a better prompt and automatically doing that with LLMs, because it turns out LLMs are really good prompt engineers themselves.
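
A rough sketch of that automatic prompt-rewriting pattern, assuming an OpenAI-compatible client like the one in the earlier sketch; the wording of the rewriting instruction is just an example:

```python
def rewrite_then_answer(client, model, user_prompt):
    # First call: ask the model to turn a rough prompt into a clearer,
    # more specific one. Second call: answer the improved prompt.
    improved = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Rewrite this prompt to be clearer and more specific. "
                       "Return only the rewritten prompt:\n\n" + user_prompt,
        }],
    ).choices[0].message.content

    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": improved}],
    ).choices[0].message.content
    return improved, answer
```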

Auren Hoffman:

Perfect, yeah. So then there you go. You can make that better as well.

Jonathan Ross:

Now, this is a latency issue, because every time you do another step, it delays giving an answer. Talking our own book, we are much lower latency, so we're a big fan of these sorts of techniques. But they are commonly used and they help a lot. In fact, there's a technique called reflection, where you just ask the model: hey, on this output, how could you have made it better? Great, now do that.

Auren Hoffman:

Oh interesting. And then do it again, do it again, do it again.

Jonathan Ross:

And typically the rule of thumb is every three reflections is a generational model improvement, but it's to the power: if you want to get two generations, it's nine reflections, and if you want three generations, it's 27. This is why you see, in a lot of these papers showing state-of-the-art results, they'll have a number in there, I forget what they refer to it as, but it's shots or whatever. How many iterations, how many shots. They do a whole bunch of instances and then they just improve the results over and over again. So speed really matters for that.
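
A minimal sketch of that reflection loop: generate a draft, ask the model how the draft could be better, then have it apply its own critique, repeated a few times. The client and model are placeholders as in the earlier sketches, and each round adds latency:

```python
def reflect(client, model, task, rounds=3):
    # Initial draft.
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

    for _ in range(rounds):
        # Ask the model to critique its own output...
        critique = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                       f"Task: {task}\n\nDraft:\n{draft}\n\n"
                       "How could this draft be improved?"}],
        ).choices[0].message.content
        # ...then apply that critique; each extra round is another full pass.
        draft = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                       f"Task: {task}\n\nDraft:\n{draft}\n\nFeedback:\n{critique}\n\n"
                       "Rewrite the draft applying the feedback."}],
        ).choices[0].message.content
    return draft
```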

Auren Hoffman:

When you're doing your own prompts, is there something you do to get better results?

Jonathan Ross:

Things that work with people work with LLMs. Say you have a few employees: giving a very clear objective is generally very helpful with your employees. So if you say, write a story, it'll write a story, and that's going to disappoint you. But if you say, write a story that is exciting, or if you define what a hero's journey is, which is something where there's tension and there's an obstacle, and by the end of it the tension and obstacle that you started with turn out not to be important and there's a new one that is important, that's kind of the hero's journey arc. Define that.

Jonathan Ross:

You say, now make a hero's journey, and it'll do that much better. But then also, just asking it to do an outline before it gets started, and then having it turn that outline into output, that helps a lot. Always think: what could I do to get a human being to do this better? That'll help. One thing that helps a lot is asking it to imagine what an answer would look like, and all of a sudden the answers get much better.
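
A short sketch of that outline-first pattern: state a clear objective, ask for an outline, then have the model turn its own outline into the final output (same placeholder client and model as above):

```python
def outline_then_write(client, model, objective):
    # Step 1: ask for an outline against a clearly stated objective.
    outline = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Objective: {objective}\n"
                   "Write a short outline, five to seven bullet points, no prose yet."}],
    ).choices[0].message.content

    # Step 2: turn the outline into the finished piece.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Objective: {objective}\n\nOutline:\n{outline}\n\n"
                   "Now write the full piece following this outline."}],
    ).choices[0].message.content
```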

Auren Hoffman:

Interesting. I've got to try that with the humans I work with too.

Jonathan Ross:

Exactly. So what I was about to say is I learned that with LLMs, and then I backported it to people, and that actually works very well.

Auren Hoffman:

Now you're in a very interesting position as a CEO. Your employees are some of the most in-demand, most aggressively recruited people in the world. How do you run a company when you're in that situation?

Jonathan Ross:

We've always been in that situation from the start, so we're just used to that. And again, it goes to Hamilton Helmer's Seven Powers. If you focus purely on economics, and it's a commodity thing, and it's like, I'm going to pay you more, you're going to lose. So one thing is, anyone that we hire, we always try to be lower than the highest offer that they get elsewhere, because otherwise we don't have enough signal.

Auren Hoffman:

You don't want the mercenary, you want the person who really believes in what you're doing, exactly.

Jonathan Ross:

The next thing up is having really great talent. People generally want to stick around really great talent. I think everyone thinks that their talent is great. I remember talking to CEOs who were particularly proud of their talent density, and then they had an opportunity to meet some of our people, and they're like, I thought I knew what talent density was; now I know. When you just hire really amazing people, other really amazing people want to work with them. It's really hard to pull them away.

Auren Hoffman:

The market rate for these employees has gone up dramatically over the last, let's say, seven years. It's probably gone up over 20% a year every year, whereas for the average software engineer, at least in the last three years, I believe it's gone down. For most of these more AI-related engineers, even if they're willing to take a reduced salary to come work for you, you still have to do some sort of raises and match it over time too.

Jonathan Ross:

The other thing is we don't do AI, we're picks and shovels. We actually have very few people who actually do AI at Groq. We're in a different market.

Auren Hoffman:

You're more competing with, like the NVIDIA or these other types of places.

Jonathan Ross:

That's right. And so we do have people here, but one of the opportunities is, when you're near this stuff and you're helping everyone get it to work, you get to learn. And so if you come in as a software engineer or hardware engineer, you get to interface with this in a way that you wouldn't otherwise, because we actually get to work with the best of the best as users and customers of our stuff. We might actually have more of a front row seat to what matters to them and what that engineering looks like than a lot of others, but we're not the ones doing it ourselves, and so that takes a lot of the competitive pressure off.

Auren Hoffman:

All right, two last questions we ask all of our guests. First is: what is a conspiracy theory that you believe?

Jonathan Ross:

I am the world's worst conspiracy theory believer.

Auren Hoffman:

I would have thought you'd be pretty good at it.

Jonathan Ross:

So the thing about conspiracy theories is most people who are conspiracy theory junkies can believe two things that are incongruent. One of the things that I do, and that we hire for at Groq, is what we call reality quotient, and we've got a whole bunch of levels of how you improve on reality quotient. But the start is what we call a malleable mindset: when the facts change, your mind changes. That doesn't work well with conspiracy theories, because conspiracy theories get debunked and then you're like, oh, I was wrong.

Jonathan Ross:

But I would say it's more that I have some weird beliefs, and we have some weird beliefs that have come true; I don't know that they'd be conspiracies. One of them, shockingly, was that we thought it was obvious that inference would start to become a bigger part of the market. Everyone thought that we were nuts, it's going to be training, and we're like, but you spend money on training and you make money on inference. Of course inference is going to get larger. I don't know, I'm bad at conspiracy theories, I'm sorry.

Auren Hoffman:

Well, that's great. Last question we ask all of our guests what conventional wisdom or advice do you think is generally bad advice?

Jonathan Ross:

I hate giving advice, because people don't like to take advice, and so what I will say is the thing that has been most advantageous for myself and others that I've known is to try and be more fearless. And the problem is not being more fearless, it's recognizing that you've become afraid and that's what's stopping you. I'm in meetings all the time where someone says we shouldn't do this for this reason, but the reality is they don't want to do it because they're afraid. There are also groups of people who have no fear and get in trouble all the time, but the folks that you and I interact with a lot are the kind who are more afraid.

Auren Hoffman:

Is that true or do you think maybe a typical founder personality might?

Jonathan Ross:

A founder is less afraid, but I mean engineers, because if you make a mistake... so you're more cautious, you want to get it all right. It's sort of a Nassim Taleb-ism: the value of the risk isn't really priced in on high-risk things, and so you should really be going after higher-risk things, because for the low-risk things in life, that real risk that's under there for everything isn't priced in. An example I like to give is everyone has to have fire insurance on any property that they rent or whatever, but no one needs pandemic insurance. Yet over the last 200 years, the average person working in an office building has been out of the office more often because of pandemics than because of fires in buildings. You've got to look at things that are riskier and do those, because then the price to value is a little better. Startups are one of the best, lowest-risk things you can do, because if you're in a big company, you could be in a layoff.

Auren Hoffman:

You don't grow as fast. You know all these other types of things.

Jonathan Ross:

So look for the things that everyone else is afraid of and go do those things, and then all the things that no one else is afraid of, be a little fearful of those. I think that's a Munger-ism, and then you'll be much better off.

Auren Hoffman:

That's great. Thank you, Jonathan Ross, for joining us on World of DaaS. This has been really interesting, and, by the way, I follow you at @JonathanRoss321 on Twitter. I definitely encourage our listeners to engage with you there. I learn a ton, so this has been super interesting.

Jonathan Ross:

Thanks and thanks for having me.

Auren Hoffman:

I appreciate it. If you're a super data nerd, go to worldofdaas.com, that's D-A-A-S, worldofdaas.com, and sign up for our weekly Data as a Service Roundup newsletter. Thanks for listening. If you enjoyed the show, consider rating this podcast and leaving a review. For more World of DaaS (DaaS is D-A-A-S), you can subscribe on Spotify or Apple Podcasts or anywhere you get your podcasts, and also check out YouTube for videos. You can find me on Twitter at @auren, that's A-U-R-E-N, and we'd love to hear from you. World of DaaS is brought to you by SafeGraph. SafeGraph is geospatial data for physical places. Check it out at safegraph.com. And by Flex Capital. Flex Capital invests in data companies like those we talk about at World of DaaS. Check it out at flexcapital.com.