Infinite ML with Prateek Joshi

Databases for AI Workloads

July 23, 2024 Prateek Joshi

Tim Tully is a partner at Menlo Ventures, a VC firm that has invested in companies like Uber, Anthropic, Pinecone, Benchling, Chime, Carta, Recursion, and more. He was previously the CTO of Splunk, a publicly traded company that was acquired by Cisco for $28 billion. Prior to that, he was the VP of Engineering at Yahoo for 14 years.

Tim's favorite book: Infinite Jest (Author: David Foster Wallace)

(00:01) Introduction
(00:07) Evolution of Databases
(03:17) Enduring Business Models in Data Management
(04:41) Challenges and Trade-offs in Database Choices
(06:20) Modern Database Architecture
(09:06) Separation of Storage and Compute
(10:35) Role of Indexing in LLM Applications
(13:20) Handling Different Types of Data in Databases
(14:50) Distributed Databases Explained
(16:20) Real-time Data Handling and Requirements
(18:53) Architecting Data Infrastructure for AI
(21:29) ETL in Modern Data Infrastructure
(24:53) AI's Role in Database Optimization
(27:17) Network Architecture
(30:13) Hardware Improvements and Database Performance
(33:35) Technological Breakthroughs and Investment Opportunities
(35:11) Rapid Fire Round

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Transcript

Prateek Joshi (00:01.717)
Tim, thank you so much for joining me today.

Tim Tully (00:04.666)
Thanks for having me on, I really appreciate it.

Prateek Joshi (00:07.829)
Let's start with a brief history of databases. You've built and shipped some of the largest infra-heavy products of the last two decades or so. Can you talk about how databases have evolved in the last 30 years?

Tim Tully (00:28.698)
I mean, it's a long, long history. You really go back to the 1960s to start to think about the beginning of relational databases. That continued on through the 70s and then the 80s, and you come across things like DB2 and Informix and then eventually Oracle. And that's all great for transactional processing, largely. And then in the 90s,

Prateek Joshi (00:31.349)
Right.

Tim Tully (00:54.49)
OLAP comes around and suddenly data warehouses come into vogue. And then you really had the advent of two types of databases: there's OLTP databases and there's OLAP databases, one for transactional processing and the other for analytics, respectively. That continued on. And then you had applications and things that sit on top of it: you had the MicroStrategys of the world, the Crystal Reports, and so on and so forth, which are really apps that sit on the database. And then probably the largest

inflection point, or really two large inflection points. One is large-scale data processing, i.e. MapReduce and Hadoop, right? And that's obviously not a database; it's more of a data processing paradigm. That was really transformative, because suddenly you could go from processing megabytes, or potentially gigabytes, in a query in an OLAP system to terabytes at a time, and eventually petabytes. And that was really massive. And obviously I was there for that with Hadoop, and

I was fortunate enough to manage the teams that not only had a lot to do with creating Hadoop, but were largely the consumers of it also, because it was all at Yahoo, where I spent a lot of time. That evolved heavily until the advent of Spark, which came around 2012, 2013 out of UC Berkeley. And that really changed the MapReduce paradigm into more of a graph processing paradigm. And that was huge, because it made the queries much, much faster.

But also, really, it made the expressibility of the data problem much more accessible and natural. You can move from writing a MapReduce job, which is not a natural thing to do, because it feels like you're pushing a square peg through a round hole to get an expression of the problem you're trying to solve in MapReduce, to writing it in just code. Effectively, that's what Spark opened up: you could write it in code. And so that obviously led to Databricks. I think the biggest

change since then really is cloud databases. I'm lucky enough, again, to be involved in Neon, which is a serverless Postgres company. But Snowflake, along with AWS Aurora, really put cloud databases on the map. And Snowflake, really what they did is they separated storage and compute. That was the big thing.

Prateek Joshi (03:17.781)
Right. And as you said, if you look at all the various sub-sectors, there's ebbs and flows, there's been peaks, there's been valleys, but the work of managing data has been one of the most enduring business models. In any given decade, there's a very, very strong data company: Oracle, Snowflake, Databricks, and all of them have built giant businesses. So

Obviously it's a very, very enduring business model. What are the most common challenges that the average company encounters that make them want to keep spending money on database management? How do these data companies keep attracting more and more revenue over the decades?

Tim Tully (04:06.938)
I mean, it's the heart and soul of any business, really. Every single company, no matter who it is, you can name any company and I'll tell you that it probably runs on data at its core. That's the thing: everything, every application, every internal process, has data behind the scenes to drive it. And so if there's a way to do it better, you're going to see spend. If there's a way to do it more cost-effectively, you're going to see the spend move over there. So it's just...

Prateek Joshi (04:14.773)
Right. Right.

Tim Tully (04:36.026)
It's nonstop and that's what's attractive about it for me as an investor.

Prateek Joshi (04:41.813)
And if you look across different databases, obviously, there's no one magic database that serves every customer. So clearly, different customers have different needs, and they spend money on different products. So when you think about the basic properties of databases, what trade-offs do different databases make when it comes to handling data? And what is attractive to what type of buyer?

Tim Tully (05:08.858)
And the biggest trade-off is really the type of data that you're storing. That's what determines the type of database that you're using. The way I think of it is databases are kind of tools in a toolbox. There's not one catch-all database that can solve every single problem. I would say Postgres is getting better at that with its plugin architecture, and that's one of the reasons why Postgres is taking off so much. But the biggest trade-off there is really

capabilities in terms of how it can handle different types of data, really. And then the other trade-off that's big right now is just the architecture of the database itself. So, largely for cost purposes, you see what I mentioned earlier, the separation of storage and compute. Really what that is, is a way to drive margin in the business of the cloud vendor. That's really what it is. It's not because you're trying to make the database faster. It would be faster if all the data was in memory

and accessible for a query, right? Keeping it on Amazon S3 is obviously not the fastest way to get to data. It's a margin driver. And so you have trade-offs like that.

Prateek Joshi (06:20.341)
As we sit here in mid-2024, what are the key components of a modern database? What goes into the architecture? And also, for example, Postgres has been around for a while. What makes you look at a company and go, that's an angle, that could be a big business? Because databases are so established at this point, what are the angles of...

Tim Tully (06:46.234)
It's a great question. You're asking a broader question, which is, what makes something interesting as an investment? I'll start there and then we'll go to why a database. I look for people and product. So I look for great, great founders who I feel wake up every morning and want to run through a wall. I'm sure you do as well. And then you look for great products. Now that goes to the question that you asked.

What makes a great database these days? DevX matters a lot to me right now. That could be an answer for really any investment that you or I make right now, but especially for databases: you want it to be easy to stand up, easy to use. Uptime matters a lot. The margin on the thing matters a lot; these things can be very expensive to manage. And I just mentioned uptime: five nines is probably the lower bound of what's acceptable

right now, so that matters a ton. And then being able to handle a variety of workloads, really. I'm not that interested, I would say, in a single database that can only do one small thing. And that's why we keep going back to Postgres a lot: Postgres can serve a variety and multitude of workloads, and that's really fascinating as well. But I'll mention one of my investments, shamelessly: Neon. They have a serverless architecture that can spin up

compute on demand as needed as the workload increases, and they separated the storage and compute in Postgres. It's really fascinating. And it has something called branching, which lets you branch the database. Branching can do things like be a backup for you, or it can be really nice for teams developing on it, so that your developers can go to the database, effectively fork it, get a snapshot of the database, and start driving against what was production one second ago.

There's some really fascinating features that folks are coming up with right now that really create some differentiation.
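
To make the branching idea concrete, here is a minimal copy-on-write sketch. The BranchingStore class and its methods are illustrative toys, not Neon's actual storage engine or API.

```python
# Toy copy-on-write branching: a branch starts as a zero-copy view of its
# parent and only stores the pages it overwrites.
class BranchingStore:
    def __init__(self, parent=None):
        self.pages = {}      # page_id -> bytes, only pages written here
        self.parent = parent

    def read(self, page_id):
        # Walk up the branch chain until some version of the page is found.
        if page_id in self.pages:
            return self.pages[page_id]
        if self.parent is not None:
            return self.parent.read(page_id)
        raise KeyError(page_id)

    def write(self, page_id, data):
        # Writes land only in this branch; the parent stays untouched.
        self.pages[page_id] = data

    def branch(self):
        # O(1) "fork": the child shares all parent pages by reference.
        return BranchingStore(parent=self)

prod = BranchingStore()
prod.write("users/1", b"alice")
dev = prod.branch()            # snapshot of production "one second ago"
dev.write("users/1", b"bob")   # experiment freely on the branch
assert prod.read("users/1") == b"alice"  # production is unaffected
```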

Prateek Joshi (08:46.101)
And you mentioned the separation of storage and compute. Can you elaborate on that? And especially in the context of running AI applications, can you just talk about how that happens in practice and what does that give the developer? Like what superpower does that give the developer?

Tim Tully (09:06.49)
Yeah, so separation of storage and compute is basically the idea that the compute and the storage of the database don't reside on the same nodes, right? The storage has been abstracted away to live most likely in an object store, the most famous one being Amazon S3. And you can effectively page buckets in and out on demand that have rows that you need for specific tables. And so what that does is it drives

Again, we keep talking about margin. It drives margin on the database business, because storage on S3 is much, much cheaper than keeping it on local disk or in RAM where the database is actually running the queries itself. And so that's separation of storage and compute.
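
A rough sketch of that separation, with a dictionary standing in for the object store and a small local cache on the compute node; all the names here are illustrative.

```python
# Toy model of separated storage and compute: the compute node keeps a
# small cache of hot pages and pulls everything else from an object store.
OBJECT_STORE = {f"table/part-{i}": [f"row-{i}-{j}" for j in range(100)]
                for i in range(10)}   # stand-in for S3 objects

class ComputeNode:
    def __init__(self, cache_size=3):
        self.cache = {}               # hot pages held in local memory
        self.cache_size = cache_size

    def fetch(self, key):
        if key in self.cache:         # fast path: already local
            return self.cache[key]
        rows = OBJECT_STORE[key]      # slow path: a network read in real life
        if len(self.cache) >= self.cache_size:
            self.cache.pop(next(iter(self.cache)))  # naive eviction
        self.cache[key] = rows        # page the data in for next time
        return rows

node = ComputeNode()
node.fetch("table/part-0")   # first read pays the "network" cost
node.fetch("table/part-0")   # repeat read is served from the cache
```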

Prateek Joshi (09:51.861)
In the LLM era, so many people are using LLM products, and all of them need data infrastructure. So I'm going to touch upon a couple of different flavors of LLM products and talk about databases in that context. The first context is people go to LLM products to get answers to questions. It's become a good business for Perplexity, as a good example, but many other

companies are building this feature where you come in, you type a question, and it gives you a good answer. Can you discuss the role of indexing and how it affects query performance in this context? And what should a good database do in this case?

Tim Tully (10:35.482)
I mean, there's two places in that sort of text-to-text LLM use case where indexing will show up. One is fetching and retrieving content to push into a query, and that's really a vector database. And then there's the application-space aspects of any normal application, be it LLM or not. So what we can talk about is the indexing piece of your question around vector databases, which is a very specific kind of database.

Effectively, what's happening under the hood when you have these types of applications is RAG, retrieval-augmented generation. What that's doing, to avoid hallucinations, is pulling or retrieving content associated with the question that's being asked, using vector search or approximate nearest neighbor underneath the hood inside of the index to pull content relevant to the question, and using that to form a smarter prompt to have the LLM write a response for you.
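
As a rough illustration of that flow, here is a brute-force sketch; embed() is a fake stand-in for a real embedding model (a real one maps similar text to nearby vectors), and the final LLM call is left as a comment.

```python
import numpy as np

# Minimal RAG sketch: embed the question, find the nearest stored chunks,
# and build a grounded prompt for the LLM.
def embed(text: str) -> np.ndarray:
    # Pretend embedding model: deterministic per process run, not semantic.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

docs = ["Postgres supports vector search via pgvector.",
        "Snowflake separated storage and compute.",
        "Hadoop popularized MapReduce at Yahoo."]
index = np.stack([embed(d) for d in docs])   # the "vector database"

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    scores = index @ q                       # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "Who separated storage and compute?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
# response = ask_llm(prompt)  # hypothetical: hand the grounded prompt to the LLM
```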

Prateek Joshi (11:33.941)
Right. And another flavor of an LLM application is people come in and they just want to generate text. No RAG, they just want to come in and say, hey, generate a paragraph in the style of William Shakespeare, for example, or I want to write marketing copy for a credit card product. So in that case,

what happens? Is a generic database okay? Do we need specialized databases to handle this part? Because there's no database to go through, no RAG; the model has been pre-trained on a very large corpus of text, and it's going to generate an answer without referring to any internal docs. So what happens here?

Tim Tully (12:43.386)
Yeah, it depends. It depends on your architecture, right? If you were doing pure RAG, there's a possibility you'd have a vector database being used right next to it. But again, we come back to Postgres, which is this amazing Swiss army knife. Postgres has something called pgvector, which is a plugin that goes into Postgres and lets you do vector search as well. So you could have Postgres simultaneously doing the vector approximate nearest neighbor search for you, or also

fetching content from the database that's been stored, not using vector search to answer the question as well. So that's what makes this thing so great.
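
In code, the pgvector pattern looks roughly like the sketch below; the connection string, table, and vectors are made up, and this assumes psycopg 3 plus a Postgres instance with the extension available.

```python
import psycopg  # assumes psycopg 3 and Postgres with pgvector installed

# Hedged sketch of vector search inside Postgres via pgvector; the
# connection string, schema, and data are illustrative, not a real setup.
with psycopg.connect("dbname=app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("CREATE TABLE IF NOT EXISTS items "
                 "(id serial PRIMARY KEY, body text, embedding vector(3))")
    conn.execute("INSERT INTO items (body, embedding) VALUES (%s, %s::vector)",
                 ("hello", "[0.1, 0.2, 0.3]"))
    # `<->` is pgvector's L2-distance operator; ORDER BY ... LIMIT k is the
    # nearest-neighbor query.
    rows = conn.execute(
        "SELECT body FROM items ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.25]",),
    ).fetchall()
```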

Prateek Joshi (13:20.405)
Right. And earlier you talked about storing different types of data: there's text, there's images, video, LIDAR data, so many different types. Now, when it comes to choosing a type of database, does a vector database abstract that away, so that the user doesn't have to worry about, for images I need to use one database, for text I need to use another? And part B, on a broader level:

If you are a developer and you are working with text, images and video, what is the right thing to do in terms of data infra here?

Tim Tully (13:59.482)
Well, I mean, a vector database doesn't really care what you're storing inside of it, right? I mean, really, what is a vector database? It's an index. I mean, in the most semantically truthful way to look at it, it's an index of vector embeddings that points to some arbitrary piece of metadata. And that metadata can be a JSON blob. It could be a piece of text. It could be a base64-encoded image if you wanted. It doesn't really matter.

And then what was the second part of your question?

Prateek Joshi (14:31.221)
Yeah, if you're a developer and you're working with text and images and videos and LIDAR and you're building data infrastructure to handle all types of data. So I guess you already answered the question in the sense that vector databases can actually serve all your needs in this case. Right? Yeah.

Tim Tully (14:50.842)
Yeah, I mean, it doesn't know. In fact, it has no idea what you're storing, right? All it knows about is embeddings pointing to some blob, effectively.
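
That point is easy to show in a few lines: the "index" below is just embeddings paired with opaque blobs, searched by brute force; the data and dimensions are made up.

```python
import base64, json
import numpy as np

# An index is just embeddings pointing at opaque blobs; it never looks inside.
entries = [
    (np.random.rand(8), json.dumps({"type": "doc", "id": 42})),    # JSON blob
    (np.random.rand(8), "plain text payload"),                     # raw text
    (np.random.rand(8), base64.b64encode(b"\x89PNG...").decode()), # image bytes
]

def nearest(query):
    # Brute-force nearest neighbor; real systems use ANN (HNSW, IVF, etc.)
    best = min(entries, key=lambda e: np.linalg.norm(e[0] - query))
    return best[1]   # return whatever blob the winning embedding points to

print(nearest(np.random.rand(8)))
```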

Prateek Joshi (14:59.381)
Right, right. And let's move to distributed databases. So just to start off, can you quickly explain your view on distributed databases and also what are they good at and what are they okay at?

Tim Tully (15:16.506)
Yeah, a distributed database is effectively a multi-head cluster that can execute a query in parallel. Going back to the separation of storage and compute paradigm, imagine one compute node suddenly scaling to five or ten, to be able to parallelize a data-intensive query and make it happen much, much faster.
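
A minimal sketch of that fan-out and merge pattern, with in-memory lists standing in for shards and OS processes standing in for compute nodes; all names are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

# Each "compute node" scans its shard in parallel; the coordinator merges
# the partial aggregates into the final result.
def scan_shard(shard):
    # Runs on one node: a local SUM over that shard's rows.
    return sum(row["amount"] for row in shard)

shards = [[{"amount": i + n} for i in range(1000)] for n in range(5)]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=5) as pool:
        partials = list(pool.map(scan_shard, shards))  # fan out to workers
    total = sum(partials)                              # merge step
    print(total)
```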

Prateek Joshi (15:45.141)
Right. And when you look at many applications today, AI applications, real-time data streams are becoming more and more important. And handling real-time data is a bit different than handling static data. So how do you view the database requirements when you're dealing with real-time data? And also,

How would you architect such a system here?

Tim Tully (16:20.33)
Yeah, the other aspect to distributed databases is replication, right? It's not just distributed within a given cluster like I mentioned; there are also cases where you have massive worldwide databases that need to be in sync. And the hardest problem with the real-time data aspect is consistency. There's this so-called CAP theorem, and the idea there is you can only have two of the three, and one of them is consistency. And so being able to have consistent

Prateek Joshi (16:39.957)
Right.

Tim Tully (16:49.69)
reads and writes out of that database is really, really hard in a distributed database when it's worldwide, because you have distance problems to overcome. And then there are obviously data skew problems that come with that. So that's probably the biggest challenge: getting accurate, reliable data out of it. Although I'll give you a hot take, which is, increasingly I sort of question the need for real-time data outside of transaction processing, for analytical processing.

I've tried in my career many, many times to build real-time analytics systems, down to the millisecond or sub-second level. And what I found over and over again is: one, it's hard; two, it's solvable; but three, people don't care. It's cool and it's a neat engineering problem, and again, it's hard. But what you find from the users who actually consume the data is that people who make decisions on data

need to see patterns; human beings make decisions on patterns of behavior and evidence. If I see some anomalous blip in my real-time analytical database that happened 500 milliseconds ago, I'm not likely to take action against that, right? I need to see a couple more examples of some kind of pattern there to make a decision. Ultimately, what I've learned in my long career is that five minutes is probably good enough for real-time analytics processing.

Prateek Joshi (18:21.845)
That's a great insight, a practical insight. As you said, it's a cool engineering problem and it's fun to solve. But at the end of the day, the users seem fine either way, sub-second or every five minutes. And if they care equally about the two, then you'd rather do once every five minutes, right?

Tim Tully (18:40.89)
Yeah, and you just move to mini-batching or sliding-window sort of processing, and that's it. It's unfortunate, because it's a hard and fun problem to solve, which I've done, unfortunately, many times.
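
For illustration, a tiny tumbling-window sketch of that mini-batch approach: events are bucketed into five-minute windows and aggregated per window; the events here are made up.

```python
from collections import defaultdict

WINDOW = 5 * 60  # five-minute tumbling windows, in seconds

def tumbling_counts(events):
    # events: iterable of (unix_ts, key); returns {window_start: {key: count}}
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[ts - ts % WINDOW][key] += 1   # bucket into its window
    return {start: dict(counts) for start, counts in windows.items()}

events = [(0, "login"), (10, "login"), (301, "error"), (360, "login")]
print(tumbling_counts(events))
# {0: {'login': 2}, 300: {'error': 1, 'login': 1}}
```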

Prateek Joshi (18:46.133)
Yeah, yeah.

Prateek Joshi (18:53.365)
Right. And again, this is maybe a thought experiment. There are so many great database founders, and historically we have had amazing database products. If you had to sit down and design a data infrastructure product that is hyper-optimized for AI applications, that's AI-native, and has all the features an AI developer would want, what's

on that list of, here's my dream database, here's all the things I wish it would do?

Tim Tully (19:29.786)
Yeah, I'll just up-level the question and tell you what the ultimate database would look like. It'd be able to do all the aspects of database querying at a sort of A-plus level: vector search, OLAP, OLTP, geospatial querying, key-value lookups for NoSQL-style processing. I mean, you ask all the great questions. It'd be awesome at real-time indexing.

It'd be a worldwide distributed database. It'd be consistent. It'd have six nines of uptime. It's sort of the aggregation of everything we've been talking about for the last 30 minutes, all rolled up into a single system. It'd be completely fault-tolerant, never goes down. And the level of concurrency that it can handle would be very, very high.

Prateek Joshi (20:28.181)
Right.

Tim Tully (20:29.114)
Probably nothing that is shocking necessarily, but the aggregation, or the superset, of all great databases.

Prateek Joshi (20:37.653)
I think the thing is that each individual item on the list may not be shocking, but making it all work together in a single product is a fairly hard problem. That's a great point. Also, let's switch to ETL for a second. Historically, for example, Snowflake built a big business, and Fivetran came along and said, hey, we'll do ETL and it's going to work on these databases.

One, in this next generation of database products, should ETL be a feature of the data infra offering? If yes, why; if no, why not? And two, maybe we'll start there: what do you think about ETL being a separate business versus being part of the data infra?

Tim Tully (21:29.434)
I mean, ETL is a hard problem. It definitely should be a separate business. These are nasty, hairy problems around guaranteeing that your data is clean. And there are a lot of aspects to it: there's the actual execution of the jobs, the tracking of the jobs, the sourcing of the data, the writing of the data to different destinations. You have to have a bunch of connectors to read and write data. Just those integrations alone make it worthwhile.

And again, we go back to that question we talked about earlier: ETL is the backbone of any enterprise, right? That's the sourcing of the data, that's guaranteeing the data is not dirty, because as you know, garbage in, garbage out. So it absolutely is a standalone business, for sure. Although I do sort of wonder whether, over time, the Snowflakes of the world increasingly move in that direction.

Why they haven't, I couldn't exactly articulate. I've asked myself that question for years now. But that's certainly something I would be thinking about if I was there.
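
A minimal extract-transform-load sketch of the job anatomy described here; the file, schema, and cleaning rule are all hypothetical.

```python
import csv, sqlite3

# Minimal ETL loop: source the data, clean it so garbage doesn't get in,
# and load it through a destination connector.
def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)       # source connector: CSV reader

def transform(rows):
    for row in rows:
        if not row.get("email"):           # guarantee the data isn't dirty
            continue                       # drop incomplete records
        row["email"] = row["email"].strip().lower()
        yield row

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS users (email TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)",
                     [(r["email"],) for r in rows])
    conn.commit()                          # destination connector: SQLite

conn = sqlite3.connect(":memory:")
# load(transform(extract("users.csv")), conn)  # wire the stages together
```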

Prateek Joshi (22:36.885)
Right. I think the integrations alone, as you said, make it a standalone business. It's a lot of complex and sometimes boring work, but it's extremely important, and clearly people are willing to pay good money to use ETL software. So in the AI-native world, should we have different ETL software for

different modalities of data? Like, hey, the work is big enough that text should have its own ETL, and images, and video. Because earlier you mentioned vector databases have abstracted that away: regardless of text or image, it's all fine, you convert it to a vector and go from there. But ETL, can that be abstracted away in ETL software? Or do we need different software for different modalities?

Tim Tully (23:30.682)
I think you need different software for different modalities. I'm fortunate enough to be an investor in one of these companies, called Unstructured. Effectively, what they do is ETL for AI. They read and source data from a multitude of sources, but it's a different kind of data than traditional ETL. Traditional ETL is usually CSV files, text files, rows and columns, right? Very highly structured data.

Unstructured, as the name sort of implies, operates against unstructured data, right? This is unstructured text, images, video, PDFs of any kind, things full of tables and images. It has to know how to handle that, and it has to know how to apply chunking strategies against that data to create the vector embeddings that go to vector databases. So it's a different kind of job that requires a different set of specialties for a different kind of data. And I think that's enough to say, hey, this is a different business.
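
As an illustration of one such chunking strategy, a fixed-size-with-overlap sketch; embed() and vector_db are hypothetical stand-ins, not any particular product's API.

```python
# One simple chunking strategy: fixed-size character chunks with overlap,
# ready to be embedded and written to a vector store.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap   # overlap keeps context across chunk boundaries
    return [text[start:start + size] for start in range(0, len(text), step)]

pages = "text pulled out of a PDF full of tables and images..."  # made up
for piece in chunk(pages):
    # vector = embed(piece)             # hypothetical embedding call
    # vector_db.upsert(vector, piece)   # hypothetical vector store client
    pass
```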

Prateek Joshi (24:26.485)
And we've been talking about using databases to build AI applications, and that's great, it's very useful, and people are willing to pay good money to use those applications. Now, in the database itself, where can AI be infused to make a database faster, better, cheaper? Where do you think AI can play a role inside the database itself?

Tim Tully (24:53.338)
I mean, people have been thinking about this for a while. There are some simple examples, like automatically indexing columns based on the relational structure of the tables: should there be a foreign key in this column because of some relation between two tables? That has existed for quite some time. I think we can do better with query optimization as well.

On the flip side, it's also hard, I would say, because you don't want to get too creative inside the database; what you want is reliability and consistency and truth coming out of it. Anytime you have some kind of non-deterministic process running against what is effectively fact data, it's a little bit tricky to apply. And so that's why I think you see AI still being applied around the edges: hey, go index this column,

it'd be smart if you did this. But will we see agentic workloads running inside the database? I don't know. Maybe triggers and things like that can be smarter. Maybe PL/SQL jobs will have more AI inside of Oracle moving forward. Who knows? But I like my databases to be trustworthy and reliable and deterministic.

And so I don't want to see a lot of sort of LLM magic sprinkled inside of the database itself, at least right now.

Prateek Joshi (26:25.493)
I think that's a great point. For many people, or at least all developers, a database is almost like a calculator, meaning we don't want any probabilistic behavior. It's a very deterministic thing: two plus three should always be five. I don't want it to probabilistically guess what the answer is. Similarly with a database, if I want to retrieve an item, I want to retrieve that item. That's it. There's no...

Tim Tully (26:45.658)
Right. Yeah.

Prateek Joshi (26:51.957)
no likelihood, right? It's a hundred percent likelihood that it...

Tim Tully (26:56.282)
Yeah, I think that's a far more articulate version of what I was trying to say. Yeah.

Prateek Joshi (27:00.565)
Yeah, that's a great point. 

Prateek Joshi (27:40.181)
Right. Okay, moving on to the next topic here: network architecture. I think this is not an area that's talked about too much when it comes to data infra, but what impact does network architecture have on the data infrastructure that you're setting up?

Tim Tully (28:02.49)
It's massive, right? The performance of the database is entirely predicated on read and write speed against it. And so again, our favorite friend, storage and compute separation: every modern database these days is following that Snowflake paradigm of separated storage and compute. Unless you've been querying the same data over and over again, it's probably not in the page cache on the compute node, so you're going to be pulling rows across the network. And so the network architecture is just

Prateek Joshi (28:25.461)
Right, right.

Tim Tully (28:33.161)
paramount to the performance of the database. It's everything. It's just as important as your ability to schedule and execute queries.

Prateek Joshi (28:45.469)
And in terms of network architecture, is that something developers handle in-house? Should it be productized, or is it part of the data infra provider? Where does the responsibility sit within this stack?

Tim Tully (28:59.482)
Yeah, in terms of the database itself, it's completely abstracted away from developers, right? Where the rubber hits the road for them is the EC2 machine that's actually going to execute queries against the database: is it in the same VPC, right? Those are the network questions that get asked, mostly for security reasons, but also, really, you want

to be in the same VPC for performance reasons as well. But in terms of thinking about whether the compute node is on the same switch as the networking, no, you don't see that as a developer. It's completely opaque to you.

Prateek Joshi (29:43.637)
Right. We've been talking about software all this time, so maybe quickly touch upon hardware. Obviously, hardware keeps getting better and better, and people can leverage hardware improvements to make some of these products faster, better, cheaper. Is that something you think about? Where does the next big improvement come from: hardware or software improvements in databases?

Tim Tully (30:13.402)
I mean, really, it's the ability to feed the CPU fast enough. And so ultimately, that comes down to the PCI bus on the motherboard. And so PCI bus speed improvements over the years have been pretty great. Is it to the point where you're able to feed the CPU fast enough from the network? No, it's not. Network speeds are getting better and better and better, but the bus speed is not at the rate that can feed the CPU fast enough. And so really, that's the ultimate bottleneck right now.

Prateek Joshi (30:42.325)
Right, that's a great point. Especially on the training side, people have been building new architectures: Groq is doing its thing, and recently Etched released an ASIC just for transformers, meaning it doesn't do anything else, but if you want to run a transformer model, it's like the world's fastest AI chip. So when you think about

the data movement work for training purposes, and how much money and compute people are spending on training, what sort of data optimizations are you seeing, especially on the training side? Meaning, hey, we'll help you train better, and this is how we're doing it. Are you seeing any innovation on that front?

Tim Tully (31:32.922)
No, I'm not really seeing any innovation there. Although on the data movement side, the earlier part of your question: I think one of the really sneaky acquisitions that NVIDIA made, one that is under the radar and I think is going to be absolutely killer down the road, is Mellanox. Nobody talks about this, and they should. Some of the providers you mentioned, the Groqs and so on and so forth, they're, if I recall correctly, not using InfiniBand.

Prateek Joshi (31:48.117)
Yeah.

Tim Tully (32:02.554)
I think I do know what they are using, and it's not going to touch the transfer speeds of what InfiniBand can do. So, I don't know, I think this was a smart acquisition. I just want to make sure the audience is aware of that one.

Prateek Joshi (32:18.197)
A hundred percent, and I agree more people should be talking about this. I think it's obvious now to people who are looking at it, but in a few years it'll come out as one of the greatest acquisitions of all time, because not only did it help Nvidia scale to new heights, it was very prescient of Nvidia to do it at the time they did, and the impact has been phenomenal. All right, I have... yeah, go ahead.

Tim Tully (32:43.066)
Yeah, sorry, to the other part of your question: I do pay attention a little bit to the data format stuff. I think there was a new version of Parquet that came out recently; I forget the name of it exactly. I look at it, but it doesn't help me with my job, I guess, is the best way to put it. If there was some venture-backable data format company, I'd be all over it, but that's sort of a feature of the larger companies that I want to invest in, not a business on its own.

Prateek Joshi (33:00.501)
Yes.

But, alright.

Prateek Joshi (33:13.525)
Right. I have one final question before we go to the rapid-fire round. What technological breakthroughs in data infrastructure are you most excited about, specifically in the context of investments? Meaning, what investment opportunities are opening up for you?

Tim Tully (33:39.418)
Sort of related; I'm going to go tangential just a little bit, but I can't get away from the idea of what agentic workloads are going to do, especially as they're going to rely on data just like everything does. I think there's this whole dimension of opportunities around agents: where they run, how they get executed, how they get scheduled, how they access data. This is just going to be a massive, massive

greenfield for you and me to be looking at, and so I'm really excited about that. I'm excited about the hardware that's coming out. You mentioned a few of the companies; I've been lucky enough to see some of these things, and they're doing really fantastic stuff. I haven't made any investments there, but I'm really excited by what Groq can do. I've used it, I'm sure you have as well. Really amazing performance on that thing. But I think agents are going to be massive. And so,

for your listeners out there, I would definitely keep my ears open.

Prateek Joshi (34:39.413)
Yeah, I think that's a fantastic angle. And yes, I agree: agentic design, agentic workflows, and products around that are going to play a huge role. Andrew Ng has published a phenomenal set of articles on agentic design and what such a product should look like. But I think there's so much more to do here. So yeah, I agree with that one.

All right, with that, we're at the rapid-fire round. I'll ask a series of questions, and I'd love to hear your answers in 15 seconds or less. You ready? All right, question number one: what's your favorite book?

Tim Tully (35:11.258)
Wow, okay, alright.

Tim Tully (35:16.922)
Infinite Jest by David Foster Wallace.

Prateek Joshi (35:19.541)
Love it. All right, next question. What has been an important but overlooked AI trend in the last 12 months?

Tim Tully (35:29.914)
Good question.

Tim Tully (35:36.314)
I think trust and safety within LLMs is being ignored a bit more than it should be right now. Hallucinations are obviously still a thing, but I think not enough attention is paid to how important this is for developers.

Prateek Joshi (35:53.077)
What's the one thing about databases that most people don't get?

Tim Tully (35:58.362)
Just how complex they are. They're effectively operating-system-level complexity. People don't realize this.

Prateek Joshi (36:04.821)
What separates a great database product from a merely good one?

Tim Tully (36:11.418)
The ease of use, right? The ability to stand up, create, and get going very, very quickly. That developer experience is just everything.

Prateek Joshi (36:22.229)
What have you changed your mind on recently?

Tim Tully (36:27.706)
Agentic workloads. I was dismissing them for a while early on, and then over time I started to see more and more early-stage startups adopting that style of application development in AI. And I've become a fan.

Prateek Joshi (36:45.461)
What's your wildest AI prediction for the next 12 months?

Tim Tully (36:51.994)
We're further away from AGI than anyone thinks. I think it's five plus years away. It's not as close as people think.

Prateek Joshi (36:55.637)
Yeah.

Prateek Joshi (37:02.325)
Yeah. All right, final question. What's your number one piece of advice for founders who are starting out today?

Tim Tully (37:11.994)
Follow a problem that you're passionate about, not something you merely think is the problem you should be solving. If you don't deeply, deeply care about the problem, you're not going to have success. Your buyers will be able to tell that you don't really care about their outcomes, because what you're selling is outcomes, right? You have to care about this problem to the point that you're waking up and thinking about it seven days a week. This has to be something you're just maniacally obsessed over.

Prateek Joshi (37:37.397)
Right, I think that's a fantastic way to put it. Customers buy outcomes, and many founders, maybe they know it, but they ignore it or it becomes a third or fourth priority. I think that should be front and center. You're selling something, obviously, a tool, a product, but really they're not buying the thing, they're buying an outcome. That's fantastic.

Tim, this has been a brilliant discussion. I loved the depth of your insights on data, so thank you so much for coming on the show and sharing them.

Tim Tully (38:09.85)
Yeah, thanks for having me on. It was a lot of fun.
