Infinite ML with Prateek Joshi

Designing Antibodies with AI

Prateek Joshi

Surge Biswas is the cofounder and CEO of Nabla Bio, an AI platform to enable precise drug design and high-throughput measurement of drug properties. The company recently raised a $26M Series A led by Radical Ventures. Surge has a PhD in Bioinformatics and Integrative Genomics from Harvard University.

Surge's favorite book: Permutation City (Author: Greg Egan)

(00:01) Introduction
(00:07) Generative AI in Drug Design
(01:19) Traditional vs. AI-driven Drug Discovery
(03:42) Designing Antibodies
(05:06) Therapeutic Antibodies Design Process
(07:39) Data Sets for AI in Drug Discovery
(10:48) High Throughput Measurement in Drug Discovery
(13:14) Setting Up High Throughput Screening Assays
(18:46) Multiplexed Screens in Drug Discovery
(21:55) Protein Characterization Techniques
(24:33) Protein-Protein Interactions
(28:12) AI in Protein Characterization
(30:55) Technological Breakthroughs in AI and Bio
(32:36) Rapid Fire Round
(36:27) Conclusion

--------
Where to find Prateek Joshi:

Newsletter: https://prateekjoshi.substack.com 
Website: https://prateekj.com 
LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 
Twitter: https://twitter.com/prateekvjoshi 

Prateek Joshi (00:01.38)
Surge, thank you so much for joining me today.

Surge (00:05.07)
Of course, happy to be here. Thanks for inviting me.

Prateek Joshi (00:07.364)
Let's start with fundamentals. Can you explain how generative AI is being used for drug design today?

Surge (00:19.31)
Yeah. At the highest level, a good drug needs to have a variety of different properties to be safe, efficacious, and manufacturable for human use. And the search space over all possible ways you can make a drug is so enormous that a lot of times brute-force screening or search approaches don't work all that well.

The real promise of generative AI is to be able to directly generate solutions from high-probability areas of drug space that satisfy all the different criteria a good drug needs to have. The dream is that you have one generative model that internalizes all the properties you need, and you directly sample solutions at the intersection of all the properties you care about. I'm happy to dig deeper into that.

Prateek Joshi (01:19.012)
Yeah. Yeah. And discovering drugs, that's an entire pharma industry. Like, you discover the next big blockbuster drug and then you get to be the most powerful company. Now, when you think about the transition that's happening with generative AI, what is the biggest advantage that GenAI provides that traditional discovery methods maybe couldn't have?

Surge (01:51.31)
Yeah. So I think it's important to be specific about what we mean by traditional drug discovery. Broadly speaking, traditional drug discovery centers on brute-force screening or search. You have a very large molecule space. In the case of antibodies, the search space is enormous, like 10 to the 20, some giant search space like that.

For small molecules, maybe you're looking for one small molecule in a large library of millions of different potential drugs you could try. Essentially, what you're doing is throwing spaghetti at the wall and seeing what sticks. And for a lot of the lower-hanging fruit in drug design, brute-force screening and search like that can work. That's where a lot of the medicines we have today have come from.

But for some of the hardest drug design problems, for example the kind we're focused on at Nabla, you need an exceptionally high degree of atomic precision in how the drug is designed and how it binds with certain proteins in the human body. It's got to be manufacturable and formulable. It's got to bind only one target. So there's this whole slew of other properties, and just by random guessing and checking, it's very, very unlikely you're going to find a solution that satisfies all these different constraints. And again, this is where I think generative AI can really shine: in solving these multi-property, high-precision problems.

Prateek Joshi (03:42.18)
You mentioned antibodies. So let's talk about designing antibodies. So actually, maybe for people who don't know, can you just explain in basic terms, like what is an antibody?

Surge (03:55.054)
Yeah. An antibody is basically our immune system's way of fighting a bunch of different diseases, particularly infectious diseases. A lot of folks might've heard about antibodies during the pandemic. A natural part of our immune response is to develop immunity through antibodies against a virus like coronavirus. But we can also make antibodies as medicines. In the biotech industry, a huge number of companies are focused on building and designing synthetic antibodies that are made in the lab and then administered to a human to fight various diseases, from infectious disease to cancer to neuro problems, cardio, etc.

Prateek Joshi (04:51.908)
And when you think about designing therapeutic antibodies, what are the key steps involved in that? And also, where can AI help in that process?

Surge (05:06.286)
Yeah. I think it's helpful to talk a little bit about the main end goals and how we build toward them. Broadly, antibodies need to have four, I guess you could call them, categories of properties to be useful as a therapeutic. The first is safety. When you administer an antibody to a human being, that antibody needs to go to the right place in the body. It can't set off the immune system. It needs to have minimal side effects.

The second is that it's got to be efficacious. It's got to bind the thing that it was designed against and turn it on or turn it off, whatever the application needs, and do so with a high degree of activity.

And then we have to be able to make enough of it. So the drug itself has to be manufacturable, and there are a lot of properties the antibody needs to have in order to be manufacturable. And then the last category is that the antibody needs to be formulable. We need to be able to store it for long periods of time, transport it, and then dose it to a human being in a way that's not burdensome.

So, working back from these end goals: typically, the way antibodies are made is you can go with a full brute-force strategy, or you can use techniques like machine learning, to first create what we call a library, or a collection of potential drug candidates. The library can be as small as, let's say, a thousand antibodies, all the way up to a trillion antibodies in some cases.

And then, through a variety of techniques, you whittle down that collection of antibodies to find the ones that satisfy all of these constraints. When you're working at really large library scales, you need really efficient technologies that allow you to plow through that large number of potential drug candidates and get it down to a smaller set. Once you have that smaller set, you do more detailed characterization on each antibody.

Surge (07:33.038)
That tells you whether you're satisfying those end drug goals.
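
[Editor's sketch] The whittling step described above can be pictured as a filter over a candidate library: a candidate survives only if it clears a cutoff in every one of the four property categories. The following is a minimal illustration, not Nabla's actual pipeline. The `Candidate` class, the property scores, and the cutoff values in `THRESHOLDS` are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical cutoffs for the four property categories named in the
# conversation. The numbers are illustrative assumptions, not real values.
THRESHOLDS = {
    "safety": 0.8,
    "efficacy": 0.7,
    "manufacturability": 0.6,
    "formulability": 0.6,
}


@dataclass(frozen=True)
class Candidate:
    """One antibody candidate with a made-up score per property category."""
    name: str
    safety: float
    efficacy: float
    manufacturability: float
    formulability: float


def passes_all(candidate, thresholds):
    """True only if the candidate meets every cutoff simultaneously."""
    return all(
        getattr(candidate, prop) >= cutoff
        for prop, cutoff in thresholds.items()
    )


def whittle(library, thresholds):
    """Keep the candidates that satisfy all constraints at once."""
    return [c for c in library if passes_all(c, thresholds)]


# A toy three-member library; real libraries range from thousands to trillions.
library = [
    Candidate("ab-001", 0.9, 0.8, 0.7, 0.9),  # clears every cutoff
    Candidate("ab-002", 0.9, 0.9, 0.3, 0.9),  # fails manufacturability
    Candidate("ab-003", 0.5, 0.9, 0.9, 0.9),  # fails safety
]
survivors = whittle(library, THRESHOLDS)
print([c.name for c in survivors])
```

The point of the toy example is that a candidate that is excellent on three properties still drops out if it fails the fourth, which is why random guess-and-check rarely lands on a viable drug.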

Prateek Joshi (07:39.588)
And when you have to build an AI system to assist you in this task, where does the data set come from? Or rather, actually, even before that, what should be in the data set? And also, where is that coming from?

Surge (07:56.206)
Yeah. So it's worth mentioning that the data we have today is...

Surge (08:13.07)
not completely aligned with how we ultimately want to be making drugs. I think this is, in many ways, different from other AI domains, let's say vision or NLP, where, broadly speaking, training a language model on internet text is already quite useful. You can interact with that language model and get utility out of it. Obviously a lot more needs to be done to align that for human interaction, but you're pretty much there, or at least in very large part.

For proteins and, let's say, biomolecular drugs generally, the datasets we largely have today as an industry are, just taking proteins as an example, individual protein sequences. So billions of such protein sequences, which isn't a lot, several trillion tokens. Maybe not quite trillions, but close. And then you have maybe hundreds of thousands of protein structures, or small complexes of protein structures.

So that's really great training data. But the challenge is that most of this data is a collection of individual proteins or small groups of proteins. People have built really great generative models of proteins and other biomolecules with this kind of data, but it's devoid of the context of how any molecules you sample from these models ultimately must exist in a human.

When you administer a drug, it's interacting with millions of other proteins, and it has a complex path from the route of administration all the way to where it needs to get to at the end of the day. That is just a much, much more complex setting than the training data these models are being trained on. So there's only so much we can expect these models to generalize to actual therapeutic use. And I think there are...

Surge (10:35.214)
some interesting AI research directions there. But we also just need fundamentally better and more human-aligned data to get us there as well.

Prateek Joshi (10:48.58)
Let's talk about high throughput measurement of drug properties. And in the literature on the website, you've talked about it. And there's so much to discuss here. So maybe to start with, what is this concept of high throughput measurement and why is it important in drug discovery?

Surge (11:10.67)
Yeah.

So even with state-of-the-art generative approaches right now, what those models are really good at telling us is what good-looking protein sequences and structures look like. We can sample from that space, and that's important. Generally, a protein-based drug should be well-folded and it should be stable. These are things you can get from a well-trained generative model.

But, bringing this back to what I was saying earlier, drug design is a much more multi-objective problem than just having stable, well-folded proteins. And there's nothing really in the way we train these models that would make us expect some of these higher-relevance, human-relevant properties to emerge just from training.

So this is where we have to be quite empirical. We can use generative modeling to sample a large number of candidate drugs. Typically we're doing, let's say, 10,000 to a hundred thousand in the lab. But then we need to empirically measure which of those are satisfying our end therapeutic goals. For example, ultimately a drug has to evade the immune system, and it has to be manufacturable in a large fermenter. It's very unrealistic to expect our generative models to somehow just learn that. So the high-throughput property measurement comes in there: we're using proxy measurements, or, as much as possible, directly measuring the property we care about for all, let's say, a hundred thousand drug candidates that we've sampled from our model, and then making empirical decisions about...

Surge (13:10.766)
which ones to double down on and move forward with.

Prateek Joshi (13:14.82)
Can you explain the process of setting up a high throughput screening assay? Like what does it involve and what do you expect out of it?

Surge (13:24.846)
Yeah, so there are maybe two ways you could break this up.

In the first setting, imagine you have a collection of test tubes, and in each test tube you're running one protein and measuring one property of that protein. If you want to measure the properties for, let's say, a million proteins, you need a million test tubes. That's one way of scaling up.

But obviously that has limitations. Doing a million test tubes is, as I'm sure you can imagine, quite challenging. So, where possible, we try to do this thing called multiplexing, where we run parallel property measurements on, let's say, millions of drug candidates in a single test tube.

Prateek Joshi (14:10.244)
Yeah.

Surge (14:32.91)
So the key trick there, and maybe this is a little bit technical, is that we need some way of coupling the protein drug candidates such that, when they're subjected to a property stress, we're able to separate the ones that are quote-unquote working from the ones that aren't in the test tube.

If we can separate the protein molecules that are working, then we have some tricks we can use on the back end to go in and actually determine the identity of the protein drug candidates that are working in that test tube versus the ones that aren't. In that way, in a single test tube experiment, we can read out which of several million drug candidates are actually working versus not.

And the key insight there is that in a single test tube, you can fit trillions of molecules. So if you can find a way of restricting the space you need for an experiment down to maybe micron scale, as opposed to a full test tube, then you can use these techniques.

Prateek Joshi (15:50.66)
All right, and the properties that you measure during this process and also the data that comes out of it, how do you analyze it and how is AI making that faster, better, cheaper, especially in this setup?

Surge (16:10.83)
Yeah. So actually, on the property measurement side, we're not using a lot of AI. But there is a fair amount of computational work that goes into this too, so I can talk about that.

Maybe backing up a little bit here: probably the most common digital-to-real-world interface we have with biology is at the level of DNA. We have good technologies that allow us to write any DNA sequence we want, and from that, we can basically make any protein drug candidate we want. It's completely programmable. And then we also have good ways of reading DNA.

Given a physical DNA molecule, we have good techniques to sequence that DNA. And by sequencing the DNA, we can tell what that protein drug candidate was by virtue of its sequence. So when we're developing these assays, the trick, again, is that as much as possible we want to do everything in a single test tube.

Imagine that within this test tube, we have tiny little micron-scale bubbles that contain our protein of interest. Also contained within that bubble is the DNA sequence encoding that protein. Whenever we're measuring a certain property, you can imagine that the bubbles containing a protein drug candidate with good property values will, let's say, glow or fluoresce, or maybe have some other kind of physical property. And the ones containing protein drug candidates that don't do well are not going to glow. So we can actually separate the glowing bubbles from the non-glowing ones and isolate the successful protein drug candidates. Then, by sequencing the DNA, which requires technologies like next-generation sequencing and some...

Surge (18:33.358)
bioinformatics work, we can go back in and identify which protein drug candidates are working with respect to that property versus not. And then we can actually do this for many different properties at once.
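
[Editor's sketch] One common way to turn this kind of sort-then-sequence experiment into a per-candidate readout is to compare how often each candidate's DNA shows up before and after sorting. The following minimal example assumes that approach. The `enrichment_scores` function, the pseudocount, and the toy read lists are hypothetical illustrations and are not described in the conversation.

```python
import math
from collections import Counter


def enrichment_scores(input_reads, sorted_reads, pseudocount=1.0):
    """Log2 enrichment of each variant in the sorted pool vs. the input pool.

    Each read is a string identifying one drug candidate (for example its
    DNA sequence or barcode). A positive score means the variant became
    more frequent after sorting, i.e. it tended to end up among the
    "glowing" bubbles. The pseudocount avoids division by zero for
    variants that vanish after sorting. Variants seen only in the sorted
    pool are not scored, because they have no input baseline.
    """
    input_counts = Counter(input_reads)
    sorted_counts = Counter(sorted_reads)
    n_variants = len(input_counts)
    input_total = sum(input_counts.values()) + pseudocount * n_variants
    sorted_total = sum(sorted_counts.values()) + pseudocount * n_variants

    scores = {}
    for variant, n_in in input_counts.items():
        freq_in = (n_in + pseudocount) / input_total
        freq_sorted = (sorted_counts.get(variant, 0) + pseudocount) / sorted_total
        scores[variant] = math.log2(freq_sorted / freq_in)
    return scores


# Toy data: three variants start at equal abundance in the input library.
input_reads = ["AAA"] * 10 + ["CCC"] * 10 + ["GGG"] * 10
# After sorting for glowing bubbles, "AAA" dominates and "GGG" is gone.
sorted_reads = ["AAA"] * 25 + ["CCC"] * 5

scores = enrichment_scores(input_reads, sorted_reads)
for variant, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{variant}: {score:+.2f}")
```

In this toy run, "AAA" gets a positive score and the other two get negative scores, which is the in-silico analogue of separating glowing from non-glowing bubbles. Repeating the same calculation on pools sorted for different properties would give one score per candidate per property.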

Prateek Joshi (18:46.66)
And we briefly touched upon this a little bit, but I want to maybe spend a little bit of time on multiplexed screens. And can you maybe just explain in basic terms what are they and how are they being used for drug discovery? And also the advantages of using multiplexed screens over, say, single parameter assays.

Surge (19:10.798)
Yeah. Yeah. So.

Surge (19:17.934)
The main advantage of multiplexing is...

Again, a lot of times you're interested in screening, you know, let's say a million different drugs and it's just going to be not possible to do that one at a time. So multiplexing can be very, very useful in that regime. 

Prateek Joshi (19:46.5)
So how are multiplexed screens useful in drug discovery, and what are their advantages over single-parameter assays?

Surge (19:58.926)
Yeah, so the typical ways we use multiplexed screens are...

Surge (20:09.262)
whenever we can represent a human-relevant property, simplify it, and measure it in a way that involves just our protein of interest, or our protein of interest interacting with some other agent of interest. A very typical example is that we want to know if our protein drug candidate is binding another protein of interest. Let's say that's a human protein that we need to drug.

These types of what we call drug-target interactions are highly multiplexable. We can measure protein-protein interactions at million-fold or billion-fold scale fairly easily. Where multiplexing is less useful is in measuring extremely relevant properties. For example, how is this drug going to behave in a human being? It's hard to fit billions of humans into a single test tube and multiplex in that way, right? So again, the trick is to take high-relevance properties we care about and come up with a proxy version of that property that's multiplexable. And then the main advantage of multiplexing is being able to plow through a really large number of drug candidates very quickly.

Prateek Joshi (21:35.236)
Let's talk about protein characterization. Now, when you look at protein, it's being studied so much, extremely important in biology. So just for starters, how do we determine the protein structure? Like what goes into it?

Surge (21:55.31)
Protein structure, yeah. So there are a few different ways. The most common are two techniques. One is called cryo-electron microscopy, and the second is X-ray crystallography.

In the first case, cryo, the idea is you take a sample of your protein and, I'm simplifying, but you suspend it in essentially water and you add it to this thin wafer. Then that gets flash frozen. So what you end up having is a very thin film of ice with your protein embedded in that ice. The protein is frozen in a bunch of different orientations. Then, using a powerful microscope, an electron microscope, you image that plane of protein in a top-down way, and you get all these different 2D projections of your protein.

And then, using machine learning, and this is a place where computer vision and generative methods in vision have proven quite useful, you can reconstruct from those 2D projections what the 3D structure of that protein must be. So this is turning out to be quite a powerful technology. I don't think it's quite on an exponential curve yet in terms of the number of structures it's giving us, but it's definitely better than linear. And people are working on improving this method a lot.

The more old-school way is X-ray crystallography. The basic idea is you take a pure protein sample, and under the right conditions that sample will crystallize, and your proteins will be ordered in a very regular pattern, sort of like a salt crystal. Then, if you shine X-ray beams onto that crystal, the regular ordering of the protein will diffract them in a predictable pattern.

Surge (24:13.358)
From that diffraction pattern, and some math that I don't fully understand, you're able to reconstruct the positions of the atoms in the protein sample.

Prateek Joshi (24:33.924)
Right. And when we talk about proteins, protein-protein interaction is a key topic. So, maybe a two-part question: can you explain what protein-protein interaction is? And also, why is it important? Why do we have to analyze it?

Surge (24:54.318)
Yeah, I'll start with maybe why it's important.

I think this is not an exaggeration, but basically all information in biology is propagated through interactions of biomolecules, proteins being a major kind of biomolecule. What I basically mean is, let's say you have one protein and another protein. They physically interact by binding and touching one another.

Oftentimes this causes one of the proteins to change shape, and now it's imbued with some new information. Now that protein with a different shape is going to go interact with another protein, maybe change its shape. And then that protein is going to go interact with another protein, and so on. You can have information propagate through these physical interactions.

This is basically how cells are able to compute. Given certain environmental conditions, the cell can integrate all of that information, process it, and then act in a certain way. Maybe the cell needs to divide more, or move, or something. Sometimes they get that whole process wrong. In cancer, the appropriate feedbacks are not in place, and the cells will just continue to divide when they really shouldn't.

Protein-protein interactions, again, are at the core of how all of that happens. So it's not crazy, then, that if you want to have impact on disease in the context of developing therapeutics, the way you need to do that is by influencing these interactions between biomolecules. So if you want to go make a drug, and the hypothesis is that...

Surge (26:55.982)
there's this one protein on the surface of a cancer cell that's maybe a little too active and that's what's causing the cancer cell to divide, then a good way to make a drug would be: okay, let's go build another protein, let's say an antibody, that binds that overactive target protein and slows it down or inhibits it, as a way of having an effect on the overall cell and stopping the growth of, let's say, cancer.

Prateek Joshi (27:24.996)
Fascinating. And in addition to coming up with viable drug candidates, AI can help in so many ways. One of the more popular ways is just coming up with a giant list. And then obviously, in the lab and in the real world, we have to validate that, hey, this is valid and that is maybe not practical, before we put it into humans. When you look at the role of AI, not just generative AI but AI in general, how else can it help in protein characterization, and in this entire process? Some could be simple, some could be very complex, but where do you see AI playing a role in making this faster, better, and cheaper?

Surge (28:12.398)
In protein characterization specifically?

Prateek Joshi (28:14.34)
Yeah. Yeah.

Surge (28:17.806)
Yeah.

Surge (28:23.566)
Well, I think a lot of instruments are measuring fairly raw signals. You will perturb a protein in a certain way, maybe shine light at mixed wavelengths onto the protein, and then you're measuring what comes back to you as some proxy of some property you care about.

So there are the raw signals coming back to the sensors, which probably contain a pretty comprehensive picture of what that protein is or how it's behaving. But then there's this other layer where those raw signals get, in some sense, dumbed down so that the user can interpret: okay, this means the protein is stable at this temperature or not.

It might actually be a lot more valuable to have an AI plugged directly into those raw sensor values, because it's going to be able to consume all of that information much better than a human can. And there might be a lot more information contained in that than just the thing we might be looking for. A large part of that is that we are imposing our mental model of what these proteins should look like. We think, okay, it should have this stability value, and it should not be unfolded at this temperature, or it shouldn't be interacting with this hydrophobic column for more than 10 minutes, or something like that. That's all seen through very human eyes. But one general takeaway in AI is that, as long as you set up the right learning conditions, these models want to learn, and they can actually learn from very raw forms of data.

Surge (30:24.654)
And that's actually how you get the best models at the end of the day. And that's not something we really respect when it comes to precision measurement.

Prateek Joshi (30:31.332)
All right. And I have one last question before we go to the rapid fire round. And it's about all the technological innovations that are happening right now. So what technological breakthroughs are you most excited about, specifically in the overlap of AI and bio?

Surge (30:55.95)
Yeah. So I think one thing we're very excited about at Nabla, and spend a lot of time working on, goes back to something I said earlier: the way these models are trained today, they are largely devoid of the broader context of human biology. With the generative models, we can sample from them and we'll get very useful drug candidates, but it's still very far removed from end human use.

So it naturally raises the question: how do you take these generative models and, in much the same way you would align a model trained on natural language for human use and human interaction, let's say by RLHF, what's the bio or drug equivalent of that? Can we take general generative models of proteins and align them to generate drug candidates that are aligned with human end use at the end of the day? The way you combine training of these generative models with lab-collected data to satisfy these end human use cases is a very interesting problem. And I think there's a lot of low-hanging fruit there.

Prateek Joshi (32:36.476)
Perfect. With that, we're at the rapid fire round. I'll ask a series of questions and would love to hear your answers in 15 seconds or less. You ready?

Surge (32:51.246)
Sounds good.

Prateek Joshi (32:52.388)
Alright, question number one. What's your favorite book?

Surge (32:57.078)
Permutation City by Greg Egan.

Prateek Joshi (33:01.444)
Amazing. That's the first on the podcast. I'll check it out. All right, next question. What has been an important but overlooked AI trend in the last 12 months?

Surge (33:12.814)
Yeah. The trend is definitely not 12 months old. It's much older than that. But like, I think people are consistently surprised by how applicable the bitter lesson is, you know, both in AI generally, but even in bio.

Prateek Joshi (33:29.348)
What's the one thing about AI powered biology that most people don't get?

Surge (33:36.206)
Yeah, I think in biotech and drug development, we are not really limited by AI. We're fundamentally limited by our knowledge of biology, how to manipulate it, and what questions to ask of it.

Prateek Joshi (33:51.14)
What separates great AI products from the merely good ones?

Surge (33:57.326)
Yeah, maybe stepping outside of biotech a bit here. Maybe two things: simplicity and having an opinion.

Prateek Joshi (34:09.86)
That's actually amazing. I think having an opinion is so important, but actually underrated. So many times people are either too afraid to have an opinion, or they're like, let me appease a whole bunch of people. So it's like a bland, generic nothing. So that's actually a really good one. All right, next question. What have you changed your mind on recently?

Surge (34:31.662)
Yeah, I generally believe a lot in the power of scaling. I think that's going to continue to work for a long time. But at the same time, I don't believe scaling what we have now is going to lead to the breakthroughs we really need in AI, and especially drug design.

Prateek Joshi (34:56.324)
What's your wildest AI prediction for the next 12 months?

Surge (35:02.222)
I think we're going to have small AI proxies of ourselves that interface meaningfully with the real world on our behalf, on low-stakes stuff.

Prateek Joshi (35:19.46)
That's actually a good one. All right, final question. What's your number one piece of advice for founders who are starting out today?

Surge (35:29.774)
Yeah, that's a good one. I think like...

Surge (35:37.422)
Set out to solve the most ambitious version of a real problem. Speaking from firsthand experience, raising the bar doesn't actually make the work any harder. It actually helps make everything much easier: for example, motivating people, recruiting people, the satisfaction you get from working on stuff. And then, okay, maybe one more thing I would say, and I think we have a not-great version of this in biotech, is just don't worry about competition. Especially in therapeutics, there are, I don't know, a hundred times as many problems to solve as there are people working on them. So yeah, just don't pay attention to what other people are doing. Just make sure you're solving real problems.

Prateek Joshi (36:27.108)
That's a phenomenal way to end the episode. I think a combination of ambition, like, pursue the most ambitious thing. In fact, the more ambitious it is, as you said, the easier it's going to be to attract people, because as humans we are attracted to ambition and people want to do great things. So that's actually fantastic. And yeah, I think the amount of work to be done in bio is so much, right? We just need more people to do good stuff and not worry too much about who's competing, because it's unlike, say, a marketing automation SaaS tool, where there are just too many companies. Here in biology, there's just a lot of work to be done. So Surge, it's phenomenal. Loved your insights, loved your knowledge of the space. And I just love discussing all things AI and bio. Thank you so much for coming onto the show and sharing your insights.

Surge (37:23.534)
Yeah, thank you so much for having me. This was fun.