What’s the BUZZ? — AI in Business

Augment Your Generative AI Model With New Data (Guest: Harpreet Sahota)

Andreas Welsch Season 2 Episode 19

In this episode, Harpreet Sahota (Developer Relations Expert) and Andreas Welsch discuss augmenting your Generative AI model with new data. Harpreet shares his insights on going beyond the limitations of off-the-shelf Large Language Models (LLMs) and provides valuable advice for listeners looking to create AI-based applications that generate tailored, traceable information.

Key topics:
- Learn about different techniques to get better results (prompting, fine-tuning, retrieval augmented generation)
- Get advice on which technique to use when
- Understand the five steps for building RAG-based systems
- Find out how to improve RAG-based results

Listen to the full episode to hear how you can:
- Assess common LLM limitations such as knowledge cut-off and hallucinations
- Choose Retrieval Augmented Generation (RAG) to provide current, traceable data to your LLM
- Fine-tune a model to get tailored results based on adjusted weights and datasets
- Become aware of data requirements and how to chunk your data for optimal results

Watch this episode on YouTube:
https://youtu.be/pUzlhAFYyHI


***********
Disclaimer: Views are the participants’ own and do not represent those of any participant’s past, present, or future employers. Participation in this event is independent of any potential business relationship (past, present, or future) between the participants or between their employers.


Level up your AI Leadership game with the AI Leadership Handbook:
https://www.aileadershiphandbook.com

More details:
https://www.intelligence-briefing.com
All episodes:
https://www.intelligence-briefing.com/podcast
Get a weekly thought-provoking post in your inbox:
https://www.intelligence-briefing.com/newsletter

Andreas Welsch:

Today, we'll talk about how you can augment your generative AI model with new data. And who better to talk about it than someone who's actively teaching others how to do that: Harpreet Sahota. Hey, Harpreet, thank you so much for joining.

Harpreet Sahota:

Thank you for having me, Andreas. Good to be here.

Andreas Welsch:

Why don't you tell our audience a little bit about yourself, who you are, and what you do?

Harpreet Sahota:

Yeah, definitely. I'm currently working in developer relations. I've been in the DevRel space for about two and a half years. Right now, I'm at Deci AI. Before this, I was at Pachyderm and Comet. In the eight years before I got into DevRel, I was a practitioner, a data scientist. I've worked as an actuary, biostatistician, senior data scientist, and lead data scientist. So I've definitely been quantitative for quite some time. In addition to that, I've got some LinkedIn Learning courses that I've done: one on computer vision, one that I'm going to be actively recording next month that's all about LangChain, and another one that's using LlamaIndex for retrieval augmented generation. So yeah, that's me in a nutshell.

Andreas Welsch:

It's awesome. I know you're super passionate about the topic and about helping others learn about it. And what you mentioned just now about Retrieval Augmented Generation, that really fits the theme and the topic of today's episode. So, super excited to have you on.

Harpreet Sahota:

Yeah.

Andreas Welsch:

For those of you in the audience, if you're just joining the stream, drop a comment in the chat where you're joining us from today. I'm always curious to see how global our audience is. Should we play a little game to kick things off?

Harpreet Sahota:

Yeah, let's do it, man. Let's do it.

Andreas Welsch:

Perfect. Alright, so this game is called In Your Own Words. And when I hit the buzzer, the wheels will start spinning. And when they stop, you see a sentence. And I'd like you to complete that in your own words. You only have 60 seconds to make it a little more interesting. Are you ready for What's the Buzz?

Harpreet Sahota:

Ready as I'll ever be. Let's do it.

Andreas Welsch:

Perfect. If AI were a month of the year, what would it be? 60 seconds, go.

Harpreet Sahota:

I think AI would be the month of May, because people always talk about AI winters. But when an AI winter comes through, what do we have? We have the month of May. When it's cold outside... We got the month of May. May is the beginning of spring and that is where we are in AI right now. We are in the spring of AI. So what better month for AI to be than May?

Andreas Welsch:

That's perfect. I'm already looking forward again to spring before it gets cold here on the East Coast in the US. I really like that answer. Awesome. By the way, folks in the audience, if you have questions for Harpreet, feel free to pop them in the chat as well. We'll go through them in a few minutes. But maybe we can kick things off with our first question. There's been a lot of talk about off-the-shelf models having a cut-off date. Think of things like GPT-3.5 or 4 generating just generic content. Sure, you can fix a lot of things with better prompting, but there are limitations in these off-the-shelf models. On the other hand, you have fine-tuning, which is resource- and budget-intensive. Over the last couple of months especially, one approach has emerged that seems to me like a middle ground. You already mentioned it in your intro: Retrieval Augmented Generation, RAG. I was wondering, from your point of view, what's the benefit, and what is it even about? Maybe to begin with.

Harpreet Sahota:

First, let's talk about the differences between fine-tuning and RAG, touching on some of the technical aspects of them at a high level, then get into some of the benefits of using RAG, and then just close on when you'd choose which. So what is fine-tuning? Fine-tuning basically means that you're updating a pre-trained model's weights via backpropagation on either a domain- or task-specific dataset, right? So when you do this, you end up refining the broad knowledge that the model has learned during its initial pre-training, and you're now adapting its behavior to a specific task or domain. I think about it this way. I have a graduate degree in mathematics and a graduate degree in statistics. Tons of quantitative knowledge. But I need to fine-tune my knowledge for machine learning somehow. So I train and study and learn up on books, and that kind of fine-tuned my worldview for machine learning and AI. And that's how I like to think about fine-tuning: I've got general quantitative skills, but those quantitative skills might not directly map to AI or ML. I need further training on that.

But the thing with fine-tuning is that you need technical expertise to get started. So everything from data curation, to managing computational resources, to understanding how to allocate and utilize those resources, all the way down to how to set up the right evaluation metrics to assess the model's performance. So fine-tuning really is a model training kind of pipeline. You have to prepare, curate, and manage high-quality data. You've got to clean and pre-process the data. You need to select a base model as the foundation that you're going to build on. Then you've got to train the model, adjust the weights on that new dataset, then evaluate the performance, and finally deploy it. So it's a machine learning task at the end of the day.

Retrieval Augmented Generation, though, is about combining the model with external knowledge. So now, instead of just solely being reliant on its internal training data, you have the model query some external database for relevant knowledge. That's going to help assist in generating a response. I've got a ton of books right behind me, touching all the quantitative topics that I know. Sometimes I just need a quick refresher on some topics, I need to get some additional context. I'll grab a book, I'll look it up, and then I'll bake it into whatever it is I'm doing. So that's the idea behind retrieval augmented generation.

Implementing this requires a medium level of technical expertise, because you're having to encode your data into embeddings, then you have to index that into a vector database. And then when a user is posing a query, you need to convert that query into an embedding, and then have a system that searches your vector database for similar embeddings, retrieves those embeddings, and then injects them into a prompt that can give the LLM more context so it can answer the user's query. So there are a number of things you need to get started with RAG. One, there's really no training pipeline at a high level. You can always fine-tune your embeddings and things like that, but we're not going to discuss that here. But there's really no model training going on. You need a data ingestion pipeline, data storage, you need embeddings, you need a vector database. So your system design or pipeline for RAG is essentially five components. You've got the indexing pipeline. This is just a way to index data into your vector database.
You've got your user's query, because you need to transform the user's query into an embedding so that you can search for the relevant documents in the database. Then there's the retrieval aspect of it. So how do you extract relevant data based on the user's query? You can use some classical information retrieval techniques. You can do cosine similarity, you can do BM25, whatever you choose, so that you can comb through this database and retrieve the relevant vectors for your user's query. But then even once those vectors are retrieved, you need a way to select the right ones, right? Because you want to be able to filter out the less relevant documents from those retrieved documents so that you can have better context for your model.

I hope it's becoming clear how RAG overcomes the limitations of an off-the-shelf, pre-trained model, but just to make it absolutely clear, we can start by stating the biggest limitations of a pre-trained model. You mentioned before: fixed knowledge cutoff dates. That's going to lead to outdated or stale responses. And then there's also a tendency for a language model to generate overly generic or broad content because of its generalized training. And this is where RAG steps in, right? RAG is able to query a database and make sure that the responses we're getting are now timely, they're relevant, they're factual. It's also helping to augment the capabilities of a large language model, because now we're going to be able to get more informed, context-aware answers. You could also use RAG to do source attribution as well.

And I guess the last thing I need to talk about is a general framework of when to choose what, right? When it comes to fine-tuning, the questions you need to be asking yourself are: do I want a model that needs to have deep domain understanding or specificity in a particular topic? If yes, maybe go with fine-tuning. If you need consistent or predictable behavior in the pattern of responses, then go with fine-tuning. If you've got a ton of domain-specific data that you can use to tune a model so that the model can make better decisions or produce more consistent output, beyond what you can do with prompting and few-shot examples, then you might need fine-tuning. And you would use RAG, Retrieval Augmented Generation, if you have a use case where an LLM needs access to real-time, current data, knowledge, information, all that. If your application requires broad, general, topical coverage or the ability to adapt its responses, but just needs access to new data, then RAG is a good one as well. Yeah, it's just a cost-effective, scalable solution. It doesn't really require any constant retraining, just constant retrieving.
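To make the retrieval flow Harpreet describes a bit more concrete, here is a minimal sketch: embed and index document chunks, embed the user's query the same way, retrieve the most similar chunks by cosine similarity, and inject them into the prompt. The sentence-transformers package, the model name, and the sample documents are illustrative assumptions, not tools used on the show; in practice you would swap in a proper vector database and your own embedding model.

```python
# Minimal RAG sketch (illustrative, not production code):
# 1) index document embeddings, 2) embed the query, 3) retrieve by cosine
# similarity, 4) build the augmented prompt for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# 1. Indexing pipeline: embed your document chunks and store the vectors.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm EST.",
    "Premium subscribers get priority access to new features.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2. User query: transform it into an embedding the same way.
query = "How long do customers have to return a product?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# 3. Retrieval: cosine similarity is a dot product of normalized vectors;
#    keep only the top-k chunks so less relevant documents are filtered out.
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]
retrieved = [documents[i] for i in top_k]

# 4. Augmented generation: inject the retrieved context into the prompt.
#    (The actual LLM call is omitted here.)
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(retrieved) +
    f"\n\nQuestion: {query}"
)
print(prompt)
```

In a real system the in-memory list and dot product would be replaced by a vector database and its search API, but the shape of the pipeline stays the same.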

Andreas Welsch:

Awesome. Thank you for walking us through that. I think it's always important to level-set on the different terms and techniques that we're talking about anyway, and then in this case specifically, what are the pros and cons. And if I look at use cases in business, it's rarely about just general and generic information. It's about proprietary information. It's about information that's specific to your business, maybe to your products, to your services. So you have that, but you might not necessarily have it stored somewhere in a database. Maybe it's in a file, or it's in a bunch of files. Maybe it's a policy or an FAQ document or many other things. So I think that's where, to your point, this can really play to its strengths while being a happy medium between the power of natural language generation and the depth and relevance of the information that the model itself doesn't have access to. Awesome, thank you for walking us through that.

Harpreet Sahota:

Yeah, no problem.

Andreas Welsch:

I think data has always been a problem when you're building your own model. I think you alluded to parts of that earlier, that you obviously need data, you need your own data. It comes in different shapes and forms, maybe it's structured, maybe it's unstructured, but how does the requirement for data or good data change or maybe even stay the same when you're looking at things like RAG?

Harpreet Sahota:

Oh, it's like the least important part of the equation. Obviously, I'm joking, right? Obviously, I'm joking. Yes, let's talk about it. Remember, what does RAG do? RAG is combining our model with external data that's going to be retrieved based on a user's query or context, right? It goes without saying that the quality of data in that external repository or database that your RAG system is going to interface with is extremely, absolutely critical. And why? Because the performance of RAG is going to be directly influenced by the quality and relevance of that document store. So if that corpus of data you have contains outdated or irrelevant documents, then your generations are not going to look good. They're going to lack accuracy. You can think of this as the same song that Andrew Ng has been singing about data-centric AI. We're holding the LLM and the embedding model fixed, and we're just iterating on the data that we can shove into the context window of the LLM.

Just concretely talking about a few ways that data quality can impact a RAG system. One is the accuracy of responses. To get accurate responses, you need accurate data. High-quality, accurate data is going to allow your language model, whichever one you're using, to generate more reliable and accurate responses. If that external data is outdated or inaccurate, then the LLM's generation is going to reflect that. Then there's the issue of the relevance of the retrieved documents, because that's crucial for generating meaningful and contextually appropriate responses. High-quality data, organized and properly tagged with metadata, is going to result in more relevant information being retrieved out of your database and utilized in the generation process. Efficiency and speed: obviously, if you have well-structured, clean data, it's going to be quicker to retrieve and easier to index. That's going to lead to lower retrieval latency, faster response times, and less computational overhead, of course. Trustworthiness is another huge key. You'd better have a high amount of trust in that external data, because it's going to impact the trustworthiness of the generated response, and that again goes back to the quality of the external data. Data freshness is key as well. Just make sure that the external database remains up to date. This is crucial for RAG. So yeah, high-quality data is crucial. You have to spend some time cleaning data. Like, I recently wrote a blog post where I was doing a crash course in LlamaIndex, using our Deci LM with the State of AI 2023 report. And as I converted each slide into a document, I noticed that, okay, every single slide has the header and the footer, and it's just a lot of extra data that would be clogging my vector database. So you have to go through, clean it, parse it. Yeah, so data quality is definitely, definitely crucial.
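As a small illustration of the cleanup Harpreet describes for the slide deck, here is a sketch that strips repeated headers and footers before documents are chunked and embedded. The header and footer strings and the sample slide are made up for the example; the point is simply that boilerplate should not end up in your vector store.

```python
# Sketch: remove repeated slide boilerplate before ingestion so only the
# slide's real content gets embedded and indexed. Strings are illustrative.
HEADER = "State of AI Report 2023"   # hypothetical repeated header
FOOTER = "stateof.ai"                # hypothetical repeated footer

def clean_slide_text(raw: str) -> str:
    """Drop boilerplate lines and collapse whitespace."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in (HEADER, FOOTER):
            continue
        kept.append(line)
    return " ".join(kept)

raw_slide = """State of AI Report 2023
Transformer adoption continues to grow across industries.
stateof.ai"""

print(clean_slide_text(raw_slide))
# -> "Transformer adoption continues to grow across industries."
```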

Andreas Welsch:

Awesome. There was another question in the chat: can we combine fine-tuning with RAG? And if yes, how does it work? I think that's a really good question, because these are different techniques, right? The question is, can you layer them on top of each other? Does one replace the other? What would you say there? What do you recommend?

Harpreet Sahota:

Yeah, that's a good question. And I don't know if my answer will be a good one. When I think of fine-tuning, I think of changing the behavior of a model. I'm not necessarily fine-tuning to augment the parametric knowledge of a model. You want to fine-tune to change the behavior of a model. Whereas RAG is injecting new knowledge into the context window of an LLM. But that being said, you can fine-tune your embeddings model over your particular document corpus so that you get more relevant retrievals. So in that sense, yes, you can combine fine-tuning with RAG. I haven't seen any papers or read much about actual fine-tuning to change the behavior combined with RAG, but there's so much happening in this space that it's hard for me to keep up. So if anybody has links to that, let me know.
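For the one combination Harpreet does call out, fine-tuning the embedding model on your own corpus, here is a hedged sketch using the sentence-transformers training API. The (question, passage) pairs and the model name are invented for illustration; the idea is that in-domain pairs teach the embedder to rank your documents higher for your users' queries, which improves retrieval without touching the LLM itself.

```python
# Sketch: fine-tune the embedding model (not the LLM) on in-domain
# (query, relevant passage) pairs to improve RAG retrieval. Illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base embedder

# Pairs of (user question, passage that should be retrieved for it).
train_examples = [
    InputExample(texts=["What is the return window?",
                        "Returns are accepted within 30 days of purchase."]),
    InputExample(texts=["When is support available?",
                        "Support hours are Monday to Friday, 9am to 5pm EST."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# MultipleNegativesRankingLoss treats other in-batch passages as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
model.save("my-domain-tuned-embedder")
```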

Andreas Welsch:

So obviously we're a lot more technical in today's episode than we typically are, but I think it's important to cover more and more of these aspects as the field matures and as we go through different techniques. I feel like earlier in the year it was all about, hey, you should prompt your model, and don't think about fine-tuning. And I didn't see a lot of people talk about retrieval augmented generation. Now I feel that's shifted, especially as there's more adoption and more exploration. What can we do with it? Where are the opportunities? Where are the limitations? I don't know. Maybe we'll talk about fine-tuning a couple of months from now. Who knows? But since we're already talking about this foundational layer in the technology, are there specific models that you see as best suited for RAG? Can I just use my GPT-3.5 or GPT-4 or Claude, or others? What do you see? What's best suited for RAG?

Harpreet Sahota:

You can broadly categorize large language models into two categories. There's the base LLM, and then there are instruction-tuned LLMs. So, base LLM: it's just designed to predict the next word based on the training data. They're not really designed to answer questions or carry out a conversation or help you solve a problem. If you just feed a bunch of context into a base LLM, it's just going to try to complete the next token, right? For however many tokens you have it generate. And this is different from the instruction-tuned LLM. An instruction-tuned LLM, instead of trying to auto-complete your text, is going to try to follow the given instructions using the data it's been trained on, plus whatever context you provide it with. So in that sense, opt for an instruction-tuned or a chat-tuned model for RAG, because if you try to shove a bunch of context into a base LLM, you won't end up with the result that you're looking for.

Andreas Welsch:

What would be some examples of instruction-tuned versus the other option you mentioned here?

Harpreet Sahota:

So GPT-3, the original GPT-3, that's a base LLM. It's a completion model. If you look at any of the major releases that have happened, whether it's Llama, Mistral, or Deci LM, there are always two different models that you'll see. You'll have one be the base model, and then one be the instruction-tuned model. All the models that I've seen released in the last few months will have that release pattern: a base model along with the instruction-tuned model.
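As a rough sketch of the difference in practice, here is how prompting the two variants tends to differ. The model names are placeholders, and this assumes a recent version of the Hugging Face transformers library whose text-generation pipeline accepts chat-style messages; substitute whichever base/instruct pair you actually use.

```python
# Illustrative contrast: base (completion) model vs. instruction-tuned model.
# Model identifiers below are placeholders, not real checkpoints.
from transformers import pipeline

context = "Returns are accepted within 30 days of purchase."
question = "How long is the return window?"

# A base LLM just continues the text, so stuffing RAG context into it tends
# to produce document-like continuation rather than a direct answer.
base = pipeline("text-generation", model="your-org/base-model")
print(base(f"{context}\nQuestion: {question}\nAnswer:", max_new_tokens=50))

# An instruction-tuned model is trained to follow the instruction and use the
# provided context, which is what a RAG prompt relies on.
chat = pipeline("text-generation", model="your-org/instruct-model")
messages = [
    {"role": "user",
     "content": f"Use this context to answer.\nContext: {context}\n"
                f"Question: {question}"},
]
print(chat(messages, max_new_tokens=50))
```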

Andreas Welsch:

Awesome. So to make it actionable and concrete, because I know a lot of leaders are watching the show, and people who want to move into leadership positions and are in an expert role right now: what's your advice? How can leaders and experts get started with RAG? What do they need to know if they're either just hearing about it for the first time or want to learn more about it and go more in depth?

Harpreet Sahota:

For leaders, definitely, first and foremost, just know what RAG actually allows you to do. Remember, it allows an LLM to retrieve real-time data from external databases at inference time, instead of just solely relying on parametric or pre-trained knowledge. Also, like any technology, it's essential to approach your implementation of it thoughtfully. So I guess, just at a high level, leaders should for sure know a little bit about the system design, right? There are data pipelines, and you've got to think about continuously updating and pre-processing the data for retrieval for your LLM. The embeddings and indexing, right? Because we have to convert data to vector representations and then index that into a vector database. Understanding the advantages of RAG in terms of cost efficiency, because there's a lower upfront investment compared to fine-tuning. Data security is obviously always critical. Technical complexity: we talked about how RAG doesn't require as much technical expertise as fine-tuning a model. And then maybe a strategic kind of implementation plan for it, because RAG is nice in that we get a balance of real-time data access with computational efficiency. But in order to successfully deploy it, you need to have a really solid data infrastructure and all that comes with that.

Also, just know the pros and cons of each. When we talk about fine-tuning, some of the pros are that you get more uniform and predictable responses, because you're training it to behave in a particular way. We can also get a little bit more domain expertise, because the model is going to gain proficiency in a targeted area or a specific domain. You'll get more influence over the model's output, because it's conditioned by the data that it's trained on. There are some cons to that as well. It's resource intensive: it requires a ton of data, large volumes of high-quality, domain-specific data. There's also the risk of overfitting. You can train a model to the point of what's called catastrophic forgetting, where it becomes overly specialized and might even forget things or lose its broader applicability.

RAG also has pros and cons. A pro is definitely that dynamic knowledge, right? We have access to continuously updated data. We can adjust the model's responses based on the data that we're retrieving, and that helps reduce the tendency for an LLM to give a generic answer. It's a lot more cost efficient as well, because we don't need to constantly fine-tune or do any kind of resource-intensive work; we're essentially just querying an external database. RAG's not without its cons either, right? It is still a complex system, because we have an additional layer of complexity on top of just the language model; now we've got that retrieval layer as well. So that's going to add that additional bit of complexity. There are also things you need to keep in mind in terms of data relevance. You need to constantly update that data store so that the retrieved documents are relevant, so that maintenance process is ongoing and could possibly get demanding. There are a number of other things you need to worry about. How we actually index data is a thing in RAG, because a user's query might not actually align semantically with the documents that we have. How does similarity search work, right?
Similarity search is often going to retrieve documents that have the same words or context as the question, but sometimes those documents are not going to have meaningful answers to the user's query, right? So you need to come up with good ways of indexing your data. And then once we get the data into the database, how do we chunk it? That matters a lot, right? Because if the data that's indexed is in large chunks with a lot of diverse information in them, then important details might get diluted, or we might get irrelevant documents retrieved. So we need to make sure that data chunks are concise. Yeah, so there's a lot in the middle, all sorts of pros and cons.
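To make the chunking point tangible, here is a minimal sketch of fixed-size chunking with overlap, the simplest of the strategies Harpreet alludes to. The sizes are illustrative assumptions; real systems often chunk by sentences, paragraphs, or tokens and tune the sizes against retrieval quality.

```python
# Sketch: split a document into concise, overlapping chunks before embedding,
# so retrieved chunks stay focused instead of diluting important details.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (chunk_size > overlap)."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # overlap preserves context across boundaries
    return chunks

document = "Your policy document or report text goes here. " * 40  # placeholder
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk))
```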

Andreas Welsch:

That's great. You can clearly see how it starts at a high-level view of understanding what you can do with it, and it goes all the way down to the actual data in bits and bytes. And I must say, I never thought that after college I would need to use vectors again. I don't know how you in the audience feel about that. But to avoid the catastrophic forgetting that you mentioned, for our audience, can you summarize the top three takeaways as we're wrapping up today's episode?

Harpreet Sahota:

I think the top three takeaways are, first, just understanding what it is that retrieval augmented generation does. We're allowing a large language model to interact with an external database. That's essentially what RAG does. How is it different from fine-tuning? With fine-tuning, essentially, we're trying to change the behavior of a model, change the way that it responds. So that's a bit of a difference there. Second, I'd say the pros and cons of fine-tuning and RAG are important to consider as well. And then the third takeaway here: just make sure that when you're building your vector store, you consider how you're indexing and how you're chunking your data. We've seen a lot of tutorials out there with RAG where we just take a document and embed it as is, and that's it. But be thoughtful about how you're chunking and indexing your data, because this is really going to affect the quality of the generation.

Andreas Welsch:

Awesome. Thank you so much. I really appreciate the depth that you bring to the subject and for sharing your expertise with us today. It was great having you on.

Harpreet Sahota:

Thank you so much, man. Appreciate you bringing me here. And if anybody has questions, by all means hit me up, shoot me a message.

Andreas Welsch:

Sounds great. And for you in the audience, thank you so much for learning with us.
