The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Model Validation: Performance
Episode 9. Continuing our series on model validation, the hosts focus on aspects of performance: why we need to do statistics correctly and not use metrics without understanding how they work, so that models are evaluated in a meaningful way.
- AI regulations, red team testing, and physics-based modeling. 0:03
- The hosts discuss the Biden administration's executive order on AI and its implications for model validation and performance.
- Evaluating machine learning models using accuracy, recall, and precision. 6:52
- The four types of results in classification: true positive, false positive, true negative, and false negative.
- The three standard metrics are composed of these elements: accuracy, recall, and precision.
- Accuracy metrics for classification models. 12:36
- Precision and recall as complementary measures to accuracy in machine learning.
- Using F1 score and F beta score in classification models, particularly when dealing with imbalanced data.
- Performance metrics for regression tasks. 17:08
- Handling imbalanced outcomes in machine learning, particularly in regression tasks.
- The different metrics used to evaluate regression models, including mean squared error.
- Performance metrics for machine learning models. 19:56
- Mean squared error (MSE) as a metric for evaluating the accuracy of machine learning models, using the example of predicting house prices.
- Mean absolute error (MAE) as an alternative metric, which penalizes large errors less heavily and is more straightforward to compute.
- Graph theory and operations research applications. 25:48
- Graph theory in machine learning, including the shortest path problem and clustering. Euclidean distance is a popular benchmark for measuring distances between data points.
- Machine learning metrics and evaluation methods. 33:06
- Model validation using statistics and information theory. 37:08
- Entropy, its roots in classical mechanics and thermodynamics, and its application in information theory, particularly Shannon entropy calculation.
- The importance of the use case and validation metrics for machine learning models.
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mangalik.
Susan Peich:Hello, everybody. Welcome to another episode of the AI Fundamentalists, where today we're going to continue with model validation, but focus specifically on performance. Before we get into that topic, though, I want to turn to the guys and ask how we're showing up today. It's October 30, the day of this recording, and of course that is also the day of the release of the Biden administration's executive order on AI for the United States.
Andrew Clark:Yeah, I'm still trying to do a deep dive on it; there's a lot there. I think it's going to take everybody some time, so we'll have to do a subsequent podcast where we can address it fully, as we're all still digesting it. It seems very wide-reaching for the federal government, which is where most of the applicability will be. One thing we're very excited to see: a couple of months ago, when the Biden administration was talking about how they were going to approach it, everybody knew an executive order was coming, and it kind of looked like they were going to create their own new framework. So I'm really happy to see that they put their cards behind NIST and are asking for a couple of expansions to the NIST AI Risk Management Framework. We were part of that whole journey of creating it; I think we did two or three different rounds of responses. So we're very excited that NIST is going to be the framework of choice for the administration. But there are a couple of hidden things in there that could be interesting. Sid, do you want to talk about that red team comment that's in there?
Sid Mangalik:I haven't had a lot of time to dig into the red team part of it, but I think it's going to be interesting to see that testing isn't just going to be about passing some set of requirements; it's actually going to involve evaluating the models against adversarial attacks, which we do expect to see going forward as these models are put into deployment.
Andrew Clark:Definitely. The couple of sentences I pulled out that I thought were interesting, I'll quote directly here: in accordance with the Defense Production Act, the order will require companies developing any foundation model that poses a serious risk to national security, national economic security, or national public health and safety must notify the federal government when training the model and must share the results of all red team safety tests. I thought that was a little interesting; that might have the most bite of anything I've seen. And it's a huge document, we're still trying to digest it, it came out a couple of hours ago, so we'll do the full podcast later. But that's pretty wide-reaching. That would mean OpenAI, Bard, all of these different foundation models, because the public health and safety part is the one that sweeps in a lot. If you're Booz Allen or Lockheed Martin developing something, yeah, that makes sense. But I don't know how wide-reaching this actually is, or how it would stand up in court. I found that part pretty interesting.
Susan Peich:Yeah, and look, this is the US; this is probably the biggest and most comprehensive executive order that has been put out on AI by the US. We're also looking at other countries: we know the EU AI Act is very far ahead on this, and we could also argue that some countries like Singapore or China have been moving very aggressively forward with AI regulations and proposals. So I think the combination of all of this gets the US ready for the Global AI Safety Summit coming up, along with some things that are happening in the UK. So it's a wait-and-see for sure. Like we said, we're still digesting it, but there are definitely some things in there that go way deeper than a surface-level "show your work, show your requirements." So, to be determined.
Andrew Clark:Yes. And on some of those, like the quote I had, I don't know how that would apply if you're not a defense contractor or a government contractor. It's going to be interesting to see how much teeth there actually is without getting Congress involved. Some of the parts seem a little extensive for an executive order, so we'll be interested to see how that plays out.
Susan Peich:Andrew, when we were preparing for this episode, you shared an article more along the lines of artificial intelligence and how it relates to physics.
Andrew Clark:Yeah, I was very excited to see this come through. If you've been a longtime listener of the podcast, you know we're not massive fans of how the new large language models are being built, where AI is just kind of hacked together: here's a bunch of data, it might even be synthetically generated, just go do it, AI is smart, right? We don't really go along with that premise. The way these models have been trained, there are a lot of gaps, and they're made to sound correct or sound human, not actually be correct. So this article was published in the Wall Street Journal over the weekend, I think in the weekend edition; it's called "Stacking Boxes, Treating Cancer: AI Needs to Learn Physics First," which was great. It basically said that for what people want AI models to do, a lot of scientists building robotics and similar systems are saying you need to start building in the physical ramifications, which is a lot of what we've been talking about. Instead of just having AI learn how to play chess by looking at a bunch of data, where it will do things you're not allowed to do, it will cheat, the idea is to build operating environments where it has to respect physical laws. If you're building a robot, it knows it can't walk through a wall; if it hits a wall, it has to turn around and move. It's a different approach: let's build scientifically based AI models that have to respond to the physical laws we know from physics. There have been a lot of people talking about these different approaches to modeling. And don't get us wrong, we're not anti-AI on this podcast; we just don't think there's much of a future in making these cognitive systems based off of the biased large language models that currently exist. But I thought this was great: maybe the research is finally starting to move in the direction of, when we know the physics and we're building a physical thing, let's make sure the system knows the rules of the game, and then it can learn within them. They were actually talking about digital twins on jet engines and things too, so there's a lot of applicability to what we've talked about in the past.
Susan Peich:That all being said, let's move into our topic for today on model validation, and really dig in on performance.
Sid Mangalik:Yeah, so let's be a little bit specific about what we mean by performance today. Performance here is not literally how fast your model computes outputs, or how fast the system processes data. When we're talking about performance today, we're talking about evaluating whether the model generates the outcomes you expect it to generate: how good is it, in terms of how correct is it. And that can take a couple of forms. We want to talk about the large umbrella of ways you can do this type of performance evaluation, the ways in which we sometimes get it wrong, and the ways we sometimes oversimplify our understanding of it and don't deeply engage with the correct ways of evaluating different types of models.
Andrew Clark:Definitely, and that's one of the key parts of the first principles: the fundamentals matter. One of the key things we talk about on this podcast is that data scientists often have a pet metric they like to use for everything. One of the goals today is to give a survey of the many metrics out there, because which metric or metrics you should use depends a lot on your use case, what you're doing, and why. If you have a physics-based system, understanding those limitations, understanding what makes sense, and really understanding the goal you're trying to accomplish is a good starting spot. So we're going to break the discussion into a couple of different types. We'll focus primarily on classification models and regression models, as those are the two main classes, and then we'll get into a couple of other interesting metrics as well. Sid, would you like to start us off with classification?
Sid Mangalik:Yeah. So when we think about classification, we're thinking about simple "is this a cat or is this a dog?" type problems. These are problems that have objectively correct and wrong answers. You can either be right or you can be wrong; there's not a lot of fuzziness, no "it's kind of a dog." You have to take a stance one way or the other. And when you think about how you want to map your predictions against the actual data, you need some way to categorize this. So if you have 100 correctly labeled cats and dogs, and you have 100 predictions, we want to compare the true values versus the predicted values. Maybe we just go down the list and make a little two-by-two matrix, which shows where we were correct in what we guessed and where we were wrong in what we guessed. This is the classic confusion matrix, which on one axis is true and false, and on the other axis is positive or negative. So the four types of results you can get are: true positive, meaning you were correct about a positive outcome; false positive, meaning you were incorrect about a positive outcome; true negative, where you were correct about a negative outcome; and false negative, where you were incorrect about a negative outcome. Maybe we can swap over to cancer screening later, because there it's a little more obvious what a positive or negative outcome is. But let's start with that, and then let's see what we can build off of just those four things.
Andrew Clark:Definitely. That's a great overview of the four building blocks of what we normally call a confusion matrix, where you can actually put that grid together and see how they work. There are really three standard metrics composed of those elements, and then there are variations built on top of them. Calling them the three standard ones could be debatable, but they're the building blocks, and which you use depends on the goal you're trying to calculate. So I'll define what those three are, and then, Sid, you can talk about the goals of each before we get into the more in-depth metrics. First there's accuracy, which is the overall proportion of instances correctly identified: the number of true positives plus true negatives, divided by the total number of instances. That's really what accuracy is, just how accurate was I at predicting cat or dog. Say we have 100 dogs and 100 cats; you tally how many I labeled correctly and calculate the accuracy from there. Where this gets really interesting is when your classes are imbalanced. If you have a fairly even split, say 50/50 for the sake of illustration, then accuracy can be a pretty reasonable metric, no pun intended, for that calculation. But let's say you have 99 cats and one dog. Well, you can be 99% accurate by always saying cat, and that's where it starts getting into the nuances that bring us to the next two, recall and precision, which are built from those same elements. Recall is the completeness of positive predictions: how many of the actual positives did we get correct? That helps adjust for the imbalance. For instance, if there are 99 cats and one dog, and what I care about is catching the dog, then if I always say cat, my accuracy could be high but my recall would be extremely low, because I'm never catching the thing I care about; I'm not identifying when that dog exists.
Sid Mangalik:Yeah. So if recall is basically the quantity aspect, precision is the recall aspect, which is: for the positive results, how well did we do on the positive results? It's kind of like the dartboard you throw darts at: when you're aiming for the center, how often are you hitting the center? Of the results you called positive, how often did you get the correct result, ignoring the negative results? And what's nice is that you then have those two pieces: the quantity, meaning are you getting the correct number of positive outcomes, and the quality, are you getting those ones right when you do call them. And then you can combine those two metrics into the F1 score. It's a very popular metric you've probably heard of, but it's good to think about what it's doing to these two values: it's the harmonic mean of recall and precision, basically weighting the two equally and combining them into one score. That's the F1 score.
Susan Peich:Yeah, I have a question, because you framed precision as an aspect of recall. Did I hear that correctly?
Andrew Clark:We might have misspoken; we've both mixed up a couple of terms.
Susan Peich:Is it? It actually sounded like a pattern, because it was precision and recall are aspects of accuracy, and then precision is an aspect of recall. I was wondering, is there some kind of pattern there in the criteria?
Sid Mangalik:So there's no real hierarchy to these rates. The confusion matrix just tells us about our true positives, true negatives, and so on, the four elements, and accuracy, recall, and precision come directly out of that table. Then you can combine recall and precision to make F1. So you can think of F1 as being made of two pieces, while accuracy stands on its own.
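For readers who want to see those pieces concretely, here is a minimal Python sketch (not from the episode; the example labels are invented) of the confusion-matrix counts and the metrics built from them:

```python
# Minimal sketch: confusion-matrix counts and the three standard metrics,
# computed from scratch for a binary cat/dog classifier (1 = dog, 0 = cat).
# The example labels below are invented purely for illustration.

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / (tp + fp + tn + fn)          # overall proportion correct
recall = tp / (tp + fn)                             # completeness of positive predictions
precision = tp / (tp + fp)                          # quality of positive predictions
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```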
Andrew Clark:And it's just using those components, true positive, true negative, false positive, false negative, and putting them together in different ways for accuracy, recall, and precision. Accuracy is the overall proportion you get correct. Recall is the completeness of positive predictions: if I want to find the dogs and the data is imbalanced, how many of the dogs am I actually catching? Precision is the quality: how many times am I right when I think I've found one? Those are the three parts, and the harmonic mean of recall and precision is the F1 score, which is a good aggregate instead of focusing on one of those elements by itself. If I had to choose only one metric for every classification use case, and I don't like choosing only one, F1 is the one that incorporates the best balance. One thing we often see in data science is defaulting to accuracy, which can work sometimes, or looking at accuracy, recall, and precision separately, which is fine, but it makes it more difficult to compare models: which matters more to you, recall or precision? It's also very easy to mix these up; even in this conversation Sid and I have both made little slips between true negative and false positive. If you're comparing a lot of models, or doing a parameter sweep like we've talked about with Monte Carlo, you have to know what you're optimizing for: recall, precision, or accuracy. You need to know what your goals are, and that's why F1 is a good basis if you had to choose only one metric. But that's not where you should stop. Maybe you still want that harmonic mean, that one statistic to go off of that incorporates these elements, but you care a lot more about one side. Say you have an imbalanced fraud classification problem: fraud doesn't exist 98% of the time, but when it does exist, you want to make sure you catch it. In that case recall is going to be very important. But you also don't want to just say everything is fraud so that you never miss it, because that's really bad precision. So one thing that has emerged as a good metric is the F-beta score. What F-beta does is expand the F1 score so that you can weight recall and precision. For instance, with that highly imbalanced fraud classifier we talked about, I can say I care five times more about recall than precision, or two times more, and tune that based on my use case. For a lot of classification models, if you understand what you're doing, that extension of F1 is what we recommend.
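As a rough illustration of the F-beta weighting Andrew describes, here is a small sketch; the precision and recall values are placeholders and the helper function is hypothetical:

```python
# Sketch of the F-beta score: beta > 1 weights recall more heavily,
# beta < 1 weights precision more heavily, and beta = 1 recovers F1.
# The precision/recall values below are placeholders for illustration.

def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.40, 0.90   # e.g. an imbalanced fraud classifier

print(f_beta(precision, recall, beta=1.0))  # plain F1
print(f_beta(precision, recall, beta=2.0))  # recall counts roughly twice as much
print(f_beta(precision, recall, beta=0.5))  # precision counts roughly twice as much
```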
Sid Mangalik:Yeah. So accuracy, recall, and precision all kind of fall out of the question of what your data looks like. If you have the best-case scenario, cats and dogs at 50/50, honestly, accuracy is probably all you want and need. It's very interpretable, very straightforward, very easy to use. But if you have an interesting dataset, which is 99% cats and one dog, and you run accuracy on the model that always guesses cat, you'll get 99%, you'll say we did a great job, and you'll turn in that model with no idea that you're never even guessing the minority class, which is the dog. So recall and precision fall out as a way of handling imbalanced outcomes, because you can't expect that in the real world you're always going to have a 50/50 classification task; in fact, those are often not even the most interesting problems.
Susan Peich:That's a great synopsis. Let's go through now and consider performance metrics for regression, the common performance metrics.
Andrew Clark:Yes, so we just finished up classification, which, like we've seen, could be cat versus dog, or cat, dog, ostrich, whatever the classes are; classification is very common in data science. Now we're talking about regression, where I'm predicting a continuous outcome. Think of a credit score, or let's say a risk score, where zero means no risk whatsoever, free money, and 100 means very, very risky; I'm just making this example up. That's regression: a continuous output from zero to 100. If you think about that, you can't really compute accuracy with the same calculation, because the building blocks we used, true positive, false positive, true negative, false negative, don't work well when you have 100 different possible outcomes on a continuous scale. There are different types of models you would use, and we can always do another podcast on model selection, but the metrics also have to change. It's of course debatable, but there are four main metrics that are pretty common that we'll talk about here. The first is mean squared error, which penalizes large errors. It's a function that basically looks at how close my predictions are to the actual values. If the actual data says the outcome is 50 and my model predicted 60, you take the error, 60 minus 50 is 10, and square it, so that prediction contributes 100. Then you average that over your whole validation dataset. I'll hand it off to you, Sid, to talk through anything else around that and then the next metric.
Sid Mangalik:Yeah, that's good. So I'm just going to give the audience a reminder of mean squared error; you can think of it as three operations. Error is the first operation. We have our list of numerical true values, say the price of a house, and our predictions of the prices of those houses. Error is just literally the subtraction: what is the distance between the true value of the house and your prediction of the house? We then take the square of those errors. Squaring makes negative values positive, and it makes really big distances quadratically more drastic than ones close to the true value, so you get a big penalty for large errors. And then we just take the mean of those squared errors, which gives us a single value, a single metric from this test. What people like to do then, since you can't really interpret what a mean squared error is, it's just kind of a number, is take the root of that number. If we take the root, the result will look closer to the original errors. So if we were typically off by about 100 on the house price, the root mean squared error might be something like 103. It's not going to be exactly the same, because these operations don't commute, but at least it will look like the original numbers. Now, let's say we don't care about penalizing really big mistakes more heavily, and we just want to scale the penalty linearly. That's where MAE, mean absolute error, comes in, and it's just as valid a metric; it's the L1 loss you learn about in an intro course. The use case here is that you don't have to punish large errors heavily, so instead of squaring the error, the distance between your prediction and the true value, you just take the absolute value of that number: all your negative errors become positive, the positive ones stay the same, and you average them. A very straightforward metric for measuring error.
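A short sketch of the three operations Sid walks through, using invented house prices:

```python
import numpy as np

# Sketch of the three regression errors discussed: MSE, RMSE, and MAE,
# using invented house-price predictions (in thousands of dollars).

y_true = np.array([250.0, 310.0, 410.0, 500.0, 275.0])   # actual sale prices
y_pred = np.array([260.0, 300.0, 430.0, 450.0, 280.0])   # model predictions

errors = y_pred - y_true                 # step 1: the raw error (subtraction)
mse = np.mean(errors ** 2)               # step 2 and 3: square, then mean; penalizes big misses
rmse = np.sqrt(mse)                      # back to roughly the original units
mae = np.mean(np.abs(errors))            # linear penalty: each unit of error counts once

print(f"MSE={mse:.1f}  RMSE={rmse:.1f}  MAE={mae:.1f}")
```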
Andrew Clark:I think that's great, a really clear explanation, thanks for that. There are a lot of other metrics as well. From information theory, and I'm not a huge expert in information theory, there are AIC and BIC, which you can use for comparison; they're based on an information component, we'll get into that a little more later, and lower is better. But for machine learning specifically, MSE is kind of one of your best metrics. All of these regression metrics are a little harder to interpret than accuracy: smaller is always better, but it's hard to know right off the bat what a given value means. Where MSE gets very helpful for machine learning is that it's differentiable, which means you can take its derivative, which lets you compute gradients for gradient descent and convex optimization, which is how a machine learning model can be trained. Gradient descent is like standing on top of a mountain and figuring out how to get down the fastest: you look at which direction has the steepest slope and step that way, like a person walking downhill. MSE is differentiable; RMSE and MAE are not as friendly to differentiate. So that's why MSE is often the metric that's easier to use in training, because of that differentiability in convex optimization.
Sid Mangalik:Yeah, and interpretability is a really big problem with these regression metrics: it's not obvious to someone when you say the error for this model is 10. What does 10 mean? How good is the model? It's not a score between zero and one; it's an unbounded error. So one metric that has been really nice for explaining what these models do is the R-squared metric. In an idealized, simplified world, you can think of R-squared as basically being the square of the Pearson correlation between the true values and the predictions. So instead of measuring error, you compute the correlation between those two and then square it. Sometimes people call this R two, but it's really just R, the Pearson correlation, squared. And it's a nice interpretable value because it's between zero and one, and it tells you how well the independent variables explain the variability of the dependent variable.
Andrew Clark:Excellent, great explanation. This is used very often in statistics for regressions and things like that; statistical modeling often uses R-squared and AIC, that information criterion. Where it's not as helpful in machine learning is that with machine learning you often have more variables. When you have a lot more inputs, MSE evaluates the model as a whole, whereas R-squared is that Pearson-correlation view of how the independent variables explain the variability. Adjusted R-squared helps with this a little bit, because it penalizes the value as you add more explanatory variables. When you have only, say, three inputs trying to explain one output in a linear regression model, R-squared works great. But when you get into machine learning land, you might have 50 explanatory variables, and it gets a lot harder to see what's good or bad, so there's the adjusted version. That, plus the differentiability of MSE, is why MSE is kind of the good default. So if we had to choose: F1 is your good benchmark go-to in classification, and MSE is your good benchmark go-to in regression.
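A minimal sketch of R-squared and adjusted R-squared under the simplified framing above; the data and the assumed number of predictors are invented:

```python
import numpy as np

# Sketch of R-squared and adjusted R-squared for a regression fit.
# In the simple ordinary-least-squares case with an intercept, R^2 matches
# the squared Pearson correlation framing used in the episode.
# The data below is invented for illustration.

y_true = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
y_pred = np.array([3.0, 4.2, 5.0, 6.3, 7.1, 8.2])

ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y_true), 3                              # n samples, p predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # penalizes extra explanatory variables

print(f"R^2={r2:.3f}  adjusted R^2={adj_r2:.3f}")
```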
Susan Peich:Let's switch focus for a little bit to another aspect of performance, and it's a hard one. We mentioned at the top of the podcast that performance isn't necessarily computational speed, but there is also performance in network optimization.
Andrew Clark:Yes, we talked about operations research a little last time; I always love operations research type problems. There are a lot of interesting problems rooted in graph theory, which is the basis for many of them: nodes and edges. Nodes, or vertices, are the points, which could be locations, and they're connected by edges, like streets, if you think of how Google Maps would work. So you have edges, the streets connecting the nodes or vertices, and you want to solve the shortest path problem: how do we find the quickest route from point A to point B? There are a bunch of different optimization algorithms there, and you can weight the edges. It all sits on this topological graph from graph theory, and you find the path between two vertices that is the smallest. There are lots of interesting calculations you can do once you start looking at different spaces, and lots of interesting use cases depending on what you're working on. Even with some clustering you get into distance metrics as well; it's a whole field we can't spend much time on, since we've been focusing primarily on classification and regression, but there's a lot of really interesting stuff. We're just trying to survey here.
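As a rough sketch of the shortest-path idea, here is a small example using the networkx library (one common choice, not necessarily what the hosts use); the node names and edge weights are invented:

```python
import networkx as nx

# Sketch of the shortest-path problem on a weighted graph.
# Nodes could be locations; edge weights could be travel times.
# All names and weights below are invented for illustration.

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 4), ("A", "C", 2),
    ("C", "B", 1), ("B", "D", 5),
    ("C", "D", 8),
])

path = nx.shortest_path(G, source="A", target="D", weight="weight")
cost = nx.shortest_path_length(G, source="A", target="D", weight="weight")
print(path, cost)   # ['A', 'C', 'B', 'D'] with total weight 8
```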
Sid Mangalik:here. Yeah. And so I mean, there's a lot of fun distances, and you could talk about distances all day. But usually the most popular benchmark one is just the normal old Euclidean distance you learn. And the Euclidean distance is Pythagoras theorem, would you've learned works great for triangles and finding distances between points, just expand that out to the nth dimension, right. So that's the same equation, you know, and love, a little bit modified. So we can scale up to running 50 columns of data, right? And determining how close is one, one row of data to another row of data. And so that's, that's, this is like the classic benchmark, because it just weights all the dimensions evenly. And that's fits if that's all you want. That's what people will often use.
Andrew Clark:Additionally, there's distribution checking, which is a little bit different: checking distributions is another kind of check. We're trying to outline a bunch of different checks you can do depending on your use case, and that's the thing to really hone in on: what are you doing and why? Your metric shouldn't just be the pet metric you always use; it should follow from the exact use case. For some use cases in statistics you're checking a distribution; if we're talking about model or feature drift, those sorts of things, you'll have a statistical distribution. A lot of people are familiar with the standard normal bell curve. Here at Monitaur we're not huge fans of leaning on the normal distribution from frequentist statistics; I think everybody likes to make things normal because the math is easy, but in the real world things are messy. So personally, we like nonparametric methods, which means you don't assume the distribution. There are distributional tests that calculate the difference between two normal distributions by looking at the standard deviations and means, and you can also compare gamma distributions, Gumbel distributions, all sorts of distributions, but then you have to know which distributions you're checking. That's why we like nonparametric statistics, which could be a podcast in itself, which basically works by ranking and ordering and calculates the differences in those values. If I ranked a bunch of SAT scores against another bunch of SAT scores, I could compare those ranks against each other without assuming a distribution. For continuous data there's the KS test, which essentially does that rank ordering and lets you compare distributions; it still gives you p-values to check how similar the distributions are. If they're very similar, you'll have a high p-value, something like 0.5 or 0.4, and you won't reject the null hypothesis that the distributions are the same. We can get into the politics of p-values and hypothesis testing on another podcast. If the data is categorical, you have the chi-square test. Think of a spreadsheet of category counts: how many times did a male get accepted or denied, how many times did a female, if we're doing bias testing, a little frequency matrix. The chi-square test basically calculates the difference in those frequencies; it's just checking frequencies. So for distributional checking there's a whole set of tests, and they don't fit as neatly as the classification metrics do; it's a bit more nuanced which test to use and why. You can't really apply a classification metric to distributional checking, but you do have to understand the statistical assumptions to know which test to use.
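A sketch of the two nonparametric checks Andrew mentions, the two-sample KS test for continuous data and the chi-square test for categorical counts, using scipy; the samples and the contingency table are invented:

```python
import numpy as np
from scipy import stats

# Sketch of two distribution checks; all data below is invented for illustration.

rng = np.random.default_rng(0)

# Continuous data: two-sample Kolmogorov-Smirnov test (rank/ordering based,
# no distributional assumption required).
sample_a = rng.normal(loc=0.0, scale=1.0, size=500)    # e.g. a training-time feature
sample_b = rng.normal(loc=0.1, scale=1.0, size=500)    # e.g. the same feature in production
ks_stat, ks_p = stats.ks_2samp(sample_a, sample_b)
print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3f}")

# Categorical data: chi-square test on a frequency table
# (rows: group, columns: accepted / denied).
table = np.array([[90, 10],
                  [75, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square={chi2:.2f}, p-value={chi_p:.3f}")
```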
Susan Peich:But Andrew, when you were talking about shortest path, you mentioned Google, you mentioned that this is what you see in Google Maps. And on a previous podcast, when we alluded to the fact that we'd be covering this topic, you said we're going back to Google Maps. What is so significant about shortest path and Euclidean distance with regard to that?
Andrew Clark:I think there's just so much really well-done work in graph theory and operations research that has solved some really difficult problems in logistics. Think about how global supply chains work: complex algorithms finding the fastest way to ship goods from point A to point B through different ports, given different constraints, with linear programming and multi-objective optimization, weights and constraints, to figure out how to do something. I think it's very impressive, and specifically that there's theory behind it, and then there are implementations that are very accurate and rarely wrong. Some of these problems are a bit more definite; there is a fastest path, which you won't always have in machine learning. But the big takeaway is that there are research-backed, efficient, effective methods developed in other fields that work so well people don't even know there's a model doing things. It just works: you put it in your phone, and oh, there's a crash, and it just works. How does it work? Most people don't know, but it works. We've gotten machines that good at these calculations. Then we come to machine learning and it's like 1845, the Wild West, everybody's off on a gold rush; it's a lawless country out there with no rules, and this level of bad performance is just accepted because it's cool. Whereas these other fields don't get press because they're so efficient. So I think it's a good area for inspiration while we're building systems, and with that physics comment we made earlier: understand the physics of the system. My biggest pet peeve with the machine learning and computer science communities, the OpenAIs of the world, is this hubris that we're doing something that's never been done before, we're the experts, we're the geniuses, and we're just swagging it, downloading information off the internet, and it's going to be great. Versus understanding how we get from point A to point B using all of the existing theory and taking inspiration from different disciplines. So I'll get off my soapbox, but I've been very intrigued by that area.
Susan Peich:And Sid, Andrew has set that up so eloquently for you to take us into some more aspects with regard to curves.
Sid Mangalik:Yeah. So AUC and ROC kind of fall out of an industry problem where people don't necessarily want to figure out what the right beta is for the F1 score, and they don't want to tune that precision-recall balance. Is there an easy way to do that? Well, it's not easier, but it gives you the result you want. AUC stands for area under the curve, and we're going to be doing a little bit of integral calculus, comparing the false positive rate to the true positive rate as we move the threshold for making a decision. That's the probability cutoff for calling a prediction one way or the other: how high does our confidence have to be to make a decision? You move that threshold along, generate a bunch of points, and take the area under that curve. And that gives a nice score that goes all the way up to one; it's not some weird fuzzy metric. So this has been really popular in industry, where people don't want to worry about the rates of the different outcomes; they don't want to care how many dogs and how many cats are in there, they want a single number. It's been a really nice metric that has been catching on lately, though it's not seen too much outside of industry and some academic papers.
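A small sketch of the threshold sweep behind AUC-ROC, using scikit-learn; the labels and scores are invented:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Sketch of AUC-ROC: sweep the decision threshold, trace false-positive rate
# against true-positive rate, and take the area under that curve.
# Labels and scores below are invented for illustration.

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # the threshold sweep
auc = roc_auc_score(y_true, y_score)                # area under that curve

print(f"AUC = {auc:.3f}")
# np.trapz(tpr, fpr) gives the same area by direct trapezoidal integration.
```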
Andrew Clark:So to wrap up our whirlwind tour, we'll talk about cross-validation, which is less a metric and more a method machine learning can use, again based on the kind of data-mining approach we're not as fond of compared with the network optimization route. Essentially, you take your training data and cut it up into segments, which artificially stretches your training set: it lets your model see different parts of the dataset and not overfit as much, because you're rotating through the folds. That eliminates some of the luck involved in a single train-test split. If you just do an 80/20 split, 80% training and 20% validation, which is common, some of the most interesting, difficult samples could end up in that 20%, so your model never gets to train on them. With cross-validation, or a stratified version of it, the learning algorithm gets to see and learn from all the information it can in your dataset.
Sid Mangalik:Yeah, so the big point of bringing up cross-validation here is that it may not be the score you report at the end of the project, but it's the score you look at while you're training the model. This isn't the type of evaluation you show off at the end to demonstrate how good your model is; it's something you use when you're first building the model, to make sure it works the way you think it works with the training data it's been given.
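A minimal sketch of k-fold cross-validation as a training-time check, using scikit-learn on a synthetic imbalanced dataset (an assumption for illustration, not the hosts' setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Sketch of k-fold cross-validation: the data is cut into folds so every
# sample is held out exactly once. Synthetic data and a simple model are
# used purely for illustration.

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)  # imbalanced toy dataset
model = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratified folds
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")      # F1 per fold

print(scores, scores.mean())
```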
Susan Peich:And then I think we have a final aspect with regard to entropy.
Sid Mangalik:Yeah. In our oddball metrics for performance, entropy is probably one of the least understood, and it's most often seen in tree-based methods. If you've worked with a decision tree before, you know that at the end the tree makes some classifications or regressions, and then you can use the metrics we talked about before. But what makes a good tree while you're building it, before the tree is finished? How do you know how well a tree is doing? This is where all those nice people working in information theory come in, and they give us this measure of entropy: literally, how much information gain do you get when you make one cut on the tree? If you're working on this huge dataset, are you going to look at tail length first, or fur length, or fur color? What's going to be the best feature to split this data between cats and dogs? Gini impurity is then a simple measure of the information gain you get from any individual cut on a tree, so it's a really important metric for evaluating whether your cuts are doing what you think they're doing.
Andrew Clark:Entropy is very interesting, and it's a wide field. It has its roots, I think, in classical mechanics and thermodynamics, where it's the state of disorder of a system. That's not quite the definition we're using here; we're coming more from the information theory community, which is very related. Personally, I want to learn more about the physics side, the second law of thermodynamics, which is really about the entropy of a system; I don't fully understand that yet, but I want to read up on it. The information theory side of the house, which is where the Shannon entropy calculation and the Gini impurity I just mentioned come from, really comes out of Bell Labs. Lots of great innovations came out of Bell Labs in the 40s and 50s, and one individual, Claude Shannon, came up with the idea while working there. He was one of the founders of information theory, which deals with things like data compression and signal processing; thinking about telephones and the like, he came up with this information-theoretic entropy calculation. What it essentially means is that an event carries more information when there's a lower probability of it occurring. Think of compression: zip files on your computer use this information theory as their basis. Low-probability events have higher information, so you need to make sure you capture those. If you know that every single day I'm going to go get a Starbucks coffee, that's very predictable; it's not new information about me. I don't actually drink Starbucks every day, and I don't even like Starbucks, but that's neither here nor there. The point is, it's low information because it's not surprising. What's interesting about the Shannon entropy calculation is that the lower the probability, the higher the information; it's a directly inverse relationship. So depending on what you're trying to do, if you're evaluating a lot of different metrics, or a lot of different Monte Carlo runs, you can use an entropy calculation to figure out which are the most important. If I'm taking 300 Monte Carlo runs as an example, perturbing different parameters, and I calculate the Shannon entropy on top of those runs, the ones with the highest information mean something surprising is happening, because they're different from the rest. That's a complex topic on its own, but when you're looking at all these large calculations and figuring out which metrics to use, this is where information theory is an interesting area. It's really left field and very different from ML, but the reason it was helpful to bring in is: what are you doing and why, and what's out there in the literature? There are some things like information theory that get almost no press but could be the best solution for your problem.
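A short sketch of Shannon entropy and Gini impurity for a class distribution, with invented probabilities:

```python
import numpy as np

# Sketch of Shannon entropy and Gini impurity for a class distribution,
# e.g. the mix of cats and dogs that reaches a node of a decision tree.
# The probabilities below are invented for illustration.

def shannon_entropy(p: np.ndarray) -> float:
    """H = -sum(p * log2(p)); rarer outcomes carry more information."""
    p = p[p > 0]                      # ignore zero-probability classes
    return float(-np.sum(p * np.log2(p)))

def gini_impurity(p: np.ndarray) -> float:
    """G = 1 - sum(p^2); 0 means a perfectly pure node."""
    return float(1.0 - np.sum(p ** 2))

balanced = np.array([0.5, 0.5])       # 50/50 cats and dogs: maximum uncertainty
skewed = np.array([0.99, 0.01])       # 99 cats, 1 dog: nearly pure, low entropy

print(shannon_entropy(balanced), gini_impurity(balanced))   # 1.0, 0.5
print(shannon_entropy(skewed), gini_impurity(skewed))       # ~0.081, ~0.0198
```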
Susan Peich:Yeah, and I think there's a lot more to explore there. I want to bring it back to some of the things we've just discussed: we've gone through classification, some items on distribution checking, cross-validation, but one of our goals in going deep on performance was to bring it back as an aspect of model validation. I'm going to start with Sid: summarize for us what we just learned in the bigger picture of model validation.
Sid Mangalik:So at a very, very high level, and maybe it wasn't obvious, what we learned was basically how to do statistics correctly and not use statistics to lie, because we might be tempted to just use the metrics we know and not be willing to understand what they're actually doing. Someone who just wants to tell their boss they did great is going to report accuracy on 99 cats and call it a day. But someone who really wants to understand their model and evaluate it in a meaningful way is going to look at the F1 scores, the AUC curves, really understand what's happening with these metrics, and report them in a way that informs not only the people who are going to use these models, but also the people making them, so we're sure the models do what we say they do. And I'll hand it over to Andrew to talk a little about what that nuance looks like in regression, because I just focused on classification.
Andrew Clark:That's a great summary. For regression, of course, it's a little more difficult, but it's very much: what are you trying to do, and what are you trying to validate? If you know your dataset has a lot of outliers, as an example, and you want to make sure you get a good handle on those outliers, then you probably want to use MSE, because it penalizes those large errors more. You want to understand exactly what you're doing and why: if you want a model that handles those errors well, make sure you're using the metric that matches the use case. Or if you want a parsimonious model where every single input is explaining the output, then R-squared or adjusted R-squared would be a good way to determine that fit. So it really comes back to: what are you doing, why, and what are you validating? That's the basis of all of this. Choose your metrics accordingly and do a literature review; there's a lot of information out there. Figure out what you're looking for; I guarantee someone's done it before and has different metric approaches. Don't just go into your normal toolbox; try to expand the size of that toolbox, use the right tool for the job, and explain what you did and why.
Susan Peich:Perfect, thank you guys for summing that up so nicely. I think we're going to wrap things up here for this episode. But before we do, a little shout-out that this is episode nine, which means our next episode is episode 10, the 10th episode of the AI Fundamentalists. We have a few ideas of what we want to do to celebrate in that episode, but we would love to hear your feedback, particularly if there are any subjects from past episodes you want us to go back and explore. And after our 10th episode, we will continue our model validation series, with a bit of a focus on bias; look forward to some surprise guests there as well to help us out. That's it for today; we'll see you next time.