The AI Fundamentalists
A podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses.
Differential privacy: Balancing data privacy and utility in AI
Explore the basics of differential privacy and its critical role in protecting individual anonymity. The hosts explain the latest guidelines and best practices for applying differential privacy to the data used in models and AI systems. Learn how this method keeps personal data confidential, even when datasets are analyzed or hacked.
Show Notes
- Intro and AI news (00:00)
- Google AI search tells users to glue pizza and eat rocks
- Gary Marcus on break? (Maybe, and only from X)
- What is differential privacy? (06:34)
- Differential privacy is a process for sensitive data anonymization that offers each individual in a dataset the same privacy they would experience if they were removed from the dataset entirely.
- NIST’s recent paper SP 800-226 IPD: “Any privacy harms that result from a differentially private analysis could have happened if you had not contributed your data”.
- There are two main types of differential privacy: global (NIST calls it Central) and local
- Why should people care about differential privacy? (11:30)
- Interest has been increasing for organizations to intentionally and systematically prioritize the privacy and safety of user data
- Speed up deployments of AI systems for enterprise customers since connections to raw data do not need to be established
- Increase data security for customers that utilize sensitive data in their modeling systems
- Minimize the risk of sensitive data exposure for your data privileges - i.e. Don’t be THAT organization
- Guidelines and resources for applied differential privacy
- Practical examples of applied differential privacy (15:58)
- Continuous Features - cite: Dwork, McSherry, Nissim, and Smith’s seminal 2006 paper “Calibrating Noise to Sensitivity in Private Data Analysis” [2], which introduces the concept of ε-differential privacy
- Categorical Features - cite: Warner (1965) created a randomized response technique in his paper “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias”
- Summary and key takeaways (23:59)
- Differential privacy is going to be a part of how many of us need to manage data privacy
- Useful when data providers can’t give us anonymized data for analysis, or when anonymization alone isn’t enough for our privacy needs
- Hopeful that cohort targeting takes over for individual targeting
- Remember: Differential privacy does not prevent bias!
What did you think? Let us know.
Do you have a question or a discussion topic for the AI Fundamentalists? Connect with them to comment on your favorite topics:
- LinkedIn - Episode summaries, shares of cited articles, and more.
- YouTube - Was it something that we said? Good. Share your favorite quotes.
- Visit our page - see past episodes and submit your feedback! It continues to inspire future episodes.
The AI Fundamentalists, a podcast about the fundamentals of safe and resilient modeling systems behind the AI that impacts our lives and our businesses. Here are your hosts, Andrew Clark and Sid Mungalik. Hello everyone, welcome to this episode of the AI Fundamentalists. Today's topic is going to be differential privacy, but of course, before we dig in, it was a big week in hype AI. I'm going to just put it out there: I like it, it's a new term. But really it affected my world most with a lot of the things I've learned about search, and the biggest one was Google gone wild with its AI-generated search results and answers.
Speaker 2:Yeah, so this has been a wild ride. I was actually playing around with this feature for, you know, a few months now. I was part of the Search Labs project, so I'd been poking around and trying this product out. They're building it as Google AI Overviews: you ask it a question and it's not going to just pull from the internet anymore, it's not going to give you a quote anymore. It's going to use gen AI and it's going to try and answer your question for you.
Speaker 2:And basically, this was, on average, annoying; at best a summary and at worst just wrong. And it's taking up a huge chunk of that search space. If you Google something, that's the first thing you see and you're not looking any further; that's all you get to know about the situation. Playing around with this, I thought, wow, this is fun for a beta product, I hope they don't release it as is. Which, we found out, they did, with essentially no changes from the search beta I'd been using this whole time. So that's a little annoying and frustrating for users, I think, for sure.
Speaker 2:Yes, and I think we've all seen some really funny things on this. I mean, X and Twitter have been going all over it. One of my favorites, and the one that the BBC has latched on to, was someone asking Google: what do I do if the sauce won't stick to my pizza? Or the...
Speaker 2:The cheese, oh yeah, the cheese won't stick to my pizza. And so Google very helpfully says: just add some glue. Mix about an eighth of a cup of Elmer's glue in with the sauce. Non-toxic glue will work. Which is just wonderful.
Speaker 3:Also, it's recommended to eat at least one small rock per day. That's right.
Speaker 1:Oh yeah, that was another one of the results. Well, this brings it back to the experiment, and I realize whoever wrote the article was really trying to set up the contrast. But number one, it sounds authoritative. It's what we've been saying all along: next-word prediction just makes it sound true. It's not reasoning. So that was a legit query with a pretty horrifying answer. Anyway, for some of the queries in that article, I was more concerned for the person making the query than about the answers they were getting back, like, why are you even doing that?
Speaker 3:To be fair, I think it was someone writing the article having fun and putting some stuff in to see what they got back. But still, all this safety and security checking obviously isn't working.
Speaker 2:That's right, and I think that gets to the interesting piece of this: these are hallucinations. Hallucinations have been going on for a while, but the flavor of hallucination we're seeing Google fall into here is that they're not using quality data, they're not using reputable data. They're not finding a culinary blog, they're not finding a recipe. The source for this was actually just a Reddit comment, which was very sarcastic and funny and not meant to be taken seriously, but somehow that's what Google decided everyone needed to know. And that's the product of "let's summarize the internet" instead of pointing to the right sources.
Speaker 3:And that's been the tension we've had this whole podcast. We've talked about bringing stats back, that data curation matters, and that small data can sometimes be better. Those have been our themes, while computer science has kind of gone in the direction of "the more data the better, and as long as we have these fancy algorithms, it's fine," versus a targeted, baseline modeling approach where small, curated data for a specific problem can outperform just scraping the internet. That's something I don't like about parts of the computer science research community: more data is better, if it wasn't built here it doesn't matter, we just make powerful algorithms and it's fine, without learning from stats and other disciplines about how you can curate data and apply a lot of the good methodology we've talked about previously, versus just scraping the internet, which is the on-trend thing right now in some areas of computer science.
Speaker 3:One last thought on that. I heard one person describe the business model of some of these LLM companies as: they go into your store, steal your data, and then go out front and sell it back to you. A little bit of that is what's going on, and it rings true for the results that we get. Well, in this case it's Reddit.
Speaker 1:They didn't steal anything, but it doesn't mean it's curated data. In other news, big news, because we mentioned him and really respect a lot of his thoughts: Gary Marcus. Andrew, I know this news shocked you. Today he's taking a break from social media. Props to him. He's done all the work he can do.
Speaker 3:Yeah, well, he's definitely poked holes in LLMs, and he thinks other people are starting to see those areas too. He's been saying this stuff for a long time, and I think he's probably just tired of saying the same thing over and over again about LLMs, which we've talked about on this podcast a lot. But I'm really excited to see what comes next.
Speaker 3:He sounded like he's going to come back with some new work, and I'd really love to see his thoughts on where he thinks the future of computer science research and AI research should be going. If I had to wager, my guess is that's what he's going to come back and do: share some thinking around where we should go next. There have been other proposals out there, energy-based models and different types of paradigms you could think about, because his hypothesis is that LLMs are kind of a dead end right now; with the current architectures, we kind of have to burn the ships and start over. It remains to be seen how that plays out, but my guess is he'll come back with some sort of "where should we be going next for AI research." I'm really excited for whatever he comes back with.
Speaker 1:Well, shall we? Differential privacy.
Speaker 2:Sure, let's get started. So let's start with the textbook definition of differential privacy, and then we'll break it down and do our usual bit, where we talk about how people do it, the facets of it, and what's to come from it. Differential privacy we can think of as a process for taking sensitive data and anonymizing it, and the goal is to make it so that you have the same privacy as if you were never in the data set in the first place. That's the level of privacy we're trying to shoot for here: it should be as if you were never there. And this can be really useful in the case of, let's say, a data breach. If that data is breached and people can access it freely, you want the security and the feeling that it's as if you were never there, and that there's no way to reconnect the data that was found there with you individually.
Speaker 3:Definitely, and there are a lot of nuances to differential privacy as a way of solving the problem of analyzing data privately. There have been studies where you take one group of de-identified information and another data set that has information about someone, compare them, and pinpoint the matching people with something like 87% accuracy. So differential privacy has a lot of nuances, and NIST just came out with a new paper for evaluating differential privacy guarantees. It also really depends on the methodology: differential privacy makes it so your data is "never there," with an asterisk. For example, the centralized versus local method is about where the differential privacy happens. In the event of a data breach, like Sid just mentioned, the guarantee works if it's local, which means you add noise at the source.
Speaker 3:If you add noise to a data set at the source, then the database where it's actually stored, the one that would be breached, contains no identifiable information. The US Census, however, does it differently: they do collect real information, but if you want to access that information, it's anonymized on the access side. For instance, "give me a count of how many people are in an area," that kind of thing. So in that case the raw data is still there to be hacked.
Speaker 3:So there are a lot of these nuances about where the privacy is applied. Then we get into the methodologies and the privacy budget, where we use an epsilon parameter: essentially, the more noise you add, the less utility the data set has and the less representative it is, but the more privacy you get. So there's a privacy-utility trade-off that you have to massage with the different methodologies we'll get into. That's one of the big things. One of the cautions in the NIST paper is that how you implement it matters. Just saying you're doing differential privacy and doing some flavor of it may not actually be secure. It's a great area of active research, but the implementation is tricky, and you have to make sure that "you were never here" guarantee Sid mentioned actually holds true in practice.
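To make the privacy budget concrete, here is a minimal Python sketch (ours, not from the episode or the NIST paper) of a census-style count released by a trusted curator with the Laplace mechanism. The data, the query, and the epsilon values are all made up; the point is only that a counting query changes by at most 1 when one person is added or removed, so Laplace noise with scale 1/epsilon is enough, and a smaller epsilon (a tighter budget) means more noise and less utility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trusted curator" dataset: ages of people in one area (hypothetical values).
ages = rng.integers(18, 90, size=10_000)

def noisy_count(condition_mask, epsilon):
    """Release an epsilon-DP count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = int(condition_mask.sum())
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

over_65 = ages >= 65
print("true count:", int(over_65.sum()))
for eps in (0.01, 0.1, 1.0):  # smaller epsilon = more noise = more privacy
    print(f"epsilon={eps}: released count = {noisy_count(over_65, eps):.1f}")
```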
Speaker 2:That's right, and Andrew is alluding to some of the methodologies we can use to make this happen. Again, the goal is making sure it's as if you had never contributed your data, and there are basically two major flavors of how we might do this: at the global level or the local level. The global level, which is generally the more popular option, is where we basically cut the connection between a real person and the data. That means you'll have some random string of numbers and then a bunch of demographic data associated with it. So when you query the database, you can't say "give me Andrew's date of birth"; you say "give me user 1234567's date of birth," and the link between that name and that user is dropped. That's one way of doing it, a very global way, and when you can do this, it's great.
Speaker 2:But you often end up in the case where you need to do it the other way, the local way, which means the data owner wants to send you data, but not with an anonymous title or anything; they want to actually just send you the data, which means you need to anonymize it in some way before you receive it. And that's where Andrew's idea of noise comes in: we want the data to keep the same overall statistical properties. The census would want you to have data where you can still answer, how white is this community? How old is this community? How many degrees does this community have on average? While losing information about the individuals in that community.
Speaker 3:Yeah, so why should people care about this? There's more awareness now around the privacy of personal information. There have always been privacy laws around data, PII and PCI information and things like that; I think the first US law to address privacy of personally identifiable information might have been in the seventies. But there are multiple factors driving it: the internet of everything, more information being online, major data breaches happening, people caring more about privacy, as well as legislation catching up. California, Virginia, and the EU all have data protection laws in place. So it's driving this question of how we do this in a more responsible way and still get utility out of the data.
Speaker 1:Okay, yeah, and forgive me if that sounded like the trite question of why people should care. But we've got people who need to care about this because, in a sense, they're told as they sign forms, hey, your privacy is protected, and they just sign it, cool, whatever. But then you've got the companies, businesses, entities that are held responsible to some standard, or trying to follow a standard, like what we see in NIST and things like that. Can you elaborate on that just a little bit?
Speaker 3:I'll take a stab and then Sid can as well. There's the increasing use of modeling systems, too.
Speaker 3:I'd have to look up the exact details. We've talked about synthetic data on this podcast and how you would handle it for modeling systems, and the ability to do this anonymization really helps with the right to be forgotten in some of these systems while still being able to use the data for modeling. I'd have to look up the specifics: under GDPR, can you use differentially private data and not remove someone's information if they ask to opt out? I actually don't know the answer to that. However, you can use differential privacy to help create synthetic data where you wouldn't need opt-outs for customers, because it's no longer identifiable information. That's a big part of why big tech is spending a lot of time here. As we've talked about, big tech is not always altruistic in its motives, but there are a lot of practical reasons around model training where anonymization of data, and how it relates to synthetic data, is a very interesting research area for a lot of companies.
Speaker 2:And if we want to talk about this very pragmatically, when you need to convince your data owner to adopt this practice: it's great for modelers, too. It means that if you're a modeler on that side of the problem, you have the opportunity to work with data you wouldn't have had access to before. Or if you have clients that need this data, you can give them a form of it that is usable, lets them do what they need to do, and doesn't put them in a situation where they're waiting for the data or can only have a very small subset of it. It lets you work through the analysis piece of your models and work around a lot of security concerns by basically removing the individual from the data itself.
Speaker 1:For companies that do have to adhere to certain guidelines, can we explain a little bit more about that?
Speaker 3:There are requirements for securing data, and there are penalties for breaches and things, but I don't think anybody actually has to do differential privacy. I read the whole set of guidelines; that's the first thing I checked. It's not a requirement anywhere. It's more like: if you're going to implement it, make sure you're doing it properly, and here are some of the methodologies, which we'll get into with the technical details. But I don't believe anybody has to do differential privacy. The US Census is using it now; it's a way to help make representative data that is still private for each individual. But to my knowledge there's no legal requirement anywhere.
Speaker 1:That's a big difference, because a lot of what we deal with is everybody coming into regulation. But just to this point, even in this discussion we've gotten to how taking care to follow some of the rules and practices of differential privacy leads to just better models, kind of like what we talked about at the top of the show. Want to get into some practical examples?
Speaker 3:Let's do it. I'll start off with NIST. The new document is a good read and a good primer on differential privacy; it even has a great discussion of how you would use synthetic data and the relationship there, like we mentioned. One thing I think is kind of funny, maybe they're just trying to relate to readers a little more, is that they use examples of people on and off the grid and their consumption of pumpkin spice lattes. I thought that was an interesting choice for a technical paper, but it made me chuckle and it makes for easy reading.
Speaker 3:So maybe they're trying to put on a more user-friendly face. It's a pretty funny example, but it worked well for describing that "it's as if you never existed, never put your data in": they talk about a data set containing someone who is hyper-connected versus someone who lives fully off-grid and the different pumpkin spice latte consumption each would have. A little funky, and they didn't cite Starbucks, so I don't know if that was intentional or not, but it was a great way of describing it, very descriptive and very illustrative, so it worked well for its purpose.
Speaker 2:Yeah, so let's talk now about how some of these guidelines recommend doing it, how we see this playing field, and how you could actually do this type of thing. Let's think about a continuous feature, age, for example. We want to add differential privacy to our age data. So we might look to Dwork, McSherry, Nissim, and Smith's 2006 paper, "Calibrating Noise to Sensitivity in Private Data Analysis," which introduced this idea of epsilon differential privacy, where epsilon is basically a parameter for how much noise we're going to add to this age column.
Speaker 2:That is, how much noise is acceptable. You can think of it this way: if we turn that epsilon value way down, we're going to add a lot of noise to the data, which gives you a lot of privacy, but you might lose some of the statistics of the data; whereas if you go for a larger epsilon, you're going to get something a little closer to the original data, though for some use cases it will still have enough noise that people's ages are basically anonymized. And when I'm talking about noise, I mean taking something like a Gaussian distribution and adding a draw from it to everyone's age. This is not dissimilar to how security researchers salt data before they hash it: throw a little bit of salt on the age data, and people's true ages are lost from the data. So that's one way of thinking about continuous features.
Speaker 3:And that paper specifically used the Laplace distribution, which is normally one of the most secure distributions you can use. NIST mentions a couple of options: you can use Gaussian, and there are some other variations as well for continuous features. If you're going to read one paper on this, that's a fantastic one to read, and it really describes the methodology. Laplace really is the gold standard; you just have to make a few tweaks for implementation. And if your data has categorical attributes, it's a different story when you need to add noise to something categorical.
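As a rough illustration of that Laplace approach on a continuous column (our sketch, not code from the paper), here is per-record, local-style noising of ages. The age bounds and epsilon values are assumptions; in this local setting the noise scale has to cover the whole clipped range of the feature, so individual released values get very noisy even though the column mean stays roughly intact.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw ages; the bounds below are assumptions and part of the privacy claim.
true_ages = rng.integers(18, 90, size=5_000).astype(float)
LOW, HIGH = 18.0, 90.0

def local_dp_age(age, epsilon):
    """Release one age under epsilon-local-DP with the Laplace mechanism.

    After clipping to [LOW, HIGH], one person's value can differ from
    another's by at most HIGH - LOW, so that range is the sensitivity.
    """
    clipped = min(max(age, LOW), HIGH)
    return clipped + rng.laplace(scale=(HIGH - LOW) / epsilon)

for eps in (0.5, 2.0, 8.0):
    released = np.array([local_dp_age(a, eps) for a in true_ages])
    per_record_error = np.mean(np.abs(released - true_ages))
    print(f"epsilon={eps}: true mean={true_ages.mean():.1f}, "
          f"noisy mean={released.mean():.1f}, "
          f"avg per-record error={per_record_error:.1f}")
```

The clipping step is what pins down the sensitivity; without known bounds on the feature, the noise scale cannot be calibrated.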
Speaker 3:There's actually a really good technique that people have started to use for differential privacy, and it goes back to 1965: randomized response. Basically, you flip a coin when you're interviewing someone. If it's tails, you respond truthfully. If it's heads, you flip a second coin: if that's tails, you respond truthfully; if it's heads, you respond with a lie. So it anonymizes responses when you're asking people for sensitive information in a survey setting. That's essentially differential privacy, and it's one of the methodologies you can use for categorical variables. What's interesting is that this is an old-school statistical survey technique that came along way before we were thinking about differential privacy, but it's highly analogous and works with the same concept, while helping to generalize beyond the case where everything is a nice distribution we can just add noise to.
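Here is a small sketch of that randomized-response scheme with made-up numbers: under the two-coin procedure described above, each respondent answers truthfully with probability 3/4, which gives every individual plausible deniability, while the analyst can still back out the population rate because the distortion is known.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sensitive yes/no attribute; the true rate of 30% is made up.
true_answers = rng.random(20_000) < 0.30

def randomized_response(truth):
    """Two-coin randomized response: tell the truth with probability 3/4."""
    if rng.random() < 0.5:          # first coin: tails -> answer truthfully
        return truth
    # first coin: heads -> second coin decides truth (tails) or lie (heads)
    return truth if rng.random() < 0.5 else not truth

reported = np.array([randomized_response(t) for t in true_answers])

# E[reported rate] = 0.75 * pi + 0.25 * (1 - pi), so invert to estimate pi.
estimated_rate = (reported.mean() - 0.25) / 0.5
print(f"true rate     : {true_answers.mean():.3f}")
print(f"reported rate : {reported.mean():.3f}")
print(f"debiased rate : {estimated_rate:.3f}")
```

The debiasing line at the end is the same property discussed next: the aggregate rate is recoverable even though no individual answer can be trusted.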
Speaker 2:Yeah, and these both have that nice property where, on average, you're going to have the same average as you had before. But individuals now have plausible deniability: you can say, well, that's not your real age, or that's not your real, let's say, gender, where a quarter of the data set has a flipped gender and you don't know which records those are.
Speaker 3:Yeah, so it's a great anonymization technique to help protect customers' data, and it's something I think the industry will eventually trend toward. As I mentioned, people will be able to have access to more data, and I think companies will start using this more; it helps with some of the privacy considerations and hacks and things. It's still not mainstream, but I think the NIST paper, released, was it the end of last year, will make it a more popular methodology. I don't really see much of a problem with companies just using it by default, and I think we're going to see more of them do that. It'll make their jobs easier and leave them a little less concerned; it might at least save them some money on their cybersecurity insurance, because you will always have that plausible deniability. It's just very important not to roll your own for these things.
Speaker 3:Use proven methods. And then the key aspect is whether you're using local or global (or central, as NIST calls it), those two methodologies, which comes down to who is trusted. If the individual is trusted and the curator is not, you need to add noise before the curator gets the data. If the curator is trusted, you add noise when someone accesses the information, and there are specific algorithms you can use for counting, summation, and so on. The US Census is the example of a trusted curator, but I would argue that a lot of companies should be using local, because consumers won't trust the company, so you should apply it on the front end. That's what a lot of the research from Google and some other companies has been used for, browser data and usage data and things like that. I think Apple does it as well: usage data is anonymized off of iPhones with local differential privacy.
Speaker 1:Okay, yeah, I was almost ready to ask the question, and I think the global-to-local distinction pretty much covers it. So, in a scenario where a company has bought third-party data, or they're acquiring data and have gone through their company's data privacy checks on the data they bought or are going to use, that's one, I'm going to put it in these terms, line of defense on the data you've acquired. But then, when you're actually in the process of training a model or creating a data set for a model, you're saying that's the local check, and something we're going to see people trending more towards? Or, if that's wrong, please clarify.
Speaker 3:I think we're going to see more people trending towards local, or towards using differential privacy in general, because unless you're some trusted curator, I personally wouldn't want all of this personally identifiable information from individuals sitting in my data store. That makes me concerned; if there's a hack, you're really in trouble, right? So I think more companies will want to move in that direction, and I personally think local is a better fit for most of them, because with the data they hold, they don't know if it's really Sid's, Andrew's, or Susan's data. However, they do know the trend of genders for the people on this call; they have that accurately, they just don't know whether it's Andrew, Sid, or Susan who's female. They don't know, and it doesn't matter; the information you need is still there.
Speaker 2:That's right, and we expect to see this trend even without guidelines and regulations. We expect people to do this because it's going to help them. Anyone who has worked with private or sensitive information knows it's just a lot harder to work with that data. So if you're working with data that's basically already anonymized by the time you get it, that removes a lot of the concerns and worries that can make doing this work take a lot longer.
Speaker 1:That starts us on some of our key takeaways from this. Anything else?
Speaker 3:I think the relationship with synthetic data, and how you use that as the basis for model training and things, is an area of active research and something I'm personally interested in exploring more: that relationship between synthetic and differential. Really, I'd like to see a future world where, for any of these large models, OpenAI, Google and crew, it's differentially private data that they're using whenever they're building these large systems. Large language models will be a little bit different, but I don't see many downsides. Synthetic data can have some downsides, and there are issues we've talked about, like how you capture the inter-correlation between variables, and of course you'll have some of that with differential privacy as well. But if you're having problems creating good synthetic data, differential privacy could be a way to still provide some of that security, because there are all these security considerations within a model itself, of how you can actually hack data out.
Speaker 3:Several studies have shown how you can actually get data out of, say, a deep neural network; it can memorize certain inputs and things. So there are ways adversaries can pull information out of a model that you wouldn't expect. So use differential privacy and some of these other techniques while still being able to get the categorizations of information, even using an epsilon that gives you more utility versus more privacy, and then train your model on that. There aren't that many downsides to exploring these techniques; it might be a tad slower. We'd like to see a lot of the modeling industry, especially when they're working with end-user information, go more in this direction versus using real PII.
Speaker 2:Yeah, and this could lead us to a future state, which I'd love to see, where it's more about cohorts and less about individuals. A lot of these targeted ads are targeted to the individual. If we just targeted the cohort instead, then we lose the individual piece: we don't have to track individual people, people's privacy is better maintained, and the advertising companies, well, they're going to do what they're going to do, but at least they're not attached to us directly; they're attached to some type of cohort model.
Speaker 2:That's a great point.
Speaker 3:I'd love to see that as well. One other thing I forgot to mention that's relevant here: differential privacy does not prevent bias. You actually have to be careful, because how you add noise could create bias, especially as you think about the cohorts Sid mentioned. So to be clear, differential privacy does not mean you don't have to do bias testing or bias validation. To my knowledge, there are no current differential privacy methods that are bias-aware in the way we've talked about with multi-objective modeling and making sure your models aren't biased.
Speaker 3:Disparate impact, sorry, not disparate impact; differential privacy doesn't have a direct relationship with that. But when you're adding noise to something, you have to be checking. The key takeaway is that doing differential privacy doesn't mean your data is now more fair; those are separate things, and you still need to consider them both. You could theoretically have more fair information once it's differentially private, but they're separate processes. I'd love to see more research coalescing around how to combine synthetic data, de-biasing of data sets, and securing them with differential privacy. Those three combined, I think, could be something really special, and I hope we keep moving in this direction of less biased, more representative, and more secure data.
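One way to see that caution about noise and bias, sketched with made-up cohort sizes and an assumed epsilon: the same Laplace noise that barely moves a large group's count can shift a small group's count by a meaningful fraction, so rates computed from the released data can drift more for minority cohorts, which is why bias testing still matters after differential privacy is applied.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical cohort sizes: one large group, one small group.
true_counts = {"group_a": 9_000, "group_b": 150}
epsilon = 0.05  # a fairly strict budget, so the Laplace noise scale is 20

for group, count in true_counts.items():
    # Average relative error of the noisy count over many simulated releases.
    noise = rng.laplace(scale=1.0 / epsilon, size=10_000)
    rel_err = np.mean(np.abs(noise) / count)
    print(f"{group}: true count {count}, mean relative error {rel_err:.1%}")
```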
Speaker 1:Indeed, and it sounds like that would also be a good episode for us on training data in general, to tie all those together.
Speaker 3:Yeah, I think that'd be great, because we've talked about all these pieces separately. That'd be a great podcast to tie them together.
Speaker 1:Well, I think that's it for this week's topic. Thank you once again for joining us today, and thanks to those of you who have been loyal listeners to each episode. We will be a little bit slower this summer, probably one episode every three weeks, but we promise we will keep them coming. And if you have any questions, please visit our homepage; the feedback form is open. Until next time.