mnemonic security podcast

ML Engineers these days

July 01, 2024
mnemonic

Have you ever worked alongside a machine learning engineer? Or wondered how their world will overlap with ours in the "AI" era?

In this episode of the podcast, Robby is joined by seasoned expert Kyle Gallatin from Handshake to enlighten us on his perspective on how collaboration between security professionals and ML practitioners should look in the future. They discuss the typical workflow of an ML engineer, the risks associated with open-source models and machine learning experimentation, and the potential role of "security champions" within ML teams. Kyle provides insight into what has worked best for him and his teams over the years, and provides practical advice for companies aiming to enhance their AI security practices.

Looking back at our experience with "DevSecOps" - what can we learn from and improve for the next iteration of development in the AI era?

Speaker 1:

Kyle Gallatin. Welcome to the podcast. What's up? It's good to be here. If you're wondering why I'm dressed like this, it's because we have a Hawaiian party afterwards, a hula, hula, hula thing. I guess it's not the first time that you know that I'm a party guy. We had some good times together in Switzerland.

Speaker 2:

We did yeah.

Speaker 1:

So thank you for that. That was awesome. And you have substance, Kyle - and not just substance.

Speaker 2:

You have like AI machine learning substance. So extra points there.

Speaker 1:

It's like my whole thing. It's your whole thing, yes. So you're a published author of - let me read it - the Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning, and you wrote that a while ago.

Speaker 2:

You didn't jump on a bandwagon. Yeah, no - I'll say that I am a co-author, so I wrote the second edition. The first author did a lot of work so I didn't have to. But yeah, I had that out last summer, actually.

Speaker 1:

Cool, cool. Has your life changed at all since ChatGPT came out, or is this like nothing new for you?

Speaker 2:

As someone who builds solutions with AI, there are a lot more generative AI and large language model solutions that I'm building into products on an everyday basis, and a lot more things that I have education initiatives around, because people want to learn about these things and that's part of what I do as well. So there's been a lot to do, yeah.

Speaker 1:

Yeah, I guess one positive benefit is people understand maybe a little more about what you're doing than before, like, now they have a little bit of context.

Speaker 2:

I guess, right. Exactly - everyone's involved.

Speaker 1:

You have worked with a bunch of awesome companies: Pfizer, where you were a data scientist and machine learning engineer. Then you went on to Etsy, where you did software engineering and machine learning, and now you're with Handshake, which, at face value, seems to be a platform to help students connect with potential employers, but under the hood, I guess, just runs on a bunch of code and algorithms that you're a part of. Does that sound about accurate?

Speaker 2:

That's a perfect description.

Speaker 1:

Is there anything else you want to add to your background?

Speaker 2:

I used to be a biologist. Yeah, I saw that. So I have a useless master's in molecular biology. But yeah, you pretty much summed it up. It's all data scientist, machine learning engineer, software engineer and everything in between.

Speaker 1:

Yeah. So for today's conversation, I have a theory that's been in my head recently: that security isn't thought about too much in the AI/ML world yet. I want you to tell me whether that's true or not throughout this conversation, and to figure out where we are when it comes to your world and security.

Speaker 2:

For sure, it's probably not honestly thought about as much as it is in some other places. I mean, the attack vectors for ML are new in some ways and very different. On one hand it's software, so you have the same kind of security concerns you'd have building any application - you want to protect endpoints, you don't want to expose things publicly, all that kind of stuff. But at the same time, the ways that people are attacking ML are kind of new, and it's not something people often think about when they're building an ML application. I think it's definitely not top of mind.

Speaker 1:

What has security meant for you?

Speaker 2:

It's meant a lot of different things over time, because you have all the classic aspects of data security and software security, and then you have the new things of ML security. For me it's meant, of course, data security - you're working with data that's sometimes of a sensitive nature, if it involves users or, you know, their healthcare records, things like that. There are a lot of rules and regulations and specific ways you have to access that data so that you're not putting individuals at risk, or potentially even identifying them in your own work. You can't join two data sources that might give you too much information; there are a lot of rules around that. The data needs to be accessed in a very safe and access-controlled way. For software it's been classic software things: you're building an application for a company, and you want to be careful about who can access it and how they access it. That means don't accidentally build things and expose them to the public internet if they're sensitive, make sure you have role-based access control, and make sure things live within certain VPCs if they need to stay within that VPC. That's what it's meant there most of the time.

Speaker 2:

And then on the machine learning side, there are things to think about, but I don't think we've thought that much about them yet, or had to. There are attack vectors for machine learning models. Mostly it's around stealing IP - if you had access to a machine learning model or a model endpoint, you could reverse engineer the model or get it to leak data. Or there are the common attacks people deal with, like against ChatGPT, to get it to do stuff it shouldn't do. But yeah, there are some interesting attack vectors for machine learning too.

Speaker 1:

Yeah. So before we got to machine learning, everything you said sounds like you've actually worked properly with security - you named privacy, identity, the whole set of guardrails around accessing things. But from your answers it seems like the machine learning part is kind of new, and you haven't been bombarded yet with security people trying to make you do things differently, I guess.

Speaker 2:

Yeah. I think they want us to do it the safe way for data and software, in my experience, and that covers maybe 95% of what you have to think about for an ML application. But there are new things with ML that do open up new, modern attack vectors that I think are going to be interesting to see over the next few years. We saw a lot of people, you know, breaking ChatGPT; I think there'll be similar things for a lot of different large models that get deployed over the next few years.

Speaker 1:

What about your relationship with security people over the years? Do you share the same sort of vibe that I gave off - security people who've just been on the outside, trying to come in and boss people around - or how have your interactions with them been?

Speaker 2:

So it's actually varied. Sometimes there's a good partnership, and sometimes there isn't. There have been times where, you know, you bring in security at the end of a project - you're about to deploy an application - and they're like, dude, what are you doing? You didn't do this, you didn't do this, you didn't do this. There are all these rules we have for deploying an application and you haven't done any of them. And now the product's delayed two weeks, and it's like, yeah, I know, super painful, but we also didn't tell you, whatever.

Speaker 2:

At the same time, I've had security teams who are basically partners. They're like, look, we're not here to slow you down, we're here to do things in the safest and most efficient way possible - the best balance of those two things. I think some of the security folks I work with today are a great example of that. They're not slowing you down, they're not an unnecessary burden. They're really just there as a resource to help you build whatever you're going to build and make sure it's built safely, without undue burden on the developers.

Speaker 1:

That's probably the nicest way I've ever heard somebody who has worked with development portray a security person. But I think that's because the places you've worked are developer-centric environments - they're engineering teams, so there's some level of respect and a "we're all in this together, making decisions" kind of vibe. Is it different in places where it's non-engineering?

Speaker 2:

I mean, at a larger company like Pfizer, we weren't necessarily an engineering team, and so I felt like there was some stress with security sometimes, just because we didn't work with them closely - we didn't have a pattern of working with them. So when they would come in, it was like, here are things we have to do now.

Speaker 1:

And do you feel like those security people, in your recent interactions, understand how your world works, or is it kind of like you have to teach them something new every time?

Speaker 2:

I think most of the security folks I've worked with in the past few years are pretty tuned in to the development environment and kind of get what's going on. They understand the applications we're building and, honestly, they even understand ML, and most of the time the folks in the engineering orgs are also well-versed in cloud infrastructure too. So it's often a pretty seamless conversation, where they understand, honestly, more than I thought they would about what I'm trying to do and achieve and deploy.

Speaker 1:

So there are pretty much only security people listening to this podcast, right? So if I asked you, what do ML and AI mean these days? I've understood that there are these app stores - but they're not apps, they're like model stores, I guess - and then you go out and pick and take from certain things, and that helps you build whatever you're trying to build. Can you explain that process, the app store for models?

Speaker 2:

Yeah, for sure. I guess you could call it a model registry, or maybe a model catalog - there are probably a bunch of different terms. But there are companies and websites - Hugging Face, for instance, is a big player in that space - where you can go onto their website, sign in, and folks all over the world are registering large language models or computer vision models that are pre-trained on some task, and that you can either download and fine-tune for your own use case, or download and just kind of use off the shelf.

Speaker 2:

So an example would be Facebook's Llama large language models, which are GPT-like models. You can use them for conversational things, you can use them to generate text, a lot of different things. You could go on Hugging Face, download them, and just start running them yourself if you wanted your own local ChatGPT, or start training them to do other stuff for your specific use case.

Speaker 1:

Cool. And that's kind of just out of the goodness of the hearts of whoever put those models there, right?

Speaker 2:

Pretty much, honestly. For the most part it's in the spirit of open source - people just giving back to the community so that everyone can benefit from it.

Speaker 1:

Yeah. And what risks, if any, do you see with that sort of economy and way of working?

Speaker 2:

It's probably pretty similar to the risk you see with any open source thing, right? You have a risk of either accidental or deliberate vulnerabilities being built into large open source tooling. Tons of people are going on and downloading Llama models from Facebook - maybe that's a legitimate source, it's Facebook, you trust it - but individuals are also uploading models for more specific use cases, and they could hypothetically build weird little back doors into those artifacts. Sometimes you're just downloading a large binary file, so theoretically anything could happen. So it's kind of the same thing you go through when you're looking at an open source project: do I want to use this? Is it safe to just pull down this code and start running it locally and hope it does what someone said it was going to do?

Speaker 1:

So you're aware of that problem, right? You said HubSpot - do they actually do an SBOM of those components? What sort of level of security do they put on it for you?

Speaker 2:

It's called Hugging Face, and I'm actually not 100% sure what their security setup is. It'd be interesting to go on the website and check it out. I'll be honest, sometimes I'm just blindly downloading large bin files that contain machine learning model weights, and it's a process - I have to load them into a model and do some stuff in Python to get it running. So maybe there's not as much risk as I'm thinking here, but you're still just downloading stuff from the internet.

Speaker 2:

Of course there's some risk associated with that, and I assume they have some process to say, all right, you can't just upload this clearly hacky thing.
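
For context, the "download some weights and do some stuff in Python" step described here often looks roughly like the sketch below. It assumes the Hugging Face transformers library, and the model name is just a small public placeholder, not necessarily anything Kyle's team uses.

```python
# Rough sketch of the workflow described above: pulling pre-trained weights
# from the Hugging Face Hub and running them locally. "gpt2" is a small
# public placeholder model, not one mentioned in the episode.
from transformers import pipeline

# This call downloads the model config, tokenizer and weight files from
# huggingface.co and caches them locally - i.e. "just downloading stuff
# from the internet", which is exactly the trust decision being discussed.
generator = pipeline("text-generation", model="gpt2")

print(generator("Security and ML teams should", max_new_tokens=20))
```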

Speaker 1:

No judgment from my side, just so you know. The whole point, again, is just to get security people to understand how you work, because we all understand you have deadlines, you have shit to do - we wouldn't have a job if you guys weren't out there making a product and shipping it, right? So it's mostly just to understand and get everybody on the same page, to avoid this whole developer-versus-security clash again.

Speaker 2:

Yeah, and I'm not here to lie about it either. I think that for many folks - ML practitioners especially, I'm sure - security is a little far back on the list of what's going on. The machine learning world moves so quickly that you're constantly installing new libraries from the internet, constantly looking at new open source projects and pulling them into your workflows, downloading new models and then trying to train those. So the amount of new code and files and things coming from different sources into a machine learning practitioner's workflow is pretty high-volume compared to someone who's, say, just maintaining some older app or something like that.

Speaker 2:

There's a lot of stuff there, and there are just so many tools for deploying Python notebooks and things like that. Compute as a service is a big thing within the machine learning world, because you have to train these big models, so you have companies that will offer platforms to easily write code, easily write machine learning, and then deploy it in the cloud. If you don't secure those, people are going to find those endpoints, and if you don't change the password, they're going to log in and start mining Bitcoin, because you just deployed something - like a Python notebook - unprotected to the internet.

Speaker 2:

People can see that, find that, and then all of a sudden you get a notice from Google, which, right, is great. I have seen this happen, where Google emails you and says, hey, you deployed something and someone's mining Bitcoin.
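
As a concrete illustration of that failure mode, locking down a notebook server is mostly a configuration exercise. This is a minimal sketch assuming Jupyter Server's config file conventions; exact setting names vary across Jupyter versions, and it isn't specific to any platform mentioned in the episode.

```python
# Minimal sketch of a Jupyter Server config (~/.jupyter/jupyter_server_config.py)
# that avoids the "open notebook endpoint on the internet" failure mode.
# Setting names vary between versions; older releases use c.NotebookApp.*.

c = get_config()  # noqa: F821 - injected by Jupyter when it loads this file

c.ServerApp.ip = "127.0.0.1"      # bind to localhost only, never 0.0.0.0
c.ServerApp.open_browser = False
# Leave token authentication on (the default), or set a password hash with
# `jupyter server password` - don't blank the token just to make access easier.
```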

Speaker 1:

By the way, can you give us some context for how many projects or how many things you download, just so we have a clear understanding of the threat surface of a typical day or a typical week at work for someone in your position?

Speaker 2:

Yeah, it's definitely going to vary week by week. But on the library side, there are a bunch of new large language model and generative AI Python libraries right now, so it might be five new Python libraries - just pip-installing things and going, oh, what is this, let me try it out and see if it works. And then on the model side, say I'm doing something where I want to build an internally hosted large language model for a different task, I might go on Hugging Face and download five different Llama versions, or five different versions of some other model, and then evaluate them and see which one works best for my task. So I'd say that would be the threat surface: five libraries, five models - that's quantifiable.
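
The "download five models and see which one works best" loop tends to look something like the sketch below. The model IDs and the toy prompt are placeholders, and a real evaluation would use a proper dataset and task-specific metrics.

```python
# Rough sketch of the "pull several candidate models and compare them" step.
# The model IDs and prompt are placeholders; every iteration downloads another
# set of third-party weights, which is exactly the threat surface in question.
from transformers import pipeline

candidate_models = ["distilgpt2", "gpt2", "gpt2-medium"]
prompt = "A security champion on an ML team should"

for model_id in candidate_models:
    generator = pipeline("text-generation", model=model_id)
    result = generator(prompt, max_new_tokens=20)[0]["generated_text"]
    print(f"{model_id}: {result!r}")
```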

Speaker 1:

Yeah, but a lot of those libraries and models also have underlying things that go to other places, right? So the attack surface is each of those.

Speaker 2:

Yep, they all have their own dependencies, of course.

Speaker 1:

And you usually run this stuff on your own, on the same PC? It's not like you have OpSec where you download things here, scan them, and then pull them over - this is all just boom, boom, going quickly, right?

Speaker 2:

Yeah, it depends. But a big thing with machine learning is experimentation, because you're trying a lot of things out. You're not always writing source-controlled code that's going to end up in a GitHub repository - you might just be iterating in a Python notebook, either locally or attached to some GPU instance in the cloud. In those instances you don't have automated checks in GitHub for security vulnerabilities and all that kind of stuff. You're just installing stuff on some instance, whether it's local or in the cloud, and then experimenting as quickly as possible to try and get the best model. So there might not be as many scans and things like that as there are in a nice source-controlled repository with a nice dependency-updating and release process, that kind of stuff.
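
One lightweight way to get some of those repository-style checks back in an ad hoc notebook environment is to audit whatever ends up installed there. This is a sketch, assuming the PyPA pip-audit tool is installed in the same environment.

```python
# Sketch of auditing an ad hoc experimentation environment for known-vulnerable
# packages, since there is no CI pipeline doing it for you. Assumes the PyPA
# pip-audit tool is installed (pip install pip-audit) alongside everything else.
import subprocess
import sys

# `python -m pip_audit` checks the packages installed in the current
# environment against public vulnerability advisories.
result = subprocess.run([sys.executable, "-m", "pip_audit"], capture_output=True, text=True)

print(result.stdout)
if result.returncode != 0:
    print("pip-audit flagged issues - review them before relying on this environment.")
```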

Speaker 1:

So, to combat those things, would you agree that it's kind of on your shoulders - as in, your teams' - to understand the threat surface and apply that to your day-to-day? Is that the best way of going about it, kind of like software security?

Speaker 2:

I think so. I think education just needs to happen on both sides. Security folks are going to have to become more well-versed in the new attack surfaces and the new processes in ML workflows that introduce risk, and then, conversely, they're going to have to educate the ML practitioners: within the confines of the work you have to do, here's the way you can do it safer and better, and here's some tooling we can provide you to do it safer and better. So the more education there is on both sides, the more we can work together to achieve the goals, since we're both here to do the same thing - usually we're working for a company, trying to make the company succeed.

Speaker 1:

And so how is that education actually happening in reality?

Speaker 2:

I feel like the intersection between ML-driven applications and security is there, and it might be more about overcoming that last hurdle - ML folks being educated and security folks being educated - rather than it being its own field of talk and study.

Speaker 2:

Because, at the end of the day, ML folks are deploying applications. Yes, there's ML, yes, there are some new attack vectors, but you're trying to protect software and data, which security folks have been doing forever and companies have been trying to do forever. So as long as security understands the workflows that ML practitioners need to have - for instance, their data access pattern is going to be different, because they might need to access production data for experimentation in a way that other workflows typically don't - and ML practitioners understand that there are regulations around the data and regulations around the software we deploy, and have enough software engineering knowledge not to deploy some kind of general compute engine to the internet, then as long as each side understands those things, I think we're good.

Speaker 1:

It kind of sounds like you're saying that it's kind of like security champions, but just ML champions. We need a new word for that right.

Speaker 2:

Yeah, maybe we do. Is that the term security champions?

Speaker 1:

A security champion just means you have a developer who knows what they're talking about when it comes to security, and they kind of help the people around them get on the bandwagon of, hey, security wants us to do this, let's do it like this so we satisfy their needs, but also so we don't have to stress about changing ourselves too much, right?

Speaker 2:

I definitely think so. And I would say a theme within the ML world is that - I'm a very applied ML person, very software-focused - but there are many ML folks coming from PhD backgrounds who are more theoretical and have less experience building software applications at scale. They're used to working in more sandboxy environments with kind of toy data, or maybe data from their postdoc or whatever, but not working in a very secure setting in the cloud, and they often don't have as much software engineering experience. So part of my role in the past has been helping to educate folks: all right, here are good practices for developing software with ML. I know you know the ML and you know the math, all that kind of stuff - here's how we build software with it in an enterprise setting.

Speaker 1:

Yeah. You're like a real, "I'm here for business, I'm trying to make something happen" type, right?

Speaker 2:

Yeah, exactly.

Speaker 1:

Yeah. And just while you're there, what are the nuances between somebody playing with the data and actually building something in a live environment? I guess security is one of them, because you actually have people trying to get into your stuff and do nefarious things with it. Is there anything else that's notable to mention?

Speaker 2:

I would say it just goes back to that experimentation-focused workflow again. What needs to happen is that ML practitioners need to rapidly try a bunch of things in a sandbox environment. They need a Python notebook with access to potentially production data, they need to install a bunch of stuff in there, and they need to run a bunch of stuff on it to see how an ML model might perform. That workflow hasn't historically been typical, right? There have been nice access-control patterns that didn't have to deal with these ad hoc experimentation things being run by individuals on certain data sets, and I think that's kind of new for some places.

Speaker 1:

I've been asking my clients, do you have somebody that works with AI and ML, that sort of development team? Yeah. Do you know them? No. And I was like, okay, why not? Why don't you know them? And they're like, yeah, well, we haven't had the reason to. But it sounds like if a security team wants to start that conversation, understanding workflow is probably a really good place to start, because you can just shut up and see how they're doing things today and highlight things. What sort of advice would you give to a security team that wants to sneak their way into your world?

Speaker 2:

Yeah, I would say definitely train an ML model. If there are code labs or quick starts or something for training a model internally at your company, try those and see what the process is like.

Speaker 2:

Empathy is the most important thing to have when you're trying to collaborate cross-functionally in any environment, so understanding the other person's workflow - the things they'll have to do to achieve their goal - is going to make that conversation a lot smoother. It becomes, "I know you have to do this, so let me provide you with a secure way of doing it," as opposed to, you know, the friction you're referring to, where every now and then security has to come in and say no, you cannot do this, and that can push timelines, block projects, all that kind of stuff. So if there's understanding and empathy on both sides, and folks collaborate from the outset so that both teams understand what the other is trying to do, I think it's going to be a more beneficial collaboration between those two groups.

Speaker 1:

If only the world had more empathy.

Speaker 2:

It's about empathy at the end of the day.

Speaker 1:

It is, it really is - about security. Well, Kyle, I've asked everything that I was wondering about. I've really enjoyed it. Do you have any last words, anything that you're looking forward to in the near future?

Speaker 2:

I'm looking forward to collaborating more with security folks. I have a huge respect for, and empathy for, security. I've always loved cybersecurity and thought it was really fun, and so I'll just let everyone - every security folk out there - know that I'm trying to be that champion.

Speaker 1:

Cool. Well, Kyle, thank you so much. They're not, like, fixing your house outside, are they? What's all the jackhammering going on?

Speaker 2:

Dude, they're just destroying the sidewalk. Every corner of the street is being completely torn up, right? I really hope it's filtered out, but yeah, it's pretty loud. If I wasn't already awake for this, I would have been woken up by it.

Speaker 1:

Yeah, right. Somebody with their machine learning model will fix it - somebody from Adobe, hopefully. Yeah, I know, I really hope so. If not, I will listen to this over again, and our second version will be much better than the first, because I'll actually understand a little bit of the life you have.

Speaker 2:

So I'm perfectly happy to do it again another time. For you too - thank you so much, sir, and enjoy your weekend.

Speaker 1:

When that time comes, I'm going to enjoy this Hawaii party. Yeah, dude, have fun. Sounds great, thank you. All right, man, take care. Ciao.

Speaker 2:

Talk to you later.
