DSO Overflow

Ep10: Security Chaos Engineering

May 09, 2021 Season 1 Episode 10

Join us to explore what Security Chaos Engineering is with two of the leading figures in the field, Aaron Rinehart and Kennedy Torkura.

If you missed the Gathering, watch the meet-up here.

References: Aaron Rinehart
Chaos Engineering: System Resiliency in Practice
Security Chaos Engineering

References: Kennedy Torkura
Security Chaos Engineering for Cloud Services
From Dependability to Resilience → Security Chaos Engineering for Cloud Services
Risk-Driven Fault Injection: Security Chaos Engineering for the Fast & Furious

Contact Details:
Aaron Rinehart: https://www.linkedin.com/in/aaronsrinehart/
Kennedy Torkura: https://www.linkedin.com/in/aondona/

Your Hosts
Michael Man: https://www.linkedin.com/in/mman/
Glenn Wilson: https://www.linkedin.com/in/glennwilson/

DevSecOps - London Gathering
Keep in touch with our events associated with this podcast.

  • https://www.meetup.com/DevSecOps-London-Gathering/
  • https://twitter.com/DevSecOps_LG
  • https://www.youtube.com/c/DevSecOpsLondonGathering
Transcript

MM: [00:00:00] Okay, Glenn, I'm quite excited about this episode of our podcast. We've got two wonderful guests, Aaron and Kennedy.

GlennW: [00:00:08] Yeah, so this is a follow-up from our DSO London Gathering online event last Wednesday. It was one of the most exciting events we've done, I think, and it's great to have both Aaron and Kennedy here.

So Aaron and Kennedy, whoever wants to go first, please introduce yourselves.

Aaron: [00:00:24] I guess I'll go first. I'm Aaron Rinehart. I'm the CTO and co-founder of Verica. I co-founded Verica with the creator of chaos engineering at Netflix, Casey Rosenthal. Prior to that, I was the chief security architect at UnitedHealth Group, where I wrote the first open source tool that applied Netflix's chaos engineering to security. I'm also the O'Reilly author on the topic of Security Chaos Engineering. I'll hand it over to a good friend of mine, Kennedy, who was a contributing author on the O'Reilly book, and let him introduce himself.

Kennedy: [00:00:55] Yeah, thank you so much. My name is Kennedy Torkura, and I am currently a cloud security engineer at Mattermost. Before that I was working at data4life, a company in Berlin, as an information security engineer. And prior to that, I had spent quite a number of years in academia doing my doctorate in cloud security. That was where I actually started researching Security Chaos Engineering: we had to solve certain problems and there weren't existing solutions for them.

We sort of started playing around, and then we realized later on that what we were doing is called Security Chaos Engineering. So that's the short story about me.

GlennW: [00:01:38] So for our listeners, what is Security Chaos Engineering?

Aaron: [00:01:42] It's easier to explain what chaos engineering is first, and then what security chaos engineering is. I'll give my own definition. There's a Netflix definition, but my own is: the discipline of proactively introducing turbulent conditions, faults, or failures into a system to try to determine the conditions by which the system will fail, before it actually fails.

The key component is this. I gave an example at the meetup that I think resonates with people really well: the idea of a legacy system. Legacy just means it makes all the money for the company. That's what it means. It's the thing that, when it goes down, people get upset. It also rarely goes down. We feel comfortable with it and competent about it. The engineers feel they understand the system, and the documentation kind of reflects what it really is. That condition never quite holds, but we feel comfortable with the system. This begs the question: was the system always that way? Was it always so well known and stable, with everyone feeling so confident? Probably not. It became that way because we learned, through a series of unforeseen events, what the system really was versus what we thought it was. Over time we learned that we didn't have something configured right, or placed right, or we didn't have enough of one thing or another, and we started to learn what we really needed in order to become stable.

But the process of those unforeseen events, the surprises, incidents, outages, and breaches, causes customer pain and wrecks our productivity. Chaos engineering is a proactive way of walking through that exercise without the customer pain. We introduce conditions that we expect the system to handle: I expect that under this much latency, or when this kind of misconfiguration occurs, the security control fires, or auto-scaling kicks in. We believe these conditions to be true. Then we introduce them into the system and try to ascertain whether they actually are. Very rarely do chaos engineering or security chaos engineering experiments succeed, and I'll let Kennedy chime in on this: we're almost always wrong about how we think the system works. Kennedy, what are your thoughts?

Kennedy: [00:04:20] You mentioned a lot of very valid points. To me, Security Chaos Engineering is all about introducing security faults into a system and seeing how those faults impact its security attributes, and by this I specifically mean confidentiality, integrity, and availability. You want to understand it in a pragmatic way. You don't just want to assume, or run some kind of simulation; you want to understand it from a very practical standpoint, with evidence.

So you're introducing those security faults and observing not just whether they affect security, but to what extent: what is the magnitude of the impact on your systems? Then you can carry that knowledge and reuse it, usually to harden your system or to plan for the future. As security people we always have limited resources, and sometimes you cannot do everything; you have to consider things like your attack model and what kinds of attacks you're trying to guard against. So you want to carry that information and use it to harden your system, or to plan for how to do that in the future.

Aaron: [00:05:52] I just want to highlight something Kennedy brought up. Charles Nwatu now leads governance, risk, compliance, audit, and the automation of those things at Netflix; prior to Netflix, he was the chief information security officer at Stitch Fix.

He came to me when I was still at UnitedHealth Group and said, "Hey, Aaron, I love this security chaos engineering thing. I'm interested in learning more and applying it." Because the thing is, as Kennedy said, we have to do so many things, and we have to do them right. But how many of them do you actually do well, and how do you know it? That's the question. It's about doing less, better, in a way.

MM: [00:06:34] Is Security Chaos Engineering just another buzzword? How is it different from the general practice of running an incident-response war-room scenario, or the concept of purple teaming, where you have the operational analysts and your internal attackers, your red teams, working together to simulate attacks, determining whether the system is going to break or not, and monitoring, making sure all those monitoring use cases are in place? Is there a difference?

Kennedy: [00:07:07] A lot of people ask this question, and it's kind of logical, because whether you're a newbie in security or you've been in it for many years, we have seen a lot of evolution in security tooling and security controls over time.

Some of that evolution has been triggered by the kinds of systems we build. From the olden days of three-tier applications, we moved to microservices, Kubernetes, containers, and cloud architectures, and security has had to move with these trends. Each new architecture comes with a new kind of problem, and for those problems you realize that either the existing security controls have to adapt, or you have to create something new completely from the ground up. Looking at what Netflix did: they had a unique problem with no solution at the time, so they thought creatively about how to fix it, because it had to be fixed right then for the business to survive. You might call chaos engineering a buzzword, but they were fixing a problem and needed to give it a name.

The term Security Chaos Engineering, which Aaron coined, just adds "security" to that so people can understand, and still it's not very clear when you hear it the first time. There's a lot of overlap with what other security tools and mechanisms do, red teaming, blue teaming, and all of that. But there are unique things in Security Chaos Engineering, and being proactive is the major difference. Other security tooling is largely reactive: it acts after an incident has happened, trying to fix it, understand it, and learn from it. By that time the enterprise has lost a lot of money and is losing trust from customers. Security Chaos Engineering tries to act before those problems happen, before there are incidents. From the point of view of what you gain, this is the major difference between Security Chaos Engineering and the security mechanisms that existed before.

Aaron: [00:09:29] That's an excellent explanation. I would just add a few other pieces. As a business, we're confused. What is pen testing anymore? What is red teaming? I thought we moved from red teaming to purple teaming; now we have red and purple, and I'm confused. The problem with red teaming was like the DevOps problem: the wall of confusion. The red team would go forth, and remember, red teaming used to be done on maybe the top 5% of the company, and a lot of the time it's regulation that's pushing and funding it. To be clear, we're also not saying don't do any of those things. Still do them. The more objective information we can get back through instrumentation, the better. Engineers don't believe in two things: we don't believe in hope, and we don't believe in luck. We believe in instrumentation. We believe what testing tells us, that it worked or it didn't, and why. That way we take that information, recalibrate, fix, move on, evolve, iterate.

The red team would attack, because they always get in somehow, and then throw a PDF over the wall to the blue team, and the blue team would say, what am I supposed to do with this? So we evolved to the purple team, which was supposed to fix the culture issues between red and blue: we're going to do this eyes wide open, you can see what we're doing, we're going to tell you what we're doing. In a way it's very analogous to clear-box pen testing. Pen testing is typically a technique within red teaming, and we're also seeing pen testing being automated. But pen testing, red teaming, and purple teaming for software are very difficult. When you're attacking Node.js, you have to know Node.js. If you're going to attack Java 6 versus 7, you've got to know Java 6 versus 7 and their particular problems, or .NET. It's very specific, and the problems with software are not the same as attacking Active Directory, or Exchange, or old-school corporate IT, which is what most of the red and purple type tools on the market address. We need to do red teaming and purple teaming and pen testing on corporate infrastructure, but the software is where we make the money, so we have to focus our instrumentation there, because our products are made from software. The laptops get us there, but it's the software that's really generating the dough.

Here's another thing that differentiates Security Chaos Engineering from purple teaming or breach-and-attack-simulation tools: we're not simulating a bunch of attacks, stepping through the different parts of an attack chain and seeing how the system does against a lot of activity. We introduce one failure into the system, because we're being conscious that, as Kennedy said, this is more of a distributed-systems problem. We're approaching it from a distributed-systems angle: large Kubernetes clusters, large cloud-native applications with thousands of nodes. Stepping through attacks sends a lot of data at the system; instead, we introduce one failure. Say the firewall is there to detect a misconfigured port. We introduce that condition: does the firewall actually catch it and stop it? Did something else catch and stop it? Did they both provide good log information? Did the log information generate an alert?

Once we do the experiment, and if it succeeds (remember, normally we're almost always wrong about how the system is working), it becomes more of a regression test on that system over time. It also means we expand scope: instead of this environment, we run the same experiment in another environment, because even different security groups within the same environment will give you different results. And I find it quite interesting that the dumbest security chaos experiments, the boring, simple things you think would never fail to get caught, almost always turn out to be revealing. They succeed, and you go, crap, I thought that was supposed to be caught. That's because there's a massive difference between how the engineers are building things and how security people are building things, and we have this lack-of-alignment problem. There are lots of differences, but it really comes down to this: we're injecting faults into the system rather than attacking it or trying to get in, and the goals are different as well.
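Aaron's "one failure" experiment can be sketched in a few lines: inject a single misconfigured port, then ask exactly the questions from the episode (did a control catch it, did it log, did it alert). This is an illustrative Python sketch; the `SecurityGroup` and `FirewallMonitor` classes and the allowed-port policy are made up for the example, not a real firewall or cloud API.

```python
# Illustrative sketch of a single-fault security chaos experiment:
# open one unexpected port and check detection, logging, and alerting.
# All classes here are hypothetical stand-ins, not a real cloud API.

ALLOWED_PORTS = {443}  # assumed steady-state policy: HTTPS only


class SecurityGroup:
    def __init__(self, ports):
        self.ports = set(ports)

    def open_port(self, port):
        self.ports.add(port)


class FirewallMonitor:
    """Toy detective control: flags any port outside the allowed policy."""

    def __init__(self):
        self.logs = []

    def scan(self, group):
        violations = group.ports - ALLOWED_PORTS
        for port in sorted(violations):
            self.logs.append(f"violation: unexpected open port {port}")
        return violations


def run_port_experiment(group, monitor, rogue_port=22):
    """Inject exactly one fault, then ask the episode's three questions."""
    group.open_port(rogue_port)                 # the single injected failure
    caught = rogue_port in monitor.scan(group)  # did the control catch it?
    logged = any(str(rogue_port) in line for line in monitor.logs)
    alerted = caught and logged                 # did the logs yield an alert?
    return {"caught": caught, "logged": logged, "alerted": alerted}


if __name__ == "__main__":
    print(run_port_experiment(SecurityGroup({443}), FirewallMonitor()))
```

In a real environment the injection would be an actual security-group change and the checks would query the deployed detection tooling, but the shape of the experiment, one deliberate fault followed by explicit verification questions, stays the same.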

Kennedy: [00:13:56] Just to buttress what Aaron said: my entry point into chaos engineering was an academic prototype of a cloud security posture management (CSPM) system. These systems detect changes in the cloud, misconfigurations, drift, and so on. Building a system from an academic perspective, if you write a paper you normally have to evaluate it: show how effective it is and compare it with other systems. So we had a challenge. Normally in the security world, if you're testing, say, a web application scanner, you already have an established toolbox, things like Kali Linux or Metasploit. You just configure these tools and point them at the target, they send a barrage of attacks, and you can understand how it responds. What we discovered was that there was nothing that could help us test this environment.

So we more or less started to build a bunch of scripts, because we had ideas of what could go wrong, and we started launching those scripts against the CSPM. We didn't just launch the scripts; we made the attacks dynamic and played with the volume, say 10 attacks per minute or 20 per minute, so the system was tested against things that could actually happen when it is under malicious action. And I think this is the problem today with all the CSPMs and new systems coming up: how do you verify whether it's just marketing gibberish the vendors are selling you? When you deploy them, how do you know they deliver what the vendor has promised? This is a tangible question, and I can't think of any other way to get the answer apart from using something like Security Chaos Engineering.
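The evaluation setup Kennedy describes can be sketched as a small harness that fires misconfiguration faults at a configurable rate against a detector and scores its detection rate. Everything below (the `ToyCSPM` detector, the fault catalogue) is a hypothetical stand-in using a simulated clock; a real campaign would inject faults into live cloud resources and query an actual CSPM deployment.

```python
# Minimal sketch of rate-controlled fault injection against a CSPM:
# launch misconfiguration faults at a chosen rate (e.g. 10 or 20 per
# minute) and measure how many the detector flags. ToyCSPM and the
# fault catalogue are invented for illustration.

FAULTS = [
    {"kind": "open_port", "detail": "0.0.0.0/0 on 22"},
    {"kind": "public_bucket", "detail": "acl=public-read"},
    {"kind": "weak_tls", "detail": "TLS 1.0 enabled"},
    {"kind": "no_mfa", "detail": "root login without MFA"},
]


class ToyCSPM:
    """Stand-in detector that only recognises two fault kinds."""

    KNOWN = {"open_port", "public_bucket"}

    def inspect(self, fault):
        return fault["kind"] in self.KNOWN


def run_campaign(cspm, faults, rate_per_minute):
    """Schedule faults on a simulated clock and score detection coverage."""
    interval = 60.0 / rate_per_minute   # seconds between injections
    clock, detected = 0.0, 0
    timeline = []
    for fault in faults:
        timeline.append((round(clock, 1), fault["kind"]))
        if cspm.inspect(fault):
            detected += 1
        clock += interval
    return {"detection_rate": detected / len(faults), "timeline": timeline}


if __name__ == "__main__":
    print(run_campaign(ToyCSPM(), FAULTS, rate_per_minute=20))
```

The point of varying `rate_per_minute` is the one Kennedy makes: a detector that keeps up at a trickle of faults may fall behind under a realistic burst, and the measured detection rate is the evidence either way.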

Aaron: [00:15:58] That's why, at the end of every one of my presentations now, Kennedy, I come back to that quote from John Allspaw. John Allspaw brought resilience engineering to software from fields like medicine, nuclear science, and aeronautics. One of the things he tries to tell engineers is: we've got to stop jumping straight to solutions and start asking better questions. Don't just assume our answers are correct, because our understanding of the world may be wrong, and if you have the wrong information you're going to come up with the wrong answers. The idea is that we're trying to derive better information about the context of the system, so we can provide other people with that information. If we launch a vulnerable image in a Kubernetes cluster and Twistlock or Aqua or whatever tool you have is not firing, that's not the solution's fault per se: the environment changed a hundred times and the solution didn't. Where's the alignment? It highlights, in a graceful way, that some things we're doing are not really correct, or that we don't align well anymore with the way engineers work.

DevSecOps has really opened our eyes to the fact that this is an engineering discipline. The business value doesn't come from your understanding of risks; the business value comes from the engineering, and we're screwing up the engineering, making it more difficult for engineers to deliver. Security Chaos Engineering is directly an engineering approach. Another thing I like, and to be clear this wasn't a grand vision, it was an accident I found after I started experimenting with this stuff: compliance can be a by-product of good engineering instrumentation. I think compliance should be the by-product of good engineering practices, not the goal. With Security Chaos Engineering you're proving whether the security worked the way you thought it did, in a high-integrity way; label it with the right control framework and you've got a free, auditable artifact that proves it.

GlennW: [00:18:09] So the whole idea of Chaos Engineering, or Security Chaos Engineering, is that you understand what your secure, steady state is, then introduce a hypothesis that says your steady state will remain under certain circumstances, then you inject the faults or the errors to test that hypothesis, and then you've got the result.

It's like an experiment. So where red teaming and pen testing are testing exercises, this is very much experimentation, to prove that what you've done is what you think you've done. And therefore, as you said, Aaron, engineers are able to see what they're delivering, understand where they're not delivering value, and then fix that.
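The loop Glenn describes (verify steady state, hypothesize, inject, observe, roll back) can be written as a small generic skeleton. The function names and the toy open-ports system below are illustrative, not any particular tool's API; the one non-negotiable design choice is the `finally` clause, so the fault is always rolled back even if the check raises.

```python
# Skeleton of the chaos experiment loop: check the secure steady state,
# inject a fault, re-verify the hypothesis, and roll back no matter what.
# The callables are supplied by the experimenter; names are illustrative.

def run_experiment(name, steady_state, inject_fault, rollback):
    if not steady_state():
        raise RuntimeError(f"{name}: steady state not met, aborting before injection")
    inject_fault()
    try:
        hypothesis_held = steady_state()  # did the system absorb the fault?
    finally:
        rollback()                        # always restore the system
    return {"experiment": name, "hypothesis_held": hypothesis_held}


if __name__ == "__main__":
    # Toy system under test: a set of open ports, where the steady-state
    # check includes a compensating control that re-closes rogue ports.
    state = {"open_ports": {443}}

    def steady_state():
        state["open_ports"] &= {443}      # compensating control fires first
        return state["open_ports"] == {443}

    def inject_fault():
        state["open_ports"].add(23)       # the injected misconfiguration

    def rollback():
        state["open_ports"] = {443}

    print(run_experiment("telnet-port-misconfig", steady_state, inject_fault, rollback))
```

A hypothesis that holds becomes, as Aaron notes later, a regression test you can re-run in other environments; a hypothesis that fails is the finding.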

Aaron: [00:18:51] I think you explained it maybe better than either of us did; I'll let Kennedy figure that out. Getting to know Casey Rosenthal over the years really opened my eyes. When I first wrote ChaoSlingr, the Netflix team invited me out: "What are you doing?" And I said, it's still engineering. It's not a different part of the system; it's different attributes of the system, as Kennedy said, that we're addressing and instrumenting. Getting to know Casey, the creator of chaos engineering, I got to understand the real story behind Chaos Monkey and all the rest, and this stuff started making sense to me. Also because, as a company, the key driver for people coming to us (we've evolved chaos engineering a bit as a practice into something we call continuous verification) is cloud transformation. And it's not just the Silicon Valley companies; it's the banks and the big companies of the world as well as the startups. Some big companies are still transforming, still not fully in the cloud.

Chaos engineering was a great way for Netflix to get feedback loops as they built into the cloud, moving from sending DVDs in the mail to streaming services on Amazon. They started with Chaos Monkey, which brought down instances pseudo-randomly, and they were constantly learning that the things they had built weren't working: the retry logic, the circuit breakers. They started figuring out whether things were actually working. So I believe chaos engineering could be a great accelerator for cloud transformation in general, because people always do transformation wrong. I've done it four times at three different companies; I could tell you those stories, but that's too much time. Executives always give the wrong timelines; they don't realize it's going to take longer to learn. They always try to do too many applications at once, but it's not about how many you send to the cloud at one time, it's the rate at which you're building and shipping. They worry that they don't have the right people, so they hire Amazon's pro services to come do it for them, and then they don't learn. Really, engineers just need a feedback loop: I'm used to building it this way, now I'm building it this way in the cloud, it's different, I get it, I go forth and build it. Now I need a way to instrument what I built, to know it was right and that it works. This is a way of doing that: after all these things we've built, does the property we expect emerge from the system?

Security in particular is usually the heavyweight that cloud transformations are trying to get past, with the rest of the company waiting on the security teams to figure things out. This is a way for security teams to give their best shot: I think this is how we do it, this is how we build it, these are the rules we use. Then you introduce these conditions and verify, and it helps build confidence. We can do this, we're doing it, it worked, we were right, we were wrong, and this is why. This is engineering. This is how engineering gets done. Nobody's perfect.

Kennedy: [00:22:07] I think the key phrase there, which is really important, is feedback loops, because especially in the cloud you have this problem of loss of control. Everything is virtualized; you don't have a feel for what's happening in the cloud, and you don't want to wait until things go wrong, or wait for customers to tell you what's happening; that becomes a bad thing for you. So how do you get this information? There are a lot of ways of doing this these days: you get logs, you get observability, but you want something that gives you more accurate information.

As Aaron said, for cloud transformation, when you move to the cloud and deploy, you want to be confident that everything is working as you expected, and I think there is really no way of getting that confidence except by testing it. You test specific things so that you get specific levels of confidence. It's really about testing for specific things and getting that information back, which supports your confidence.

MM: [00:23:13] Okay. We've talked a lot about experimentation; in my mind it still maps to test cases, so it could be a terminology challenge I have here, but parking that aside: how does one actually define or derive an experiment? Aaron, you said earlier that the test cases, or the experiments, to use the right terminology, are typically quite boring. How would you introduce a company to this concept? What do they do, and how would you help them define their first experiment?

Aaron: [00:23:52] Our industry is horrific at this, right. Then we just had a whole conversation on red purple and there is a thing called the security color wheel [00:24:00] is like every color of the rainbows, a thing it's like, wow.

Really? Somebody actually reached out to me over there and asking me, so what do you look, how do you think Security Chaos Engineering should be? I'm like magenta. I don't know. Cyan. I don't know. I don't know. But you know, we like risk defined risk, very confusing, very complicated, very quickly. Right. You know?

Cause it, cause it come from our industry. Anyway so to answer your question, where  do good test cases come from? You know, it was very analogous to test cases. You spot on mic. But it's it's I like, what I have to say is past this prologue, you know meaning what has happened in the past is likely to happen again.

And so it could be the organization certain skills gaps. It could be the technology gaps. It could be that, you know if you do certain things, something through some, some things, one way something to do it another way some teams do things well, some things don't do some things. Well, the thing is, is that certain weird things kind of happened where laws existed and behave, but it's you know, so past incident data is a great place to start.

Like, you know, [00:25:00] usually when people get started Security Chaos Engineering, I'm learning about this from the beginning. I'm sure. Kennedy learnt from other companies too in Europe. But like final Capital One and Cardinal Health. They're they had major adverse public events. Put it that way before they started thinking differently from what I got it, you know, and a lot of companies, almost every other company doing it, they're like, Hey, we need to think differently about how we're doing this, but it's like, so, so if you've had a recent

or security potential breach or our incidents, you know you know, that information and, and, and things that you expected to be covered for in a system or, you know . So that's one way to think about it is if you have good data, I find that almost no one that has good data outside of sev ones. You know, a few, if you go through it's, here's a good challenge for the, for our listeners on here.

If you're an instance responder, or if you're not just asked to get access to the incident information, get approval to do it, but go look at it. It [00:26:00] is horrible. You can't tell what the hell happened, right. Like, because what happens is we don't really document it while we move on to the next incident. Like it's just revolving over and over again, you know?

So, but that's an area of improvement for the business, in my opinion. Right. For a startup to, to help out with that. Anyway so if it's not that then another good area to come from this, like is what is something that I know that we have built for, and that we, we know that we depend on it and we know that it's working, like, meaning like,  that's why we started with chaos slinger started with the misconfigured port, but solving for that thing for 30 years. Right. Like in the cloud, it's still happens. But like, we believe firewalls were our bread and butter, you know, we just believe we have the most competent firewall engineers and the cloud we're going to get it right.

And that was our thinking — everything else was the problem. So, okay, we introduced misconfigured, unauthorized port changes, pseudo-randomly, to security groups. And it turned out the firewalls didn't always catch it. So in summary: use past incident [00:27:00] data, or go with the things that you believe, beyond a doubt in your mind, you built right.

You know it works — that's what you want to test, because I'd bet it doesn't work, at least not in all cases. And I don't mean that because we're really crappy at what we do; it's the nature of the game we're playing. It's constantly changing. The system has to live; it has to constantly change.

Humans have to manifest security in that system, and we just can't keep up — the pace is different. So that's my take. Kennedy, your thoughts?

Kennedy: [00:27:34] Yeah, that was a great one. So I think I want to start from the more social approach, because I had the opportunity to speak with a transportation company in Berlin.

They're actually practicing chaos engineering — not security chaos engineering, but chaos engineering — and they came to me asking about the security part. But anyway, I like the approach, because what they did first was get a couple [00:28:00] of guys from different domains — architects, security, SRE, DevOps — brought these guys together, and they brainstormed, because they had the challenge of even bringing chaos engineering into the company.

So they had to fly people over for that kind of hands-on brainstorming: what do you think can go wrong? At the end of that meeting they had already come up with possible places to start, and then they went in and set up and prepared it — or maybe they had already prepared it, I don't know. That's a nice way to do it, especially in a company, because if things fail, maybe it's something another person mentioned.

So it kind of saves you all the explaining later. But apart from that, there are other ways. For example, I learned a lot from the people who practice fault injection, because fault injection is something that has existed for a [00:29:00] long time. It's an old practice.

It's practiced a lot, especially in cryptography and in the automotive industry. While I was studying, my second supervisor was into this domain of fault injection, and he was very excited when I told him I was practicing security chaos engineering. What I understood from them is that they understand the fault space — which, from a security perspective, the way I understand it, is more or less your attack surface.

What are the areas where attackers can come in, right? You've got to understand that, and from there you can begin to construct what you think might go wrong. There are a couple of ways you can do that: you can just brainstorm, you could use threat models, you could use attack graphs.

All of these can give you insights into where to start or how to start. Another way to start is coming from a compliance perspective. There are already [00:30:00] all of these best practices written down that you should follow — by cloud providers, by the Center for Internet Security.

And these are also ways where you could say, okay, how do I test whether these things are working well? As Michael said, you could construct a lot of test cases from these so-called best practices, or from these benchmarks.
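As a sketch of that idea — inverting compliance benchmark rules into security chaos experiments — something like the following could work. The rule and experiment formats here are invented for illustration; they are not taken from any real benchmark or tool:

```python
# Sketch: derive chaos experiment cases from CIS-style benchmark rules.
# Each rule describes a forbidden state; the experiment deliberately
# creates that state and hypothesizes that it gets detected.

BENCHMARK_RULES = [
    {"id": "sg-no-world-ssh", "resource": "security_group",
     "violation": {"port": 22, "cidr": "0.0.0.0/0"}},
    {"id": "s3-no-public-read", "resource": "s3_bucket",
     "violation": {"acl": "public-read"}},
]

def experiments_from_rules(rules):
    """Invert each compliance rule into an experiment that injects the
    violation and expects detection within a time budget."""
    return [
        {
            "experiment": f"inject-{rule['id']}",
            "target_type": rule["resource"],
            "inject": rule["violation"],  # the misconfiguration to create
            "hypothesis": "detected and alerted within 5 minutes",
        }
        for rule in rules
    ]

cases = experiments_from_rules(BENCHMARK_RULES)
print(len(cases), cases[0]["experiment"])
```

The point is just that a benchmark already enumerates "things that should never happen", which is exactly the list of faults worth injecting.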

GlennW: [00:30:25] Just picking up on your point there that fault injection is an old thing — I'm very fascinated by the moon landings from 1969 onwards. During the whole testing process, they actually introduced faults into the system deliberately to make sure that the error messages were correct.

Like the famous 1202 program alarm. This message came up as the lander was approaching the moon, and one of the engineers remembered it from the fault tolerance and fault injection experiments they'd [00:31:00] done weeks before and said, no, you're good to go. So yeah, it's been around for over 50 years.

If you think about it, NASA did it back then.

Aaron: [00:31:08] You and I had this conversation — and I did listen to the recordings you sent me. I mean, I worked at NASA for four and a half, five years; I actually worked in safety and reliability engineering. So it's not just that we did FMEA — we did fault trees too. They do elements of fault injection, but the software is not like the software we have today; it's done a bit differently. There are mathematical models that represent the faults — they have software that will generate what they think the faults are, based upon the inputs. Chaos engineering is, I guess I would say, sort of another extension of that.

GlennW: [00:31:50] You've both been involved in software that supports chaos engineering. So Kennedy, you were involved in a product called CloudStrike — C [00:32:00] L O U D — not to be confused with another tool.

Aaron, you've mentioned ChaoSlingr a couple of times. So Kennedy, starting with you, could you explain what CloudStrike does and how it might help people who want to do chaos engineering?

Kennedy: [00:32:13] Yeah, sure. So CloudStrike was designed specifically to help out with compliance. What it did first, once you executed it, was carry out some kind of enumeration of the entire cloud target. The idea for this first step was to get a snapshot that you could roll back to in case things went wrong.

Because anyway, at the end of the experiment, you've got to get the cloud back to how you met it. After doing that, it would select, out of this whole set of resources, the resources to attack. And then obviously there were ways you could define the fraction

that you wanted to be attacked, and [00:33:00] you could also select the intensity of the attacks. Based on that, it would construct the attacks, inject them into the various targets, and then generate a basic report that tells you about the findings of the attack.

How long it took, some kind of interesting metrics, what IAM detected. And then, at the end of the day, you could tell it to roll back to the good state. So that was more or less how it worked — it wasn't super complex, just a basic prototype, but it did the job for us.
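A minimal sketch of the lifecycle Kennedy describes — snapshot, select a fraction of targets, inject, report, roll back — might look like this. The cloud state is modelled as a plain dict instead of real AWS API calls, and all field names are illustrative, not CloudStrike's actual implementation:

```python
import copy
import random

# Sketch of a CloudStrike-style run: 1) snapshot the enumerated state,
# 2) pick a fraction of targets, 3) inject the fault and note whether a
# (mock) monitoring layer saw it, 4) roll back to the snapshot.

def run_experiment(cloud_state, fraction, rng):
    snapshot = copy.deepcopy(cloud_state)           # 1. enumerate & snapshot
    groups = sorted(cloud_state["security_groups"])
    count = max(1, int(len(groups) * fraction))
    targets = rng.sample(groups, count)             # 2. select targets
    findings = []
    for name in targets:                            # 3. inject the fault
        cloud_state["security_groups"][name]["open_ports"].append(22)
        detected = name in cloud_state["monitored"]
        findings.append({"target": name, "detected": detected})
    report = {"attacked": count,
              "detected": sum(f["detected"] for f in findings)}
    cloud_state.clear()                             # 4. roll back
    cloud_state.update(snapshot)
    return report

state = {"security_groups": {"web": {"open_ports": [443]},
                             "db": {"open_ports": []}},
         "monitored": {"web"}}
print(run_experiment(state, 0.5, random.Random(0)))
```

The deep-copied snapshot is what makes the "get the cloud back to how you met it" step trivial in the sketch; against a real cloud you would record and replay the API mutations instead.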

GlennW: [00:33:34] And Aaron, you had ChaoSlingr. I believe it's not been deprecated as such — it just hasn't been worked on for a while — but I guess that had a similar approach, didn't it?

Aaron: [00:33:42] Yeah, you're right, it's not maintained anymore. I left UnitedHealth Group two and a half years ago; I was the core sponsor for that project. And it was the first open source project for the company — the largest healthcare company in the world — and its first open source tool was a tool that proactively introduced security failure into the system.

It was a [00:34:00] big, big deal. You have to imagine: banks are pretty risk averse, but healthcare is a whole different thing, right? Because with a bank you get your money back if there's a breach. But you can't undo it when somebody finds out you had too much fun when you were 22 — if somebody finds out your health data, you can't really change that or get it back.

Right? It's out there, you know? So ChaoSlingr is now deprecated, but they rebuilt it — into their CI/CD pipeline internally at the company, I think — and they don't maintain it publicly anymore. I don't know if they open sourced anything else with it, but it still represents an easy framework for running the experiments.

That's why I advocate for people to still go into the repo. The other cool thing about the tool: it wasn't originally called ChaoSlingr. It was originally called Poo Slinger. Okay — we were trying to figure out, what would a Chaos Monkey-style tool for security look like? So we went through the Go code.

Chaos Monkey is written in Go, and we were like, oh, we don't need half this crap. So then: okay, well, let's name it after a monkey. What's [00:35:00] security related with a monkey? What do monkeys throw? They throw poo. That made the project super fun for some of the best engineers in the company to work on.

I mean, we were building this off the side of our desks. But then the marketing team said we can't open source "poo", so we changed it to ChaoSlingr. That's a little not-so-well-known thing — I don't know if Kennedy knew that — but that's why it's called ChaoSlingr.

So we originally had four experiments for the tool; we open sourced, I think, two of them. But we needed a main experiment that people could resonate with, to understand what the heck we were doing. So we picked misconfigured, unauthorized port changes, which were happening internally on our firewalls.

On some of the physical systems in the data center, and it was also happening in the cloud. And it doesn't matter if you're a firewall engineer, software engineer, systems engineer, sysadmin — everybody kind of understands what a port is, everybody kind of understands that there are 65,000-plus of them, and what a firewall is supposed to do.

[00:36:00] So what we did with ChaoSlingr: we would act pseudo-randomly on an EC2 security group that had a reference tag — the "poo" tag, as it was originally called. Most chaos engineering tools have an opt-in/opt-out tag, because you want an opt-in/opt-out method. You may not want to opt in on your edge firewall, your internet-facing edge.

You may not want it opening or closing a port there, for example. But we would open or close a port that wasn't already open or closed. Okay? That's what ChaoSlingr did: it would select from all the security groups that were available for this, and pseudo-randomly pick one.

And then — Kennedy knows this very well as well — what Slingr did was introduce the change: open or close a port that wasn't already open or closed. It does that check, because if you open a port that's already open, it doesn't do anything, that kind of thing. And then the tracker function tracks and reports the information to Slack.

So you know what's happening. We didn't want to get emails, we didn't want to go look at logs — we wanted to just see what was happening as we ran the [00:37:00] experiment. That's why we did the Slack piece. And that's kind of how it works. But you can easily take that same model and do it for an S3 bucket.

You can do it for an unauthorized user, unauthorized access; you can do it for internal network access — things that shouldn't be able to communicate; you could simulate a connection to an external source, right?

You could write the same experiment with a different target — it represents a model for doing it, if that makes sense. You have to select the targets, and if there's not an opt-in framework for them, you should consider that in your design.

You have to inject the failure, and you have to track what happened — what were the results. That's what it does, and that's why it's still, referentially, a good model.
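A rough sketch of that experiment model — an opt-in tag, a pseudo-random target pick, a port change that isn't a no-op, and a report message — assuming invented tag and field names (this is not ChaoSlingr's actual code, just the shape Aaron describes):

```python
import random

# ChaoSlingr-style experiment sketch: only act on security groups that
# opted in via a tag, pseudo-randomly pick one, open a port that isn't
# already open (so the change is observable), and build the report that
# would be posted to Slack.

OPT_IN_TAG = "chaos-opt-in"  # illustrative tag name

def slingr_once(security_groups, rng):
    candidates = [g for g in security_groups
                  if g["tags"].get(OPT_IN_TAG) == "true"]
    if not candidates:
        return None  # nothing opted in, do nothing
    group = rng.choice(candidates)
    # pick a port that is NOT already open, so injecting it changes state
    port = next(p for p in (22, 80, 443, 3306, 8080)
                if p not in group["open_ports"])
    group["open_ports"].append(port)
    return {"text": f"opened port {port} on {group['name']} - did anything notice?"}

groups = [{"name": "edge", "tags": {}, "open_ports": [443]},
          {"name": "app", "tags": {"chaos-opt-in": "true"}, "open_ports": [443]}]
print(slingr_once(groups, random.Random(1))["text"])
```

Note how the untagged "edge" group can never be selected — that is the opt-in safety property Aaron says you should design in from the start.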

MM: [00:37:46] Great stuff. So we've defined experiments and we've executed them. What do we do with the results?

Aaron: [00:37:54] I can take a stab at that, Kennedy, if you want. What do you do with the results?

So this is a higher-order question, [00:38:00] really, in my opinion. Most people always focus on running the experiments — and that sounds crazy enough, right? Although it's actually, like Kennedy said, very, very practical compared to a lot of other crazy things that we do in security.

But most people aren't thinking about: what do I do with the results? How do I evangelize it? How do I make the case for value? How do I prove it? You have to be thinking about that, because if you're going to adopt a new practice, you're going to have to explain why you did it — why you used that time.

Why you used other people's time, and what value you got from it. So it's important that you build that into your model of how you're going to do things. Start small — something built, executed, and run is better than an idea. If you can build it and run it and show the value in a lower environment, show them how it works.

Then you can evangelize what you're doing. If you read the Capital One case study in the O'Reilly report on security chaos engineering, David Lavezzo goes through that — how he did it. Every time he ran an experiment, he went by and said, hey [00:39:00] guys, this didn't work.

I'm not trying to pick on you, but I want you to know it's not working — we put this in place. He was originally asked to evaluate new tools, but then he started saying, I want to evaluate what we're doing now. And he said: it's not working, it's not working. We thought it worked; it doesn't.

And he started communicating every time he did an experiment — to the teams he was working with, letting them know and being an advocate. I think that really helps towards the business, communicating towards the business.

But if you talk to anybody doing chaos engineering in general at the more mature companies, the banks, they'll tell you that's the next challenge. It's less about the tools — you'll get that down, you'll figure that out eventually. The next story is evangelizing the results and organizing them in a meaningful way.

And then it's expanding the scope: new teams, new applications, repetitive tests — it gets sophisticated. That's kind of how I see it.

Kennedy: [00:39:57] Yeah, I think this question of what you do with [00:40:00] the results is a very important question.

And from a security standpoint, firstly you have the challenge of what system is going to consume the results. That should be a core factor in deciding, as Aaron said, what format you want the results constructed in. Is it a YAML file or a JSON file?

Do you want the results in some existing log format — Apache, whatever? That's something you're going to decide based on your system. And what we did for our experiments — because in the cloud, especially on Amazon Web Services, you've got these whole systems that are just

based on APIs. So if you understand how the APIs work, you can just decide to act on them. What we did was, for example — we were testing, I think, security groups. And okay, if you conduct a test and a security group fails, there are firstly [00:41:00] two things, right?

The first thing is, if you're just testing whether there's an alert, then you have to construct a rule that will trigger that alert. On AWS that means creating a CloudWatch rule that triggers an alert about that event. That's one layer of doing it. It's not full cycle, because the problem still exists.

You can step forward and say, well, I can also go there and create a policy that blocks that port, or whatever. So there are various layers, and if you already have a tool that does that, okay — it's about how you send that message to the tool and hand over the responsibility to the tool to carry out what it's sitting there for. So there are a lot of ways of thinking about what you do with that information.
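That first detection layer — a rule that alerts on the injected event — can be sketched as an EventBridge/CloudWatch-style event pattern. The matcher below is a deliberately simplified stand-in for the AWS rule engine, though the pattern mirrors the general shape of an EC2 CloudTrail event delivered via EventBridge:

```python
# EventBridge-style rule pattern: alert on EC2 events where a security
# group ingress rule was added. Simplified matching semantics: every key
# in the pattern must appear in the event with one of the listed values.

ALERT_PATTERN = {
    "source": ["aws.ec2"],
    "detail": {"eventName": ["AuthorizeSecurityGroupIngress"]},
}

def matches(pattern, event):
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            if not isinstance(event.get(key), dict) or not matches(expected, event[key]):
                return False
        elif event.get(key) not in expected:
            return False
    return True

# The event a ChaoSlingr-style port-opening experiment would generate:
injected = {"source": "aws.ec2",
            "detail": {"eventName": "AuthorizeSecurityGroupIngress",
                       "requestParameters": {"fromPort": 22}}}
print(matches(ALERT_PATTERN, injected))
```

The experiment then simply checks: did the injected misconfiguration produce an event that this rule caught, and how quickly — which is the "one layer" Kennedy describes before moving on to automated remediation.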

Another way I think about it is constructing a sort of knowledge base. In the end, you're going to repeatedly carry out these experiments, and these environments are continuously changing. So eventually you have a huge bag [00:42:00] of experiments and results, and you can just say: construct an attack for me — go check in that bag, pick any one of them, and use it.

And that even helps you to — I don't know what you call it — track your system, how it's progressing over time. That whole body of knowledge can be used for making decisions in the future.

Aaron: [00:42:22] It's good for confidence scores as well. That's another thing you can do when you have that kind of regression analysis over time: you get to score the system. The metrics tell you how confident you are.

I have this much information to tell me that when this kind of thing happens, I'm at 90% — some confidence interval or whatever.

GlennW: [00:42:44] I always feel if you collect trend data as well, so it becomes evidence in its own, right. For justifying why you're doing this as well. You know, you can see improvements over time.  

MM: [00:42:54] Who tends to introduce security chaos engineering, or chaos engineering, [00:43:00] into an organization?

And who tends to own it? Is it security operations, or some other part of the business?

Aaron: [00:43:08] So the answer I'm going to give you is a mixed one, because we're still in the early days of security chaos engineering. I'm collecting the stories, but I don't have all of them.

Right — because I do all these podcasts and conferences, and people eventually reach out to me or Kennedy and tell us what they're doing, you know? That's how I heard about Capital One, or how we heard about Cardinal Health and other companies that are not going to be in the book.

But anyway — chaos engineering, I've noticed, can be done at the team level; an individual product team will sometimes do it. Sometimes an SRE will orchestrate it. Sometimes SREs are a central function at a company and will want to own chaos engineering, how it's done, and the tooling, because it was always kind of an SRE practice — even at Google it's DiRT, the disaster recovery testing they do there. So: SREs, a centralized model, or a decentralized model with different teams or products. With [00:44:00] security chaos engineering, it's still kind of interesting to see how it's unfolding. I started doing it when I was a chief security architect at UnitedHealth Group.

It turns out that the people who contact me the most to get started doing this are chief security architects. You know, architecture is a function that needs to evolve.

Like, how do we do what we've done? TOGAF never worked, right? SABSA never really worked either. So what do we do now? How do we be effective? A lot of times, architects are some of the most seasoned, knowledgeable engineers — you've got to have built some stuff before to know how to architect it correctly. Anyway, so chief architects are another one. I'm seeing threat hunting teams adopt it, which may also make sense, and I've seen cloud security engineering teams as well. Otherwise you're seeing the direct platform teams — Kubernetes teams — the people that run the platforms, who sort of want to control their own destiny. That's where a lot of people are doing it: I don't want the security team messing with my system.

I want to [00:45:00] prove it myself and say, hey, here are the results — stay away. That's why I see engineers wanting to do it. Anyway — Kennedy, your observations?

Kennedy: [00:45:09] I think it's a very difficult question that doesn't have one answer. It depends on the organization.

I mean, look at the way security is treated in DevSecOps. When I was interviewing for my current job, I interviewed at some companies that don't have a security team. The security team is embedded in the various teams, and they only have meetings where they come in for the security champions meeting, then they just go back. Only the CTO or some VP takes ownership — he does security, and if there's a security problem, he can just pull in whoever to handle it. So I think even the dedicated security role is kind of going away. For example, Aaron talked about the cloud security engineer, which is a [00:46:00] very challenging role, because to be a cloud security engineer you have to pick up everything about security.

Application security, programming, network security — because in the cloud, all of these things are jumbled up and you just have to take care of everything. So if you add security chaos engineering into all of this mess of things in the cloud, who takes care of it?

It's really difficult. But apart from that, I had a discussion with one of the big VC companies, and he was interested in chaos engineering. After I described it to him, he asked the same question: who are you going to sell this stuff to?

You know? And in the end he said he thinks it's going to be the SREs, which makes a lot of sense, because in one company where I worked, the SREs actually do the security and the security guys only do compliance — they don't do security, you know? So in that kind of [00:47:00] company, the SREs will do chaos engineering, even the security chaos engineering. In other places, maybe it's more like DevOps. So I think it's really very new. As Aaron also mentioned, it's a new trend, a new topic, and it depends on who starts it in the company and who is ready to jump into the boat and take on all the responsibilities — for example, if they're going to acquire a tool.

Like, I spoke with some companies that just wanted to acquire a tool as an entry point to chaos engineering. So they look for an established tool, because it's easier to explain to management — it's a commercial tool. In that case, whoever triggered this whole decision-making has to handle it and select it for himself and his team.

So it's a very crucial question that really has no answer. 

GlennW: [00:47:55] I think ultimately what we don't want is to see recruiters come up with a security, chaos [00:48:00] engineering role, 

MM: [00:48:02] Is there anything else you guys want to share with our listeners?

Aaron: [00:48:06] I would just put this out there: Kennedy and I are freely available. Our information is out there on the internet — at least mine is. My DMs are open, my phone number's out there. Don't call me trying to sell me crap. But if you want to talk to me, that's okay.

I know you're still going to try. Reach out to me on LinkedIn, email me — I really, really want to hear from people. You know what's funny? When I was junior, I used to always think: he's not going to answer me.

He's not going to see it; they'll say no to me. I was on a podcast — a day-in-the-life-of-an-entrepreneur kind of thing — and I was trying to tell people: believe that 99% of people on the internet will actually talk to you, as long as you're genuine. Right? And anybody who won't talk to you — you didn't want to talk to them to begin with.

Kennedy: [00:48:56] So I mean about reaching out to Aaron, I think that was how I [00:49:00] met Aaron.  So I never knew Aaron anywhere. And I was just like, you know, trying to understand if I was, I was bullshitting myself or I was, doing something foolish. And I didn't find anyone around to like ask this question. And so I started looking around in the internet and I saw a couple of Aaron's articles and I hestitated  for a long time, not reach out, but, you know, I went to conferences I spoke. And even from the academic perspective, he was saying, you don't know what you're talking about.

And I needed someone who would just say it — just one person who would say, go ahead. So I sent a LinkedIn message to Aaron, and he answered rapidly. He read my paper — from the questions he was asking when he responded, it was evident he had read the paper back to front.

And within the next few days we were already on a video call, to which he also invited Jim. That was all I needed — I was all in. So I too am open [00:50:00] to answering questions. I'm really eager; I want to talk to people — people who have questions, people who are not very sure about this field, even people who don't like it.

And I want to hear those kinds of stories: "what are you doing, it's rubbish" — I love to participate in that kind of critical examination of what I'm doing. It's going to be a great learning experience for me, too.

GlennW: [00:50:24] So, just to put this into context: Kennedy, you've written a paper called "Chaos Engineering for Security and Resiliency in Cloud Infrastructure".

You've also contributed to the report that Aaron wrote — an O'Reilly publication called Security Chaos Engineering. And I believe, Aaron, you're also in the midst of writing a book, which should be out in early 2022.

MM: [00:50:45] Well, thank you, Aaron. Thank you, Kennedy, for coming onto the show.

Aaron: [00:50:49] Thank you for having us. Thank you so much. I appreciate it.

GlennW: [00:50:52] Thank you so much, guys.