Privacy Please

S5, E218 - CrowdStrike: The Quiet Part Out Loud - World's Largest Recovery Incident

Cameron Ivey


What if a single cybersecurity incident could cost your company billions? On this episode of Privacy Please, Cameron Ivey and Gabriel Gumbs dissect the monumental CrowdStrike incident that knocked 8.5 million Windows machines offline and sent shockwaves through the IT community. As a dedicated Linux user, Gabriel lends his unique perspective, backing CrowdStrike amidst the backlash and exploring why public overreactions might be misplaced. We dig into the staggering five-billion-dollar financial toll and stress the importance of recognizing technology's imperfections. Our hosts also revisit cybersecurity fundamentals—confidentiality, integrity, and availability—emphasizing the crucial role of backup and recovery in maintaining these principles.

But it’s not just about understanding what went wrong; preparation is key. Using the CrowdStrike incident as a pivotal case study, Cameron and Gabriel offer actionable advice on disaster recovery strategies crucial for any business. They break down the shared responsibility model in the cloud, highlighting how data and identity still lie in the hands of the customer. From adhering to the 3-2-1 rule of data protection to automating recovery processes and keeping offline guides, they cover practical steps to bolster your recovery plans. With industry-specific insights and tips tailored to sectors like healthcare and aviation, this episode provides essential guidance for any organization looking to proactively navigate IT disasters.


Speaker 1:

All righty then. Ladies and gentlemen, welcome back to another episode of Privacy Please. I'm your host, Cameron Ivey, alongside the other host, Mr. Gabriel Gumbs. Gabe, how we doing, man? Take two. Did it freeze on us?

Speaker 2:

No, we're good, we're good. We're live and we are Memorex, and I'm doing well, sir. How are you? It's been at least a solid week since the IT world imploded, so we didn't do much last week on the recording front because the whole world was dealing with a lot. Yeah, as was I. As was I. But I'm doing well. I am recovered, for a million facts.

Speaker 1:

Were you one of the 8.5 million Windows computers that were affected?

Speaker 2:

I was not. I am not a native Windows user for my everyday driver, and for much of what we do on the business side of things, those are all Linux machines. So the answer for us was zero impact. But I would be lying if I said that, A, I'm not a fan of CrowdStrike, that'd be a lie, because I am. B, we actually almost went with CrowdStrike. We were very close to choosing CrowdStrike. And, C, I haven't written them off. The only reason we didn't go with it originally was because of some other things we were already able to take advantage of that satisfied our requirements, but they're literally still within consideration. Last week's event certainly did not sour me. I don't know how everyone else is feeling. I think there's a mixed bag, but I think most people are coming down on the side of this where I am.

Speaker 1:

Yeah, and I can say that one of the reasons why we didn't do an episode so suddenly is, sometimes it's good to let things sit, to think about some other angles to talk about the situation.

Speaker 2:

Let the dust settle a touch, yeah. Because there were a lot of overreactions, in my opinion. Oh, so many overreactions. People lost their minds at CrowdStrike. At first they were like, how in the world dare you? And is there no process? Don't you QA? And why would you release it to the whole world? I mean, there were and weren't valid points in some of those criticisms, but it's like, breathe.

Speaker 1:

Yeah, it's going to be okay. You know what that actually showed you? It's that when technology does go out one day, just everywhere, people are going to lose their minds.

Speaker 2:

To be fair, the number, the financial impact number, that has been bandied around is in the five billion dollar range. So my telling people to breathe through five billion dollars in losses is obviously somewhat comical, but I'm still going to tell them to breathe. It's better than holding your breath and turning purple. You know, there were 8.5 million Windows machines affected.

Speaker 2:

Apparently. And for what it's worth, I mentioned a minute ago we are a Unix shop, right, for the most part. CrowdStrike actually had an issue with their Unix agent in the last 12 months also. It just wasn't big news, largely because it didn't affect a lot of end users, things of that nature. A Windows machine breaks in the middle of the forest, and even though there's 10 people around, no one heard it.

Speaker 1:

Gabe, I got a question that is lingering in my mind: what crowd are they trying to strike?

Speaker 2:

I mean, hopefully it is the crowd of ransomware. I don't remember what the organization says the name means. I thought it had something to do with how it collected data across its own nodes and others, et cetera. As a SaaS service, it learns from incidents that it might find in other environments and it updates everyone's information based on that. Don't hold me to that. I am far from a CrowdStrike expert, not even a novice, other than being a bit of a fanboy. You should not turn to me for information about how the platform works, other than I'll say the following: all of the people that I know that have used it have praised the way it works, right up to that one little tiny thing.

Speaker 1:

Well, hey, I mean, it just goes to show that technology is infallible.

Speaker 2:

It is not infallible. It is not infallible. Our slogan at Myoda is actually operate unbreakably. Operate unbreakably because, although we know things aren't infallible, keeping your systems running and keeping them from breaking is a very, very important part of data security in and of itself. Now, here's an argument that has been kind of floating around online: whether or not things like backup and recovery are part of cybersecurity. I would argue that they are, because the tenets of cybersecurity are confidentiality, integrity and availability. Confidentiality: one must keep the data confidential. Those that do not have the right to access the data shall not be granted the right to access the data and be able to see it. Integrity: the data shall remain in a preserved state, a rightfully preserved state, the state it is expected to be in. And then availability: data has to be available for use by the systems and people that need it. If you compromise any of those, you have a security incident.

Speaker 2:

The classic availability attack people think of is a DDoS attack, right? Like, I will saturate your network or your processing capacity so that you can no longer operate. That is an availability problem. Ransomware itself is literally an availability problem. If you encrypted your data and ransomware came along and re-encrypted the data, you would have preserved the confidentiality of that data. I can't read it, it's still encrypted, but now you can't get to it either, unless you pay me to get the encryption key. That's an availability problem. In fact, it's also an integrity problem, because I've changed, I've altered the data. I've altered it such that it is now re-encrypted and it's now useless to you in that form, right? So the integrity and the availability have been attacked, and I think there's this myopic view of what cybersecurity is and what it isn't.
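(To make that mapping concrete, here is a small, purely illustrative Python sketch that just encodes the argument above: each class of incident compromises at least one CIA tenet, and any compromised tenet makes it a security incident. The incident names and the mapping are examples chosen to match the discussion, not taken from any standard or product.)

INCIDENT_IMPACT = {
    "data breach": {"confidentiality"},
    "ddos": {"availability"},
    "ransomware": {"integrity", "availability"},  # data altered and held unavailable
    "bad agent update takes hosts down": {"availability"},
}

def is_security_incident(incident: str) -> bool:
    # Any compromised tenet makes it a security incident, per the argument above.
    return bool(INCIDENT_IMPACT.get(incident))

for name, tenets in INCIDENT_IMPACT.items():
    print(f"{name}: {sorted(tenets)} -> security incident: {is_security_incident(name)}")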

Speaker 2:

And so, you know, one of the meme-y raging debates that was going on post-CrowdStrike incident was the fact that a security tool caused a major outage that kept the IT people working all weekend. It was the IT teams and the infrastructure teams that were working all weekend to recover the business, not the security teams. I'd be lying if I said there were zero security people working all weekend; that's obviously an overgeneralization. Folks, you know the protocol: if you're going to bother adding me, just go ahead and skip right over that part of your day. Don't add me. Find something else to do with that five minutes every day; I'm not responding. So there's been a little bit of, I'll use the word tension, maybe not friction, but tension, between: is this a security problem or is this an IT problem?

Speaker 2:

I saw Ali Menon from Gartner weighing in on this, Chris Hoff was weighing in on this, Cole Gromis was weighing in on this. We should tag a few of those folks in this episode; I'd love to get their input and feedback on it also. And so Chris ran a poll asking that question explicitly on LinkedIn, like, do you consider backup and recovery part of security? And like 85 or so percent of people said yes, but the dissenters were very, very strong in their no, it absolutely is not.

Speaker 2:

I understand the nuance between who owns it, who is responsible for it, versus who's accountable for it, versus who contributes to it, versus who just needs to be informed about it. I get the nuance of that, but I don't appreciate the nuance of trying to separate out the meaning of availability. Security has to have a meaningful stake in availability, and traditionally that has just meant network availability and system availability, but not data availability. And arguably, on the system side, it's the same problem. Who's responsible for uptime on the servers? It's not security. Security's not responsible for uptime, right?

Speaker 2:

But one still has to factor in that uptime can be directly affected by cybersecurity incidents. In fact, today the number one impact to availability is a security problem. The number one impact to availability today is not data centers going down, it's not natural disasters, it's none of those things. It's the every-17-seconds ransomware attack. That's the number one impact to availability. So, while we understand that IT is responsible for availability, and even accountable for it to a large degree, the number one thing impacting it today is a security problem. And as long as we're still having this conversation in these silos of them versus us, we're not going to solve, well, A, we're not going to solve the ransomware problem, right? Because solving the ransomware problem, to me, obviously means you have to solve for availability. They're making your data unavailable to you and only making it re-available to you upon payment. We obviously have to solve for the availability problem. And we have to stop conflating uptime with availability and other such shenanigans.

Speaker 2:

We have to stop conflating immutability with data integrity from a security perspective, because every time I hear someone talk about, oh well, the data is immutable, they always fail to mention how the immutability is protected. They never tell me how. For example, let me phrase it this way: if I told a customer I encrypted my data, one of the first things they would ask is, well, how do you protect the keys? Right? That's just the natural question to, oh, I locked it up: so how do you protect the locks and the keys?

Speaker 2:

I see the same thing happening across IT today, where it's like, well, we made the data immutable. Well, how did you protect the immutability of it? Well, the data is always available, we have 99.9% uptime. But how do you preserve the availability of the infrastructure itself, not just of that one system, et cetera? Where I've seen a lot of improvement in this is where we happen to see infrastructure also being owned by security, or vice versa, security having significant ownership over IT. We've never seen security owning IT as a thing that people hold up as a gold standard; by and large, we just don't see that. We always see the inverse, IT having that purview over security. We do see them as peers quite frequently as well, but I think it's a fair argument to say it's still not nearly enough. There needs to be more of a merging of this.

Speaker 2:

Sorry, I've been ranting. It's been a week away.

Speaker 1:

It's been a week. Well, I mean, okay, so with this stat, I want to propose a question to you. It says 97% of sensors are back online, but some companies are still struggling to recover. What does that mean, in terms of they're struggling to recover? Does that mean, like we were talking about offline, that they didn't actually have a recovery plan in the first place? Or, like, a backup?

Speaker 2:

So a lot of organizations.

Speaker 2:

So here's the thing: we've all been talking about the greatest IT outage in modern history, in history, in fact, and that is a true statement. But this is also the greatest IT recovery in history, like the greatest recovery in history. And so what we're witnessing is that a lot of folks haven't been able to recover, for various reasons. Some of them, many of them, had no recovery plans. Many of them had no recovery plans that they had ever tested and executed, and so this was a live fire drill with no training. Some of them simply had no backup as part of their recovery plans, and so, although they could do things like restore the Windows operating system, they weren't always able to get back all of the data. In some cases, and I saw some of these up close and in person, they had their machines rightfully encrypted, and they protected those encryption keys by putting them somewhere else, on another system that they couldn't recover, so they couldn't recover their recovery key.

Speaker 1:

That's an oh no.

Speaker 2:

That's all kinds of an oh no. And so what we saw was, a lot of the sensors are back online because, you know, we can get systems just up and running from scratch, right? Like, just take a gold system image and, boom, back up and running, get a new sensor on there. You know, maybe that's enough to start going, especially in a world where a lot of organizations use a lot of SaaS services. Many of them were impacted somewhat differently, but a lot of those SaaS organizations, by the way, were impacted also.

Speaker 2:

I spoke to a couple of my own customers who had some impacts in other parts of their businesses, where they were just getting calls going, hey, is everything okay? And they were like, no, we're fine. One or two of our endpoints maybe had an issue, but otherwise they were fine; their back-end systems were fine. But there were all of those third-party impacts. A lot of their customers are the major banks, and so they were all calling up going, hey, are my services up and running fine? What was the impact? Because a lot of folks equally kind of forget that the shared responsibility model in the cloud means that data is still your responsibility. Data is your responsibility. Identity and data are your responsibility, period. And they tend to think that that means, oh, my data is going to be safe because I have it in a SaaS platform.

Speaker 2:

Well, you know, that's not untrue. But there's a reason why those SaaS platforms all have, for the most part, capabilities for you to make them part of your backup and recovery strategy. I think a lot of people learned over the last week that they simply weren't prepared for this level of an availability impact. And let me also say that a different way: what's the difference between ransomware and what happened with CrowdStrike? The short answer is just intent. Really, just intent. All the systems were brought offline simultaneously by a bad bit of code. Obviously, there are lots of other huge differences, like CrowdStrike isn't an evil organization and so they're not holding your keys, et cetera. But the immediate impact is the same and the recovery impact is the same. If you weren't prepared to recover, or, said yet another way, if your recovery strategy hasn't been looked at in the last 12 to 24 months even, you're probably not prepared for any significant recovery event. You're just not. You're just not.

Speaker 1:

Well, I would imagine that a lot of companies definitely had a wake-up call during this experience. Just a hypothetical: which type of industry would you most not want to be in when something like this happens? Would it be healthcare? Would it be banking? Who do you think was the most affected by this?

Speaker 2:

Just high level, not even reading into it: if I'm being honest, probably the ones where you get the grumpiest employees, I mean customers, like the airline industry. The moral answer is probably healthcare, because there are real lives at stake.

Speaker 1:

But, damn.

Speaker 2:

I wasn't traveling that day, thankfully, but I wouldn't have wanted to have to deal with that. Every time the airlines go down, when their systems go down, it's the worst. Luckily, some of the airlines are so antiquated, they're using such old systems, that CrowdStrike couldn't run in the first place. Burn. And so they weren't impacted. They were walking around flexing, like, look at us, we're still operating. And it's like, that's because you're using a chisel and a slate. My friends, it's not the flex you think it is.

Speaker 1:

Yeah, airlines are already terrible enough to deal with, but I can imagine. That's a good point, because I know that a lot of flights...

Speaker 2:

The real answer to your question, though, is honestly, I wouldn't want to work for any organization that didn't have a recovery strategy. Like, it's just the worst thing, one of the worst scenarios to be in from an IT perspective is being unprepared for calamity. And you can't prepare for every disaster at all.

Speaker 1:

No.

Speaker 2:

But yeah, some are way more prepared than others. Being able to operate unbreakably is difficult.

Speaker 1:

All right. So, looking to the future, for the next time a situation like this happens, what should companies and people do to help prepare for something like this?

Speaker 2:

At a very high level, there are a number of things that they should do. The first one is an old-school one, right? Following the 3-2-1 rule of data protection: keeping three copies of your data, on at least two types of media, with at least one of those copies offsite. The challenge is that gets expensive, right? Keeping full copies in multiple places is expensive. There are ways that you can go about that that are less expensive.
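(For anyone who wants to make the 3-2-1 rule concrete, here is a minimal Python sketch of an automated check, assuming you keep a simple inventory of where each backup copy lives; the BackupCopy record and satisfies_3_2_1 helper are illustrative names, not from any particular product.)

from dataclasses import dataclass

@dataclass
class BackupCopy:
    location: str    # e.g. "primary datacenter", "tape vault", "cloud bucket"
    media_type: str  # e.g. "disk", "tape", "object storage"
    offsite: bool    # stored away from the primary site?

def satisfies_3_2_1(copies) -> bool:
    # At least 3 copies, on at least 2 media types, with at least 1 copy offsite.
    enough_copies = len(copies) >= 3
    enough_media = len({c.media_type for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

inventory = [
    BackupCopy("primary datacenter", "disk", offsite=False),
    BackupCopy("secondary datacenter", "disk", offsite=True),
    BackupCopy("cloud object store", "object storage", offsite=True),
]
print(satisfies_3_2_1(inventory))  # True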

Speaker 2:

Shameless plug: it's a thing Myoda helps organizations do. The other thing I would add to that is you've got to be able to automate as much as you can, within reason. Your recovery process has to be as automated as possible. And for businesses that are brick-and-mortar businesses specifically, I would suggest they should have some type of offline recovery strategy. Right? How do I process credit cards when machines are down? How do I perform basic services when other machines are down? So have some kind of an offline operating guide as well, too.
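(And a hypothetical sketch of what "automate as much of the recovery as you can" might look like at its simplest: drive the recovery from a host inventory, retry, and report what still needs hands-on attention. The restore_from_gold_image and verify_host helpers below are placeholders for whatever imaging and health-check tooling you actually use, not a real API.)

def restore_from_gold_image(host: str) -> bool:
    # Placeholder: invoke your imaging / orchestration tooling here.
    return True  # pretend it succeeded, for the sketch

def verify_host(host: str) -> bool:
    # Placeholder: confirm the services you actually care about are back.
    return True

def recover_all(hosts, max_attempts=2):
    # Attempt each host up to max_attempts times and record the outcome.
    results = {}
    for host in hosts:
        ok = False
        for _ in range(max_attempts):
            if restore_from_gold_image(host) and verify_host(host):
                ok = True
                break
        results[host] = ok
    return results

if __name__ == "__main__":
    hosts = ["pos-01", "pos-02", "kiosk-07"]  # hypothetical inventory
    results = recover_all(hosts)
    still_down = [h for h, ok in results.items() if not ok]
    print(f"{len(hosts) - len(still_down)}/{len(hosts)} recovered; manual follow-up: {still_down}")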

Speaker 2:

Replicate, replicate. Replication is necessary, replication of those types of offline guides, so that everyone across the business has them; maybe keep them in multiple stores, et cetera. But the same is true of that data as well. Right? You want to replicate that data in multiple locations, having a hot copy and a cold copy of that data, et cetera. The same is true even of the desktops that needed to be recovered: a lot of folks went to their gold standard images to recover systems. These are all things that you need to make sure that you have.

Speaker 2:

Having an actual plan is probably number one. Having an actual plan is number one, absolutely number one. But that plan should be recognized as your safety net. When everything else fails in IT and security, your recovery plan, your recovery strategy, your recovery toolbox is all that you have left. For the billions we spend on prevention, detection, response, et cetera, recovery is probably the most ignored aspect of what we're doing today as businesses. It's almost a foregone conclusion. I think it's because we've been doing it for so long; we just think we know what we're doing, right? Like we've been doing it since the better part of forever.

Speaker 2:

But we have to recognize the world's moved past that. The world has moved well past what backup and recovery used to be. We're talking about operating unbreakably. We're talking about recovery. We're not just talking about, you know, do I have a backup?

Speaker 2:

And the idea that a robust backup strategy just means making more copies is a ludicrous one to me. It's like saying, all right, you know, the way to protect your house is you lock the door, and then you take the key and keep it safe, but then you make like 99 copies of the key. It's like, wait, what? And every copy of the key costs exactly as much as the first copy of the key, so there's this one-to-one cost; it just doesn't really add up. That's not a layered protection mechanism. That's just wearing three pairs of underwear to make sure that when you poop through the first one, you can just peel off another layer of underwear. That's not the answer. No, peeling off dirty layers of unmentionables is not the answer. And let me be clear, that's what the strategy looks like today: wear three pairs of underwear, and make sure you have one over each leg for when something goes wrong.

Speaker 1:

Why is that the norm? Why has it been a religion for so long?

Speaker 2:

It's been religion for a long time, and largely, innovation in this space has only just started to really show itself to others. I mean, I'll take my own as a great example. We've been around for eight, nine years, and it is only recently that folks are really paying attention to both the problem and the fact that there are new ways to solve it. I'll give you the other half of the problem: there's apathy. We already do that. We've always been doing that. Sometimes we just have to look at ourselves in the mirror and, you know, talk to the man in the mirror. The business has changed the way it operates, the way it operates has changed all that time, so why does our recovery strategy still look the same?

Speaker 2:

Yeah, why does it still look the same?

Speaker 1:

That's a good question. I think we can leave it with that. We'll leave it on that note.

Speaker 2:

We'll leave it on that note. Shout out to all the good people at CrowdStrike. We appreciate what you did prior to this incident, and we appreciate what you've done for everyone to get them through this incident. My condolences to everyone that's had to deal with it; it's been a real pain in the ass. It's hopefully a wake-up call. Again, not just that this was the single biggest IT outage incident, but that this was the single biggest recovery incident.

Speaker 1:

In terms of comparison, what was the last big recovery type of incident?

Speaker 2:

McAfee, the same thing, and it happened when the CEO of CrowdStrike was also a leader there.

Speaker 1:

Oh no.

Speaker 2:

Oh no. Some days you're the bug, some days you're the windshield. I'll say the following: if you're not breaking anything, you're not moving fast enough.

Speaker 1:

That's interesting. What are you going to do now? Do you think that's a funny coincidence?

Speaker 2:

Oh, it's just a funny coincidence. Yeah, hard things are hard, and hard things at scale are infinitely harder, right? This is not about... I saw some wild claims, like, this guy just doesn't value quality, and I'm like, I don't think so. No, that's not it. That's not it.

Speaker 1:

Yeah, people are going to... Yeah, there are going to be a lot of loud negative voices.

Speaker 2:

Yeah, and it's cool. I appreciate the dissenters amongst the crowd, but you know. Yeah. But on that note, my friend, yep, always good to see you. We've got some live episodes coming up soon. We've got to start announcing, folks. Stay tuned. August 6th, August 6th is the next one; that's going to be dope. We've got one hosted by our friends at Transcend again. We've got another one hosted by our friends at Alter coming up. We have a third one hosted by the folks at Myoda. Yeah, it's going to be exciting.

Speaker 1:

It's going to be exciting, it's going to be exciting, yeah, so thanks for tuning in and we'll see you guys next week.

Speaker 2:

Cheers.
