Tricky Bits with Rob and PJ

Hacking the Apple M-Series via Prefetching Exploits

April 16, 2024 | Season 1, Episode 16 | Rob Wyatt and PJ McNerney

Enjoying the show? Hating the show? Want to let us know either way? Text us!

In the arms race of computational performance gains and big market splashes, security issues can rear their ugly heads, sometimes going all the way down to the hardware level... and when they happen, they can be brutal.

In this episode of Tricky Bits, Rob and PJ discuss a recent article and research on hacking into the M-series chips, using clever attacks to extract data that is, ideally, supposed to be kept secret. 

What are the nitty gritty details of how this attack is successful?

How do company culture and processes affect approaches to hardware development?

And did we back ourselves into a corner by becoming less efficient programmers?

Link to article: 
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

Link to research:
https://gofetch.fail/



PJ:

Hey, everybody. Welcome back to Tricky Bits with Rob and PJ. One of the interesting articles that came out over the last month has to do with Apple's M series of chips: the M1, the M2, and the M3. The issue in question is that researchers have successfully attacked the chip to extract encrypted data, specifically encryption keys. And it's a really unique, or interesting, way by which they've gotten access to this. It has to do with a performance gain that the chip is attempting to do, or does do, which is this topic around instruction prefetching. The M series has a particularly aggressive set of prefetching that it does when it's examining the information coming in, and because of the particular side effects that can occur, this allows attackers to potentially get access to sensitive data. Rob, this is an unfortunate situation, because I think you've pointed out this is occurring all the way down at the level of the silicon, and everyone raved at the performance for so long. But we end up in this kind of situation.

Rob:

Well, one correction I'll point out: it's data prefetching, not instruction prefetching, that causes the problem here.

PJ:

Oh,

Rob:

And it's not something unique to the Apple chips. Everybody does it; it's just how they do it. But before we can really understand what's going on here, we have to go all the way back, way, way back. Why do we do this? Why do we prefetch is the ultimate question. Why do we prefetch instructions? Why do we prefetch data? And what are we prefetching from and to? It's an architectural question that goes back about 30 years, and it all comes back to why we have a cache on processors. The reason we have a cache is because processor speeds and memory speeds are so far apart these days. Today we can get processors that are boosted to over five gigahertz clock speeds. Memory, yes, it's faster than it was, but three or four thousand mega-transfers a second is about as fast as you're going to get. So processors have gone up 10x, 20x, 30x in speed, whatever it may be, over the last few years, and memory has gone up maybe 5x. So there's a huge speed discrepancy. And this started in the 80s - the original ARM 3 processor, the 386 processors back then, had the same problem. Processors were getting much faster than memory. Maybe it was 30 megahertz for the processor and 8 megahertz for the memory. What happens is, if you access memory, you have to make the processor wait, and it takes time. So to fix this problem, we added cache. Cache runs at the same speed, or almost the same speed, as the processor - one, two cycles, maybe, of access time. So you load data, it's in the cache, you load the same data again, and you get it for free. You don't pay that delay. But what happens if you make the memory orders of magnitude slower than the processor? You end up with a problem where you miss the cache and the processor has to wait a thousand clock cycles while it sits there doing nothing. And that's not acceptable. So every architectural improvement in processor speed is basically addressing this problem, and it has been that way for, like I said, 30 years. So what did we do? We started doing things like, well, first of all, we'll split instructions and data, so they're in their own caches, and the way you prefetch each of them is very different. Instructions have branch predictors and branch target predictors to have a guess as to, okay, the code's going to go this way. I assume this branch is going to be taken, so I'll fetch instructions from that side of the branch and won't go the other way. And if it's wrong, it throws them away, comes back, and fetches the real way. So it only pays the price if it's wrong. And if it can predict early enough, it doesn't pay any cost; if it can't predict early enough, it pays a percentage of the full cost.

PJ:

I was going to say, from a statistical standpoint, if you can keep that prediction accuracy high enough, like 90 percent...

Rob:

and that's where it is.

PJ:

...you effectively are able to guess the right path and therefore proceed unimpeded most of the time. It's only a sort of hiccup if you mispredict every once in a while.

Rob:

Yep. So that's the instruction side, and it has its own system for fetching instructions: where is the code flow going to go? On the data side, it's a whole other prediction system. Initially it was just: have you used this cache line before? Is it in the cache? If so, it's cheap. And we started adding prefetches. So, you know, I know I'm going to access this data, so I'll issue an instruction to say, prefetch this before I get to it. How far ahead do you prefetch? Because the discrepancy between CPU and memory speed is variable on everyone's machine. You might have high-speed memory, I might have medium-speed memory, same CPU, same code, same optimization, and the timing changes. So manual prefetching is actually very difficult to implement efficiently, but those instructions exist and have existed for a long time, and they did work for a while. And the cool thing about the prefetch instructions is you can give them a hint: I am not going to use this again, so prefetch it, but don't leave it in the cache. So it's not polluting the cache with lines that could be taken by other memory which is more useful to you. So there are some benefits to these non-temporal prefetches, as they're called. That's where we went initially. But then we started to do things like, well, let's have a look at what data is going to be accessed. And this is where it gets really difficult. Out-of-order processors are needed to get here. You need the instruction prefetcher to be way ahead of the actual execution point, where the instructions are being retired from, because that gives the processor a huge window to look ahead at load instructions and be like, okay, I see a load here and I've computed the address already, so I can issue this load. And it may be hundreds of cycles early, which takes away that cache load delay. It also may be wrong. And this is where we get into speculation errors and side channel effects of speculation. There is an unspoken contract, which has been violated a lot in architecture, which says: whatever the hardware does, there should be no visible side effects. So say the processor predicts, I'm going to go down branch A, and it goes down branch A and starts fetching instructions, and in those instructions there are load instructions, and it starts issuing cache fetches for those loads. Those loads are now in the cache early enough that they could be used without penalty. But what if that branch was predicted wrong? It doesn't flush those loads back out of the cache, it just leaves them there. So the processor gets to the retirement point of this branch, which it predicted to go down A, but it actually went down B. Now it'll throw all that work away as far as executing instructions goes, but the prefetches are still visible. It starts executing down path B, and there are visible side effects from the path it didn't go down. This is the root of the Spectre bug that we had a few years ago: tricking the processor into prefetching things and seeing the side effects of that. And the way that we see these side effects is by basically probing the cache, seeing how long it takes to access a memory address. If you can guess the address that was fetched, you can then access that in a different thread or a different process and see the fact that, oh, this memory was accessed quickly, so it must be in the cache, whereas if it's not in the cache, it takes a long time to access.
And then by doing timing - side channel timing effects on the cache - and statistical analysis, you can figure out what the speculation of the processor did. And that caused all sorts of problems back in the Spectre days. But those problems, these side effects which are supposed to be invisible, still exist, and the recent Apple problem is related to them. So going back to memory prefetching, and specifically data prefetching: we talked about, okay, we can see an address that we're going to load from, let's prefetch that. The processors also got a little smarter in spotting patterns. Say you're loading through an array of data structures and you're accessing memory every 600 bytes. The processor will spot that and be like, okay, I see you're fetching at 600-byte intervals, I'll just automatically fetch far enough out - maybe not fetch the next one, but I'll skip three and fetch the fourth one, and go from there, and hopefully get this into the cache in time. And obviously the last time around the loop it might get that wrong; it might fetch one that you don't need. Again, it should be invisible, and probably isn't if you look close enough. So we have these automatic prefetchers, which will fetch data in sequential patterns, and we have prefetches you can hint. But what if the data isn't so easy to prefetch? What if the prefetch destination is dependent on the data itself? For example, a linked list. If you load a linked list node, if you never access the next node in the data, the processor will never see it. And it's totally random where the next node is, so it can't predict anything. So we added things called data-dependent prefetchers. What they are is: as you load a node - even before you issue the load for the next node, which is typically too late to prefetch - the processor will look at the cache line that the node's in and go, oh, I see things that look like addresses, and I will prefetch those just in case. And that's the root of the problem on the Apple processors. Now, Intel do the same thing. They have a data-dependent prefetcher, but it's not as aggressive as the Apple one. The Apple one will basically look at multiple levels and fetch from there, so it's far more likely to have visible side effects. And in fact the GoFetch people - that's the name they gave the security flaw they found - can't get it to work on an Intel or AMD processor, but they do get it to work on the Apple processors. Another reason is Intel have a lot of controls which can switch these things on and off, and Apple don't seem to have them, at least in the M1 and the M2. Maybe the M3 has some controls. And it all comes back to this violation of a contract: the hardware isn't doing what the software said.
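To make the probing idea concrete, here's a minimal sketch of the timing measurement Rob describes: walk a buffer, then time loads and use the latency to guess whether a line is still in the cache. The buffer size, threshold, and use of clock_gettime are our own illustrative choices, not anything from the GoFetch code; a real attack needs a much finer-grained timer and careful control over eviction, but the principle is the same.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64u * 1024 * 1024)   /* big enough that early lines get evicted */
#define THRESHOLD_NS 60                /* assumed hit/miss cutoff; calibrate per machine */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single load from *p. */
static uint64_t time_load(volatile uint8_t *p) {
    uint64_t start = now_ns();
    (void)*p;                          /* the probed access */
    return now_ns() - start;
}

int main(void) {
    uint8_t *buf = malloc(BUF_SIZE);
    if (!buf) return 1;

    memset(buf, 1, BUF_SIZE);          /* walk the whole buffer: buf[0] was touched
                                          first, so it's the most likely to be evicted */

    uint64_t cold = time_load(&buf[0]);   /* probably a cache miss */
    uint64_t hot  = time_load(&buf[0]);   /* same address again, now definitely cached */

    printf("first access:  %llu ns (%s)\n", (unsigned long long)cold,
           cold < THRESHOLD_NS ? "looks cached" : "looks uncached");
    printf("second access: %llu ns (%s)\n", (unsigned long long)hot,
           hot < THRESHOLD_NS ? "looks cached" : "looks uncached");

    free(buf);
    return 0;
}
```

An attacker runs this kind of probe over addresses they can guess, and the hit/miss pattern tells them what the victim (or the prefetcher acting on the victim's behalf) pulled into the cache.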

PJ:

So what you mean by this is, the hardware is trying to be smart, but it's over-eager. For example, take a data structure that has sort of non-coherent memory, like a linked list or an STL map or set, where you could be bouncing all over the place through pointers. The Apple prefetcher is basically examining every single one of these nodes, picking out anything that looks like memory addresses and going after them, even though you may not actually be exploring that in the software. You may never go down any of those nodes...

Rob:

And it might not actually be an address - and there's the problem. Obviously there's a lot of logic here. It's not just going, oh, that's an address, I'll fetch it, because what counts as an address? It has to have some smarts to be like, okay, this is an address which makes sense for this process, or for what I've seen before access-wise. So there are some smarts there. It's just how far those smarts go, and how many levels of prefetch you do. If you fetch the next cache line, do you look at that? Do you look at the next one you fetch? Do you recursively do this? All of these smarts are where the differences between Apple and Intel are. And Apple was a little over-aggressive, which probably gave them a few percent - and that's all we're talking about, a few percent speed increase - but it breaks this fundamental contract between what the software and the hardware are doing. So how does it actually work? What is this GoFetch doing? It's looking at the side effects of this prefetching, and then making fake data to reveal a security key. You can attack other things with it too, but obviously security keys are the big one. Is this a big deal? Kind of, yes, because you could have another non-root process running alongside some app that's using OpenSSL, and it could derive the private keys, or the AES key if it's a symmetric system. It takes a long time, but it doesn't take that long, and it doesn't take that much processing power. So if you had a rogue process on your Mac, you may not know it's there. It's just sitting there trying to guess keys, and it may guess wrong most of the time, but in the end it will get the key, and it seems to be pretty reliable from the tests. I've not seen the code yet, but from the videos of the tests, it seems to be pretty reliable given enough time. So how is it doing this? Encryption code, first of all, you should never write yourself. You should use OpenSSL or something that's been written by somebody who really knows encryption. Because you don't want the code to have different execution paths based on the bits of the private key. If it does a lot more work for a 1 bit than it does for a 0 bit, then it's really easy to detect by cache analysis, power analysis, thermal analysis - there's a whole bunch of ways to determine how it executed various bits of the key, and then it's easy to derive the private key, regardless of the data. So you do things like, okay, we'll write code that has no difference in execution time for zero and one bits of the key. And Intel have a whole execution mode where you can do constant-time execution for a whole bunch of instructions regardless of what the input parameters are. So if you're loading from memory, it's this cost; if you're loading from memory plus an offset, it's the same cost. They're all slower instructions - you don't get the maximum performance of each variation of the instruction, they all go to a lowest common denominator - but it makes it real easy to do constant-speed execution. And same thing: you want constant branching. You want everything to be as constant as possible, regardless of what the input key is, so you can't side-channel attack the key. And this is where we get into the territory of the hardware violating that software contract. Because the software went through all this effort to make sure there are no visible side effects: every instruction takes the same amount of time, there are no branches, there's no zero/one variation, blah, blah, blah.
There's a whole list of these things that good security code implements. But the hardware then just goes and runs wild with it. It's like, oh, I'll do this, I'll do that, I'll speculate, I'll prefetch - and you start to reveal data about the key. And that's exactly what GoFetch does. GoFetch injects fake data on the data side. You've got the data side and you've got the key side, and the key side is what you don't know, but you can control the data side. So you start injecting data you control - things that look like addresses and whatnot. There's a few hoops to jump through, but basically you're generating fake data to pass through this constant-speed, constant-time, constant-power encryption code, and using the side effects of that fake data from the prefetcher to guess what the bits of the key were. And it takes time, it takes a while, but it's very effective.
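As a concrete illustration of the "same work for a 1 bit as for a 0 bit" rule Rob is describing, here's a small sketch of the difference between branching on a secret and a branch-free, constant-time select. It's our own toy example, not code from any real library, and real crypto code goes much further; GoFetch's point is that even code written like ct_select can be betrayed if the prefetcher starts treating the values it handles as addresses.

```c
#include <stdint.h>
#include <stdio.h>

/* BAD: executes different code depending on the secret bit.  The branch
 * predictor, the cache, and the timing can all leak which way it went. */
static uint32_t leaky_select(int key_bit, uint32_t a, uint32_t b) {
    if (key_bit)
        return a;
    return b;
}

/* BETTER: the same instructions run regardless of the secret bit. */
static uint32_t ct_select(int key_bit, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)-(key_bit & 1);   /* 0xFFFFFFFF if bit set, 0 otherwise */
    return (a & mask) | (b & ~mask);
}

int main(void) {
    printf("%u %u\n", leaky_select(1, 111, 222), ct_select(1, 111, 222));
    printf("%u %u\n", leaky_select(0, 111, 222), ct_select(0, 111, 222));
    return 0;
}
```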

PJ:

You're taking advantage of the fact this was all supposed to run in constant time, but I'm now going to inject fake data to induce a hardware variation that I can detect, which then gives me information about the underlying key. So I can actually find out information by, again, statistically running this enough times. Let's talk about order of magnitude of time. Are we talking that it's going to take hours to crack a key, or seconds, or somewhere in between?

Rob:

It's somewhere in between. Cracking an RSA 2048 key takes on the order of 25 to 30 minutes; a 2048-bit Diffie-Hellman key, 127 minutes, and there's no offline processing on those. When you get to the lattice encryption schemes, which are supposed to be secure against quantum computers, like Kyber-512 and Dilithium-2 and things like that, it cracks those too. So Kyber-512 is 43 minutes, but it also takes about 300 minutes of offline processing time, where the RSA and Diffie-Hellman keys didn't take any offline time. So none of the keys are safe. It's quite bad. And like I said, it's 25 or 30 minutes to crack an RSA key. That's...

PJ:

It's

Rob:

...that's quite quick. And that's how long it takes to be reliable; it may not even take that long.

PJ:

And, you know, to drive the point home, this is all the way down in the silicon, so any fixes need to be some sort of software fixes on top. I mean, it's in the hardware. There's nothing you can do to fix that.

Rob:

Yeah, and it's actually quite hard to avoid these side effects. You've basically got to randomize the code such that the side effects aren't visible, or at least are way down in the noise floor, so you can't isolate what the hardware did. RSA implementations have done this for quite a while - they'll have random bits associated with the key, and it makes it much harder to analyze what the actual algorithm is doing, because from an external point of view you don't know what's the noise and what's the algorithm. So it makes it very difficult to analyze from side channels. So you have to do that, and it comes at a huge performance hit. And if it's a one-off RSA key, that performance hit - even if it's 10x slower - doesn't really matter if you're just doing a key here and a key there. Where it does matter is things like AES, where, yes, it's a symmetric key, so you're not looking for the private half of the key, you're looking for the key itself, because it's symmetric. But that's used all the time. All of HTTPS, all of the drive encryption - everything that's basically encrypted is ultimately an AES key, because it's fast to encrypt. VPNs, all of that is done with AES. And then that key is protected by an RSA key. So getting the RSA key could give you the actual AES key that's being used. Now you can decrypt the hard drive, you can decrypt HTTPS sessions and things like that, and that becomes a huge problem.

PJ:

And to put an order of magnitude on it, did I read correctly in the paper that the mitigation is like a 2x difference? So it would effectively, you know, cut my speeds in half if I was on the...

Rob:

Not really,

PJ:

No.

Rob:

Maybe, maybe not. If encryption is your bottleneck, then slowing it down by 2x is a problem.

PJ:

Yeah.

Rob:

But encryption and decryption are typically not the bottleneck in these systems - the internet itself is. So even if you slow the encryption down by 2x, are you likely to see the result? Probably not. Maybe in some scenarios, but in the grand scheme of things, not a big deal. I mean, Spectre had all these other things that had to be slowed down; some people noticed it, a lot didn't. And I think this is the same. I think the workarounds are viable to make this attack at least difficult, because the data will be in the noise, and statistically analyzing the back end gets a lot harder - it's more difficult to pinpoint what actually happened. And if it's 2x, who cares - it's much better than having this there. Like I said, it's 20, 30 minutes to crack a key. You could quite easily have a rogue app running for 30 minutes before you even notice that it's draining your battery or that it's there. How many people have run the process monitor on a Mac to see what's actually running as bare service-level processes that don't have any UI?

PJ:

And quite honestly, how many people would know what to look for? Because, I mean, there's so many things...

Rob:

Exactly,

PJ:

...that are running, and unless you're looking each of these things up online, you could easily have a rogue process there.

Rob:

Exactly. And it doesn't need root access - it's just any process. So as long as malicious software can launch a process with minimal user rights, it can crack the keys of other processes, and that's the big problem right there. And like I said, who knows what's meant to be there and what isn't supposed to be there. Now, this was reported to Apple last year, so a lot of the fixes are already in place. OpenSSL will get updated to have a different code path for these machines. But the problem is that Apple don't have any controls over this prefetcher. Intel have some controls, so you can turn it off and things like that. I think the ultimate fix is like the cacheability fix on memory pages. When we added memory managers and we added caches, there was always a need for "this memory cannot be cached," "you can't access this memory out of order," "this memory is a memory-mapped piece of hardware and reads and writes have to go in the order the processor issues them in." And because this problem was already known about when we added caches,

PJ:

right.

Rob:

...we added flags to the memory pages from day one that said this memory is not cacheable. And basically, if you set that flag, yeah, you don't get cache performance, but you do get in-order memory accesses, which is exactly what we need.

PJ:

Just to check - when we say "we," you're referring to Intel and AMD chips, right? Because...

Rob:

Intel, AMD, ARM...

PJ:

M1

Rob:

has flags on the cache pages.

PJ:

and M2 would as well, right?

Rob:

Oh, absolutely. Every processor since 1985 has had these flags, because there was always a need for memory that wasn't cached. That need existed before we added caches. So when we added the caches, it was like, you know what, this is not going to work for the VGA hardware, whatever it may be - things that existed back when we first added caches. We realized immediately we needed flags to switch the cache off, and it was done on a memory page basis. It can also be done globally. What we didn't see was these problems showing up, so we didn't add flags to the memory pages to switch off the prefetcher. And we should have. That is one of the better fixes: the processor runs as normal, you just can't prefetch in these pages, and you put all your keys and all your encryption in these pages. The architectural difference between regular prefetching and these pages is known and documented, and the performance difference between having the prefetch on and having it off is also known. So we can write code to take advantage of the situation it's in.

PJ:

And correct me if I'm wrong, the Intel chips have such a flag on the memory pages, correct?

Rob:

They do not have it on the memory page, no. They have it more globally, so you can turn it off. And I think the M3 has a flag that you can turn it off with too, but the Intel one you can access from - I believe it's a model-specific register - and you don't need the kernel to do it for you. You can just turn it off, run, and switch it back on again.

PJ:

Got

Rob:

The Apple one is a lot more complicated to switch on and off. It requires the kernel to help, et cetera, et cetera. And the earlier processors don't have that at all - you can't turn it off. At least as far as we know, there's no architectural way to turn it off. So I think that's the next thing: to have "this memory cannot be prefetched," just like "this memory cannot be cached" already exists. This needs to be added.

PJ:

So the M3 has a control, and that is hard to work with because, as you say, it requires the kernel to help. The M1 and the M2 do not. They are basically...

Rob:

as far as I know,

PJ:

is software

Rob:

...you can't turn it off. Well, I assume you can turn it off at some level, because none of them are on when you power up the processor - someone enables them. So I assume at some level it's possible, it just may not be practical to turn them off. Maybe you have to turn off memory management and the caching and everything to get it to go off. Maybe it's controlled by the same flags as the cache. I have no idea how it works at this level. But like I said, when it first booted, these things were not enabled,

PJ:

Right.

Rob:

...and someone enabled it per process, per core, whatever it may be, as the processor boots.

PJ:

So I think it will be useful to talk a little bit about - you've mentioned that the Intel chips also have data prefetching. I mean, everyone does. But why is this not a news story that is affecting Intel as well?

Rob:

The Intel one's not as aggressive. I believe the Apple one will look at a cache line, see an address, prefetch that, and then prefetch what's in that too, and basically do this whole recursive thing, where the Intel one only does one layer, which makes it very hard to attack. If you look at the paper, which you can find at gofetch.fail...

PJ:

We'll post a link.

Rob:

All the other information is in there. You'll see that they actually analyze access patterns - not just directly looking at what the DMP did, but looking at a whole access pattern across many accesses. And the Apple one, because it's more aggressive, reveals more about itself.

PJ:

So in terms of a speed difference then - let's say we had a fictitious knob where the M series could be matched to the Intel aggression, which is not as aggressive. What would be the percentage difference, or order of magnitude of percentage difference, in speed?

Rob:

Barely, barely any.

PJ:

Like what? 1 percent less?

Rob:

I have no idea. You'd have to simulate it, or have a chip where you could switch it off at various levels, to actually get real numbers. I'm guessing the more aggressive prefetcher maybe gives them a percent, two percent. It's not going to be much. The prefetcher existing at all only added a few percent over the static-analysis prefetching that the processor used to do, and only in certain scenarios too. If you're doing the predictable every-500-bytes thing, that can be prefetched much more easily without data dependency. The data-dependent cases are things like prefetching based on the data itself, which is, like I said, the linked list example - one where you'd need a data-dependent prefetcher because you don't know where you're going. And doing it one level deep is all you need to fetch that. If you start doing multiple levels, then you start to reveal a lot more secrets. But the whole DMP wasn't a major performance increase, it was just an incremental performance increase. Leaving the DMP there but taking away some of its aggressiveness wouldn't negate the point of the DMP being there - it still has benefit. So if it was 1 percent overall, I'd be amazed. But those are the numbers we're working with. They re-architect an entire chip to get 2 or 3 percent speed boosts. It's all these 1 and 2 percents on top of each other that give us the performance we have today. And it's all these 1 or 2 percents working together which reveal side channels.
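To make the distinction concrete, here's a small sketch of the two access patterns Rob contrasts: a strided walk, whose next address is predictable without looking at the data, and a pointer chase, where the next address only exists inside the node you just loaded - the case a data memory-dependent prefetcher is built for. The struct and sizes are our own illustration, not anything from Apple's or Intel's documentation.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;   /* looks like an address sitting in the cache line -
                            exactly what an aggressive DMP will try to chase */
    int payload;
};

/* Predictable: address of element i is base + i * stride.  A stride prefetcher
 * can run ahead without ever looking at the data values. */
static long sum_strided(const int *array, size_t count, size_t stride) {
    long total = 0;
    for (size_t i = 0; i < count; i += stride)
        total += array[i];
    return total;
}

/* Unpredictable: the next address is only known once the current node has been
 * loaded, so an ordinary prefetcher can't stay ahead of it. */
static long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        total += n->payload;
    return total;
}

int main(void) {
    enum { N = 1 << 16 };
    int *array = calloc(N, sizeof *array);
    struct node *nodes = calloc(N, sizeof *nodes);
    if (!array || !nodes) return 1;

    for (int i = 0; i < N - 1; i++) {
        nodes[i].next = &nodes[i + 1];
        nodes[i].payload = i;
        array[i] = i;
    }

    printf("%ld %ld\n", sum_strided(array, N, 16), sum_list(nodes));
    free(array);
    free(nodes);
    return 0;
}
```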

PJ:

So Rob, is there a macro symptom here - that because we've become collectively shittier programmers and have moved to more maps, sets, dictionaries, things that use this kind of incoherent memory, we've now designed chips basically to enable that more, rather than, hey, use a vector and use it right, and think about how to lay out your memory? Did we back ourselves into this problem?

Rob:

We have backed ourselves into this problem to some extent by having generic code for generic architectures. But it's all to enable hardware to be a black box, as software engineers like to look at it. If we go back to the early days - even not that early, even if you go back to the 2000s - well, let's go to the game consoles. Let's just look at these, because they're concrete examples. The PC has always been on this generic model where the hardware will just figure out what your code does, and you don't have to optimize your code, it'll do it for you. You can optimize for different things along the way. On the Pentium we used to optimize for the U and V pipes in execution, and then the Pentium Pro went to out-of-order execution, so then you had to optimize for decode bandwidth, because once it's decoded into micro-ops, it will execute them out of order and you have no control over it. There are some optimizations you can do at a high level, which is what the compilers do, to encourage the reordering and out-of-order execution to execute efficiently. But it's a very different optimization to what you were doing previously - counting cycles and U/V pipe scheduling and things like that. And this was true on ARM, this was true on everything, until they go out of order. When they go out of order, optimizing becomes real hard. You kind of optimize for register pressure, you optimize for decoder bandwidth and things like that. ARM and Intel today are no different. But that's the PC path, and all we're doing now is trying to get the most out of it. We can prefetch predictable access patterns, we do data-dependent prefetching to find unpredictable access patterns. It's just the next step along the way, and it just allows code to be code and just work. If we go back to the PlayStation 2, for example, it had a scratchpad - a very fast, cache-performance-level piece of memory. You could DMA into it, you could DMA out of it, and you would access it with the CPU. So you could preload this thing with CPU instructions, you could preload it with DMA, and then get very fast performance. To write real fast engine code on the PlayStation 2, it was critical that you used the scratchpad, and at Insomniac we ran the entire engine out of scratchpad. It was only like 16K, but we'd DMA in - like, we have this list of things that need to be culled. Rather than go through them in memory and let them load into the cache in the natural way - and this was early days, so it was basically on-demand caching, there was no prefetch - rather than do that, we'd DMA 4K, 8K or something into the scratchpad and process it from the scratchpad. And we'd double buffer: while we're processing buffer A, we DMA into buffer B, and then we'd switch between them. We never missed the cache, because we always had the next buffer available when we got there. And then we'd size it based on: how much work are we doing? How quickly can we go through the list? How big is the list in total? How many working buffers do we need? If you need an input and an output and a process buffer, then obviously you have to make them smaller, because they all have to fit in this 16K. And you can work out data structures for any scenario which fit in this model.
Linked lists are still really difficult, because you can't DMA the next node until you've seen the current node. But you can put the whole list in a small block, then move the whole thing into scratchpad and work it that way. It requires programmers to rethink all of the classical data structures: how can we run this from scratchpad? And then you get to the PlayStation 3, which had the SPUs. A lot of what we'd learned on the PS2 already applied to the PS3, because we already had the mindset of making these data structures which could be DMA'd. The PS3 was slightly more difficult in that it wasn't the same processor - on the PS2 it was the EE core itself that was accessing the scratchpad. You were just doing DMA, effectively pre-filling a cache in a way that made sense to you. So you controlled how you filled the scratchpad, versus the cache hardware controlling how it filled the cache. And you could basically eliminate all stalls, because we could make data that fits. Data structures were now designed to work in this model: I know what I'm accessing, where and when, so I'll prefetch my data. The PC has always relied on the fact that it'll just figure it out, and how well it figures it out is based on how well you write the code. So today you'll get more performance from a PC by semi-optimizing your prefetch bandwidth and semi-optimizing your access patterns, but you don't have to take it to the extreme that the early machines did. And yeah, I do agree, we have become shittier programmers, because we just write generic code for a generic architecture, and Intel and ARM are both in the same boat. They're both the same thing - RISC versus CISC has been gone for years. Anybody who argues, oh, it's a RISC chip, so it's better than an Intel chip, is an idiot. The only real difference between the two chips is the decoder. Everything behind that is all state-of-the-art out-of-order architecture.
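Here's a rough sketch of the double-buffered scratchpad loop Rob describes. The dma_start/dma_wait calls and the 16K scratchpad are stand-ins of our own, stubbed with memcpy so the sketch actually runs on a host; on the real hardware the copy would run asynchronously, which is what lets the CPU process one half of the scratchpad while the DMA engine fills the other.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SCRATCHPAD_SIZE (16 * 1024)            /* PS2-style scratchpad */
#define CHUNK (SCRATCHPAD_SIZE / 2)            /* two working buffers */

static uint8_t scratchpad[SCRATCHPAD_SIZE];

/* Hypothetical DMA engine, stubbed so this compiles and runs on a host.
 * On the real hardware dma_start would kick off an asynchronous copy and
 * dma_wait would block until that copy finished. */
static void dma_start(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void *dst) { (void)dst; }

/* Stand-in for the real per-object culling work. */
static long process_chunk(const uint8_t *chunk, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += chunk[i];
    return sum;
}

static long process_all(const uint8_t *list, size_t total) {
    uint8_t *buf[2] = { scratchpad, scratchpad + CHUNK };
    int cur = 0;
    size_t offset = 0;
    size_t pending = total < CHUNK ? total : CHUNK;
    long result = 0;

    dma_start(buf[cur], list, pending);        /* prime buffer 0 before we need it */

    while (pending > 0) {
        size_t next_off = offset + pending;
        size_t next_len = 0;

        if (next_off < total) {                /* start filling the *other* buffer now */
            next_len = total - next_off;
            if (next_len > CHUNK) next_len = CHUNK;
            dma_start(buf[cur ^ 1], list + next_off, next_len);
        }

        dma_wait(buf[cur]);                    /* this chunk is now in fast memory */
        result += process_chunk(buf[cur], pending);

        offset = next_off;
        pending = next_len;
        cur ^= 1;                              /* swap working buffers */
    }
    return result;
}

int main(void) {
    enum { N = 100000 };
    static uint8_t data[N];
    for (int i = 0; i < N; i++) data[i] = (uint8_t)i;
    printf("%ld\n", process_all(data, N));
    return 0;
}
```

The key property is that the CPU never touches slow main memory inside the processing loop: by the time it needs a chunk, the copy into fast memory has already been started, which is exactly the stall-free behavior Rob describes.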

PJ:

Let's double click a little bit more on the Intel side here. So why didn't Intel fall to the same issues? We've talked about it being less aggressive, but I think there's a little

Rob:

That's really it. Being less aggressive makes it statistically more difficult to detect those patterns, because, like I said, you're statistically analyzing what the prefetcher does. If you're less aggressive, it's still revealing data - it's still filling the cache - it's just maybe impossible to analyze, or it needs more aggressive examples to be able to analyze.

PJ:

The signal to noise ratio is unfavorable in that case.

Rob:

Yeah. So they're not saying it can't be done. They're just saying that this code can't do it.

PJ:

Right. You mentioned to me an interesting fact, which is that Intel will publish what they're going to do way ahead of time, in a way that almost invites folks to critique their designs. And that seemed really interesting to me - an interesting difference: hey, come take a look at it, security folks, see what we could be doing wrong, and we'll take some feedback on these things.

Rob:

Absolutely. Intel have always been really good at doing that. They don't always publish up front what they're going to do, obviously for competitive reasons, but they do to certain people, and they also have lots of white papers on the general idea of what a data-dependent prefetcher is.

PJ:

Right.

Rob:

DMPs, data memory-dependent prefetchers - they go by various names. They have a white paper on how they work and what they do, and then people will comment on the white paper and say, oh, this could be a big problem, or it's violating this or violating that. So in theory, this is how it works, and the implementation will be different from the initial white paper due to the feedback. And Intel have always had a lot of bells and whistles to switch things on and off, so I'm not surprised that Intel could turn those off a lot more easily. They also publish white papers afterwards on the effect it had - you can see exactly how it works, the performance increase, whether it got the performance increase it expected. Apple don't do any of this.

PJ:

Yeah,

Rob:

We don't even know how Apple's DMP works. We have no idea - nobody's been in there. We assumed it was there, because they don't tell us anything. It's like, oh, it's better - better by whose standards?

PJ:

There's that culture of wanting to create a big splash, and it's a secrecy culture, right?

Rob:

Yeah. The secrecy culture at this level, for security, cannot exist. They need to be more open about what they're doing and not rely on people basically side-channel attacking it to see what it's doing. And this is the same for things like: how big are the reorder buffers in the Apple chips? They never tell you. How many renaming registers does it have? They don't tell you. People figure it out by writing creative sequences of code and seeing how the performance changes. Again, it's a side-channel attack, not a direct measure. It's like, if I write code in this really academic, very inefficient way, but I can see some measurable difference for when it runs out of internal renaming registers, then I can start to figure out how many of those registers there are. Apple could just tell us - we can figure it out anyway. So they're not saving themselves anything. All they're doing is delaying people, and people figure these things out eventually anyway. So...
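For a flavor of the "creative sequences of code" approach, here's a toy microbenchmark of our own: sum the same array with one dependency chain versus four independent chains and compare the timings. This particular example mostly exposes add latency versus throughput rather than rename-register counts, and real reverse-engineering efforts use far more careful instruction sequences, but the method - vary the code shape, measure, look for the knee - is the one Rob is describing.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* One long serial dependency chain: each add waits for the previous one. */
static double sum1(const double *a) {
    double s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

/* Four independent chains: the core can keep several adds in flight at once. */
static double sum4(const double *a) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

static void bench(const char *name, double (*fn)(const double *), const double *a) {
    double best = 1e30, sink = 0;
    for (int rep = 0; rep < 20; rep++) {
        double t0 = now();
        sink += fn(a);
        double dt = now() - t0;
        if (dt < best) best = dt;
    }
    printf("%s: %.3f ms (checksum %.1f)\n", name, best * 1e3, sink);
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = 1.0;
    bench("1 chain ", sum1, a);
    bench("4 chains", sum4, a);
    free(a);
    return 0;
}
```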

PJ:

Right.

Rob:

They just need to come clean. I've said it before, I'll say it again, I'll always say it: Apple suck in that they don't document anything.

PJ:

I'm curious how much of this you think is really the culture and the hubris. And this isn't just Apple - this is a big tech problem.

Rob:

Oh, it's the culture, full stop. It's the culture in Apple, full stop. It's like, oh, we'll add one of these, we'll do this, we'll do that - and the people in Apple probably don't know what they're doing, never mind anybody external. External security is last on the list of people Apple want to talk to.

PJ:

It's fascinating, because obviously a lot of the appeal of Apple is the vertical integration, the "hey, we're secure," right? It's this closed system. But it's a really interesting example of how this closed system actually creates these security issues. Which is driven by the secrecy, and it's driven by, probably, managers who want to have it all: oh, I'm 1 percent faster than Intel - ah, that'll get me my promotion next year.

Rob:

Definitely some of that there, for sure. But security at any level cannot be done through secrecy. And if there's any effect on security - how it executes, what it executes - it has to be fully understood before it's released. And that goes to things like leaking speculative data, things where you think, oh well, I didn't use this code path, so it doesn't matter. That was the mindset of the original speculation: we'll just throw that work away. But there are some visible side effects which can reveal secrets if used correctly, or incorrectly. And that's the problem. I mean, one option could be: you prefetch everything into a second cache that's just for prefetching - you have no other access to it, all it does is take prefetches - and then if you use the data, it transfers into the real cache, so that can be done real quick. That would solve a lot of prefetching problems, of speculation attack problems.

PJ:

That's a hardware solution.

Rob:

That's a hardware fix. And it's cache that you could otherwise be using as general cache, which would make everything slightly faster. But now you've got this huge cache which is just sitting there as a prefetch buffer, which you have no direct access to. Things like that are architectural differences from what we currently do which fix some problems - and they may also reveal other problems. Is it worth it? Is it worth having another one-megabyte prefetch buffer where we could have a two-megabyte cache and everything gets faster? We'll have some of these visible side channels no matter what you do. Beyond directly executing instructions exactly as they are, it's always going to have something that could be attacked.

PJ:

But again, the difference with the Intel example is that whether or not those statistics rise to a level of detection depends upon how you actually approach this. So, I mean, Intel gets away with it by virtue of the fact that it's not as aggressive.

Rob:

Yeah. By being less aggressive, they avoided some of these problems. And it's a case of - when they first found Spectre, they found it on, I think... and then it was, oh, Intel have the same problem, oh, ARM have the same problem, because everybody was doing the same speculation with the same side effects. And then it's like, oh, you know, all modern processors are affected by Meltdown and Spectre. A lot of different fixes were put in for Spectre mitigation - changing the way speculation worked - but again, that was new hardware, not old hardware, which still has the same problems. You could change software to mitigate it; you couldn't prevent it, but you could mitigate it. And we kind of moved on from it. I think this will be the same thing. New hardware won't have these problems - it'll either have controls or it'll have less aggression - and software will get changed to make it so it's statistically next to impossible to get the keys, and we'll move on. It'll just be a blip in history. But it could have been prevented, to some extent, if Apple had been more open as to how their system works. We still don't know how it works - I mean, we haven't reverse engineered it. We don't know that Intel aren't doing what Apple's doing, and we don't know why Intel is immune to this particular attack but Apple isn't. We don't have any architectural details as to what they're actually doing, and we don't know what the trigger is. We assume it's Intel being less aggressive and not doing dependent lookups, but we don't know. All we know is, statistically, we can't get the data out of Intel, but we can get it out of Apple. It'd be nice if we could have an architectural review of both and see what the trigger is: okay, this one tiny, probably insignificant feature is what's making Intel safe and Apple not. We don't know, because we'd have to reverse engineer how they both work.

PJ:

From a cultural standpoint, is Apple an outlier, or is it like everyone else? Meaning, we've talked about how it's different from Intel, but, you know, NVIDIA, AMD, ARM - do they tend to be more like Intel, in terms of "here's how it works," or more like Apple, in terms of "no, trust us," with it all closed off, or somewhere in between?

Rob:

AMD are usually the most open of everybody - the fact that they have open-source video drivers, they've started to open source the firmware inside the GPU, they've started to open source lots of little bits that you would normally not have access to. I think AMD's ultimate goal is to have everything open source from the first instruction: everything running could be open source, and that includes board management and firmware and BIOS and everything. So AMD have traditionally always been a lot better. With Intel, if you sign the correct NDAs and you're a platform integrator or a system vendor, then you get that same level of access, but not open source.

PJ:

Hmm.

Rob:

They do give you that access, just not in an open-source way. AMD have the same NDAs too, if you need access to stuff that's not open source, but AMD are going down the path of "we'll just open source all of it in the end." There are legal issues, legal reviews and things like that, which I assume are what stop them just going, fuck it, open source everything. AMD are traditionally a lot better. Intel do give you the access with the correct NDAs and licensing. Apple: zero.

PJ:

Got it. And the flaw is present on the iPhone, but there's no attack vector, because you can't get the same access - that we know of - correct?

Rob:

Potentially - it's definitely there. I mean, the M series and the A series are kind of the same cores. The packaging of the SoC is different, speeds and things like that are different, the sizes of caches are different, but architecturally they're the same things - the Firestorm and Icestorm, high-performance and low-performance, cores are the same between the M's and the A's. So yes, this problem does exist on the iPhone. It's debatable whether the attack vector exists, because it's hard to run background processes on...

PJ:

On iOS,

Rob:

...an iPhone, yeah.

PJ:

yeah.

Rob:

On the iPhone, it's hard to just know what it's doing at any given time - is it running right now, is it not running right now? Whereas on OS X, it's a lot easier to do that. So if you can get the apps to run at the same time, the same problem exists on iPhone. It's just that the vector is a lot more complicated, due to the user interface and the backgrounding of tasks and things like that. But in theory, it does exist.

PJ:

Got it. Well, folks, I think that's a good lesson. Watch out for what you're downloading, and recognize that issues exist. This is just a part of computing - it's always been a part of computing - and it's a balance back and forth between performance, security, and, you know, in many ways, the hubris side of things.

Rob:

And I will also throw out that the people who found the problem did do responsible disclosure to Apple. They told Apple about this in December of '23, 107 days before it was publicly released. So Apple did have time to start implementing mitigations, and by the time this hit the public, a lot of this was already mitigated. Not fixed...

PJ:

But mitigated.

Rob:

There are definitely still attack vectors, but some of them will have been removed. The most common uses of the attack vector will have been mitigated. So it's not as dangerous as it sounds, but it's still a big problem.

PJ:

Hopefully by the M4, we'll have it all fixed, right?

Rob:

Well, I assume they're going back now and redesigning a lot of the M4 to fix this problem, or at least to have memory page controls, on/off controls, constant-time execution, whatever it may be that they're going to have to add to get some of this to work. And the M4 might be too close - it might be the M5 that's going to have changes, because the M4 would have already been laid out by now.

PJ:

That's going to be an interesting question, I think. Do you think, Rob, this is actually enough to either delay the M4, or prompt a change in culture to be, if not completely open, more open - maybe on the level of Intel's NDAs?

Rob:

Never going to happen. Never going to happen. Apple are never going to be open. It's just not in their culture. They just don't know how to do it.

PJ:

I find it fascinating, maybe on a long-term scale then, because it really is almost a conflict in my mind between this desire for vertical integration that Apple has been going after forever, and them touting themselves to be the most superior in security. It feels like one of those two things is going to have to... oh, plus the secrecy. It seems like one of those three things is going to have to break, and in this case, it broke on the security side.

Rob:

Yeah. I just wish they were more open. More analysis and more investigation into what they're doing would put people's minds at ease. Because we know a lot of this "good security" comes from the fact that they don't document anything, and it's not any more secure than anybody else once you dive into the details. It's just that getting access is difficult, which is a level of security - but it shouldn't be the ultimate level of security.

PJ:

The old security by obscurity line.

Rob:

Yep. Don't do it. I mean, it helps, but don't rely on it.