Apple: M1 vs. M2

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,317
Reaction score
8,498
This is a case of having to separate the personality from the information. Tom constantly blows his own trombone, and it makes him look like a jackass.

I disagree. From my experience, his sources are solid, and performance expectations are reliable. He counterbalances RedGamingTech, who has a pleasant host but tosses out every figure he hears. I'd rather have a beer with Paul, but get my tech rumors from Tom.

Speaking of which, according to the videos you won't be watching, Arrow Lake comes after Raptor Lake and Meteor Lake. That's the first Intel arch that Jim Keller evidently worked on. I realize that he's a brilliant man, but I think tech nerds have given him godlike status. I would note that he apparently left Intel earlier than expected, so perhaps everything wasn't so sunny during his tenure there.

I worked with him. I don’t know if he is brilliant. He architected the HyperTransport bus on K8 (Opteron). I don’t recall him working on anything else on that chip, though I could be misremembering. There were other folks who I remember were architecting things, including our CTO (who really had the idea for the overall thing and was definitely brilliant). I did the initial work on the integer and FP execution stuff, but turned over the architecture part of that to Ramsey Haddad when he decided to stick with us (or maybe he left and came back - it’s a little fuzzy to me after all these years).

Anyway, I’ve worked with many brilliant people over the years, but I don’t know that any of their names are well known. It helps, I guess, when you jump from company to company every couple of years.
 

Cmaier

Oh, one more note. Tom's information was bang-on for RDNA2. That's because he later revealed his source to be @Cmaier's old colleague, Rick Bergman, who just happens to be AMD's VP of Computing and Graphics. So yeah, Tom's an arrogant guy, but has quality sources.

I worked with Rick at both Exponential and AMD, as it turns out. We knew each other, but actually I had very little interaction with him - he wasn’t really someone who was hands on with the design side, as far as I know. I think he was more a strategy guy, but I honestly don’t know what he was responsible for.
 

Colstan

Site Champ
Posts
822
Reaction score
1,124
I worked with him. I don‘t know if he is brilliant. He architected the hypertransport bus on K8 (opteron).
That does seem to be Keller's first claim to fame, at least when his name started showing up in the tech press. He's developed a cult-like status among a certain subset of tech nerds, so as an outsider, it's hard to discern what he can and cannot truly be credited for. It's good to have your take on it. In my mind, the most notable thing about Keller is how he can't seem to stay in one place for more than a few years.
 

Yoused

up
Posts
5,617
Reaction score
8,928
Location
knee deep in the road apples of the 4 horsemen
Taken to its extreme, you could imagine that instead of separate cores, you just have a sea of ALUs, and any thread can just be dispatched to the next available ALU. On paper, where we have massless frictionless pulleys and perfectly spherical ball bearings, that may very well be the most efficient architecture.

POWER10 has 15 cores running SMT8, meaning one processor can push 120 threads at a time. There is some sort of scheduler that dispatches work to the I/M/F/V array (however many units of each kind it may be composed of) based on, I have no idea, logic I guess. When you have a structure that big, "core" starts to be a non-meaningful way to describe it.

The juice number I saw somewhere was 800W, but I am not sure whether that was for a single processor board or a dual processor board (or some other figure that meant something else). If it was for a single, that would put its net watts per thread (if that has any meaning) at something on the order of two thirds that of Alder Lake – however, it is on 7nm. One article claimed that a farm replaced 126 Intel servers with two POWER10 units.
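
The per-thread arithmetic above can be sketched roughly as follows. The 800 W input is the unverified figure from the post; the Alder Lake comparison point (a 24-thread desktop part drawing 241 W) is an assumption added purely for illustration, not something stated in the thread.

```python
# Back-of-the-envelope watts per thread. Every input is an assumption,
# not a measurement.
POWER10_CORES = 15
SMT_WAYS = 8
POWER10_WATTS = 800.0      # unverified figure quoted in the post

# Assumed comparison point (not from the thread): 24 threads at 241 W.
ALDER_LAKE_THREADS = 24
ALDER_LAKE_WATTS = 241.0


def watts_per_thread(watts, threads):
    return watts / threads


power10_threads = POWER10_CORES * SMT_WAYS
p10 = watts_per_thread(POWER10_WATTS, power10_threads)
adl = watts_per_thread(ALDER_LAKE_WATTS, ALDER_LAKE_THREADS)
ratio = p10 / adl

print(f"POWER10: {power10_threads} threads, {p10:.2f} W/thread")
print(f"Alder Lake (assumed): {adl:.2f} W/thread, ratio {ratio:.2f}")
```

Under those assumptions the ratio lands near two thirds, but it moves a lot if the 800 W figure turns out to cover two processors.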

So maybe there is a place for SMT, or an implementation that is less fraught than Intel's design. At the consumer level, though, it looks to me like wide-issue OoE is likely to do a better job, where it can be used (x86 probably not so much).
 

Colstan

Looks like Apple Silicon has another vulnerability, @Cmaier's favorite, namely side-channel attacks. The M1 is under attack by PacMan.

Apple's M1 chip was the first commercially available processor to feature ARM-based pointer authentication. However, the MIT team has discovered a method leveraging speculative execution techniques to bypass pointer authentication.
 

Yoused

“We want to thank the researchers for their collaboration as this proof of concept advances our understanding of these techniques,” Apple said. “Based on our analysis as well as the details shared with us by the researchers, we have concluded this issue does not pose an immediate risk to our users and is insufficient to bypass operating system security protections on its own.”
 

theorist9

Site Champ
Posts
613
Reaction score
563
M2 die area analysis from Dylan Patel at SemiAnalysis.com, based on annotated die shots and area measurements generated by Locuza.
Edit: Setting aside the commentary about Apple at the beginning, any thoughts about the technical portion of the article?
[NB: The "tree" seen at middle right on the M2 die is just SemiAnalysis's logo.]



Screen Shot 2022-06-10 at 10.30.48 PM.jpg


Screen Shot 2022-06-10 at 10.30.33 PM.jpg
 

Cmaier

M2 die area analysis from Dylan Patel at SemiAnalysis.com, based on annotated die shots and area measurements generated by Locuza. Thoughts?





I read this, and there’s some weird premises at the start. Nonsense about disappointing performance and Apple falling behind because folks left for Nuvia. Quite a lot of garbage in the opening paragraphs.
 

theorist9

I read this, and there’s some weird premises at the start. Nonsense about disappointing performance and Apple falling behind because folks left for Nuvia. Quite a lot of garbage in the opening paragraphs.
Yeah, I get that. But I was wondering more about the technical portions of the article.
 

leman

Site Champ
Posts
636
Reaction score
1,184
I think the most interesting bit of the article is the alleged increase in manufacturing costs. I tend to believe it. Explains a lot.
 

Cmaier

Yeah, I get that. But I was wondering more about the technical portions of the article.

Not much in there other than floorplan and some size stuff, which is probably right. The allegation that Apple cropped to hide something? I dunno. The cost? Apple probably pays TSMC per wafer start, so assuming identical yield, bigger dies cost more if you fit fewer of them per wafer. Whether there are fewer per wafer, I don’t know (you can’t be sure unless you know what else is on the wafer, how the die are packed into each reticle, etc. For all we know there was some blank space on the old wafers because before you could fit 5.9 chips in the reticle and now you can fit only 5.1, and neither .1 nor .9 chips do you any good.) Or maybe Apple did something to increase yield. (Most likely the cost did go up, but as an engineer I don’t have enough information to draw breathless conclusions for sensationalistic blog posts).
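
The 5.9-versus-5.1 point can be put as a small sketch: only whole dies count, so a larger die does not necessarily mean fewer dies per reticle. The wafer cost, reticle count and yield below are hypothetical placeholders, not figures from the article or from Apple.

```python
import math


def usable_dies(fractional_fit):
    """Only whole dies are usable; the fractional remainder is wasted area."""
    return math.floor(fractional_fit)


def cost_per_good_die(wafer_cost, reticles_per_wafer, fractional_fit, yield_rate):
    good_dies = reticles_per_wafer * usable_dies(fractional_fit) * yield_rate
    return wafer_cost / good_dies


# Hypothetical inputs, identical for both generations except the die fit.
WAFER_COST = 17000.0
RETICLES_PER_WAFER = 60
YIELD = 0.9

old = cost_per_good_die(WAFER_COST, RETICLES_PER_WAFER, 5.9, YIELD)
new = cost_per_good_die(WAFER_COST, RETICLES_PER_WAFER, 5.1, YIELD)

print(usable_dies(5.9), usable_dies(5.1))   # both 5
print(old == new)                           # True: same whole-die count
```

With these made-up numbers the cost per good die is unchanged; the conclusion flips as soon as the larger die drops the whole-die count or the yield differs, which is exactly the information an outsider does not have.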
 

casperes1996

Power User
Posts
185
Reaction score
171
Looks like Apple Silicon has another vulnerability, @Cmaier's favorite, namely side-channel attacks. The M1 is under attack by PacMan.
I think it's worth noting that this is a bypass of a security feature rather than an exploit in itself. Having PAC is still more secure than not having PAC. This is one pretty involved way of bypassing it, but it relies on being able to overwrite a pointer in memory that will be jumped to in order to exploit anything anyway. And if you had such a vulnerability you might even be able to exploit it without the Pacman thing, potentially with return oriented programming. PAC makes it substantially harder to exploit memory safety problems but, as this demonstrates, not impossible. But harder is still better, though ideally we just don't write null-terminated user input into random fixed size buffers hoping for the best :p
This Pacman thing is not itself something that can be used to exploit M1 based Macs. It can just aid in circumventing a safety mechanism preventing other exploits from potentially being possible in potentially buggy software. I don't think it's that big a deal. Intel chips without pointer authentication are, if we consider this a vulnerability, more vulnerable, as no side-channel fiddling is needed to mess with a pointer and jump to it there.
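
For anyone unfamiliar with the idea, here is a toy model of what pointer authentication does. It is only a sketch: real hardware uses a dedicated cipher and per-process keys held in system registers, whereas this stand-in uses HMAC-SHA256 truncated to 16 bits, and the key, bit widths and function names are all made up for illustration.

```python
import hashlib
import hmac

PAC_BITS = 16                      # assumed width of the signature field
ADDR_BITS = 48                     # assumed width of the real address
ADDR_MASK = (1 << ADDR_BITS) - 1
KEY = b"hypothetical-per-process-key"


def _pac(addr, context):
    """Keyed 16-bit tag over (address, context). HMAC is a stand-in here."""
    msg = addr.to_bytes(8, "little") + context.to_bytes(8, "little")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")


def sign(addr, context):
    """Store the tag in the otherwise unused upper bits of the pointer."""
    return (_pac(addr, context) << ADDR_BITS) | (addr & ADDR_MASK)


def authenticate(signed_ptr, context):
    """Recompute the tag before use; a mismatch means the pointer was altered."""
    addr = signed_ptr & ADDR_MASK
    if (signed_ptr >> ADDR_BITS) != _pac(addr, context):
        raise ValueError("pointer authentication failed")
    return addr


signed = sign(0x1000, context=7)
print(hex(authenticate(signed, context=7)))    # 0x1000

wrong_guess = signed ^ (1 << ADDR_BITS)        # same address, wrong tag
try:
    authenticate(wrong_guess, context=7)
except ValueError as err:
    print(err)
```

In this model a wrong guess raises immediately, which stands in for the crash a real attacker would normally trigger. As the article describes it, the PacMan trick is to test guesses speculatively so that wrong ones do not crash; it still needs a separate memory-corruption bug to overwrite the pointer in the first place.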
 

Colstan

I think it's worth noting that this is a bypass of a security feature rather than an exploit in itself.
Thank you for the explanation, much appreciated. As @Yoused and @Cmaier have explained in other threads, the P-cores inside Apple Silicon wouldn't benefit from hyper-threading, while the E-cores might to some degree, but implementing it may not be worth the bother, including security concerns. As we all know, there have been numerous headline exploits for x86 where HT was involved. I take it that, because Apple Silicon doesn't feature SMT, that makes it more difficult to exploit those types of side-channel attacks? Also, to your knowledge, are there any elements to Apple's design that make it inherently more secure than x86 in general?

Of course, I'd like to hear any thoughts from other knowledgeable folks here, and again, thank you for your time and explanations, @casperes1996, much appreciated.
 

Cmaier

Well, from a system architecture perspective, it seems Apple’s design is more secure than a typical x86. Apple’s secure store, for example, appears to be better locked up than what goes on in most x86 systems, and there are things that Apple can do with the boot sequence because it controls everything that other vendors aren’t doing.

Not having SMT gets rid of one side channel attack, but there are lots of other attacks that may still be possible. Even differential power attacks, where you can monitor tiny variations of power consumption over time, may be an issue. Apple may or may not have done things to try to prevent that (for example, in the secure store, it’s possible that the logic is differential so that the power consumption is constant, or that there is a noise generator like an LFSR to drown out noise generated by key calculations). But I’d bet that, as far as side channel attacks go, the number of possible attacks is of the same order of magnitude for M* chips and x86 chips. There are just too many ways to unintentionally leak information. What you try to do is limit the usefulness of the attacks in various ways.
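
For reference, an LFSR is just a shift register with XOR feedback that cycles through a long pseudo-random sequence. The sketch below is the common textbook 16-bit Fibonacci LFSR (taps at bits 16, 14, 13 and 11); it says nothing about what Apple actually does, which, as noted above, is not known.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def period(seed=0xACE1):
    """Count steps until the register returns to its seed value."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


def noise_bits(seed, count):
    """Collect the low bit of each successive state as a noise stream."""
    bits, state = [], seed
    for _ in range(count):
        state = lfsr16_step(state)
        bits.append(state & 1)
    return bits


print(period())                 # visits every nonzero 16-bit state once
print(noise_bits(0xACE1, 16))
```

In hardware this costs a handful of flip-flops and XOR gates, which is why it is a plausible way to add switching activity that is unrelated to the key being processed.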
 

casperes1996

Hm. I mean you can tackle that question in several ways. The ISA level and the chip level. Side channel attacks don't tend to be at an ISA level, so if we're talking side channel specifically it really would never be an x86 vs. Apple Silicon (ARM) thing, but more of a "this specific chip architecture versus that one". Like how some Intel chips have various degrees of hardware mitigation to deal with Meltdown but not a new ISA. I don't necessarily know the extent to which SMT plays a role here. Having two threads share a core might open up more opportunities, but I would think a bigger part is generally just shared cache space, and it would be pretty expensive to invalidate all caches and the TLB on every context switch. I don't think either Spectre or Meltdown rely on SMT to work, at least not exclusively. PortSmash does, so you can definitely exploit threads sharing a core like that in some situations, but it's by no means the sole vector for side-channel attacks - frankly I would be surprised if any chip on the market today had no exploitable side-channel attack vector at all. Sometimes more exploitable than other times though. If you're securing highly confidential servers it's worth bearing in mind, but for more common computing it's the least significant attack vector to worry about, I would say. There are often easier ways to obtain whatever it is an attacker might want to obtain.

As for specific security properties in chips, all sorts of hardware security mechanisms have been implemented throughout the years on both ends of this. There are things like the shadow stack, which is now a feature of AMD's Ryzen Pro and EPYC chips - I'm not sure if Apple has a shadow stack actually. It effectively gives similar security guarantees to the PAC system but with a different mechanism (so Apple probably doesn't make use of it): the return address on your stack frame has a duplicate on a "shadow stack", and the two are compared before executing the return, in case an attacker found a way to manipulate the return address. In a similar vein there are execute permission bits for memory pages; on Intel this comes in the form of NX bits in the page tables. This can mark memory pages as "non-executable", so if code tries to jump to an address that resolves to that page it won't be allowed to jump there. Apple also has such a feature in their chips and enforces it strongly, to the degree that memory marked executable cannot simultaneously be writeable. You either have read/execute or write/(read) - you can't have a page be both executable and writable. For something like a browser's JIT compiler it will then have to rapidly change a page to be writeable, write the instructions in, change it to executable and jump to it. But this is also a security mechanism to prevent attackers from potentially overwriting a heap buffer with some code and managing to jump into it. - Again you might be able to work around this in a return oriented programming fashion if you can find a gadget (executable memory to jump to) that changes a page's permissions, but it's significantly harder to exploit than without the feature, especially with address randomisation and such.
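
A toy model of the shadow stack idea, just to make the comparison step concrete. The class and method names are invented for this sketch; real implementations keep the shadow copy in memory that ordinary stores cannot reach and do the check in hardware on the return instruction.

```python
class ShadowStackCPU:
    """Minimal model: every call records the return address twice."""

    def __init__(self):
        self.stack = []    # ordinary stack, reachable by buggy code
        self.shadow = []   # protected copy, not writable by normal stores

    def call(self, return_addr):
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self):
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise RuntimeError("control-flow violation: return address mismatch")
        return addr


cpu = ShadowStackCPU()
cpu.call(0x401000)
print(hex(cpu.ret()))              # normal return goes through

cpu.call(0x401000)
cpu.stack[-1] = 0xDEADBEEF         # simulated buffer overflow hits the stack
try:
    cpu.ret()
except RuntimeError as err:
    print(err)
```

The overwrite only reaches the ordinary stack, so the mismatch is caught at return time. PAC gets a similar result by signing the one copy instead of keeping a second one.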
Then you have "secure enclaves". Of course Apple has their "Secure Enclave", but there are more or less equivalent ideas from the other vendors too. Intel had SGX 'secure enclaves' (stands for Software Guard Extensions) in their chips. I say had because they removed it from their consumer chips in 12th gen after several vulnerabilities were found, but this actually also means they can't play Ultra HD Blu-rays anymore because the Blu-ray licensers wound up requiring it. In an again somewhat equivalent manner, AMD has its Platform Security Processor, which is actually an ARM core with TrustZone acting as a secure enclave for their systems.

I think it's hard to impossible to evaluate all the security mechanisms in all the chips against each other because there's so much to consider as well as so much left to explore and discover by the security community as a whole. And some stuff is also rather new still. But keeping an eye on the CVEs and counting them can definitely give a heuristic, though that will also come down a bit to what is most targeted by security research.

Plus, the CPU/SoC isn't the only chip in a system. - Well I guess with SoC it can get close as the name suggests, hehe. But I remember not too long ago there was a vulnerability in the firmware of a bunch of Broadcom wireless chips that was also used to get kernel level arbitrary code execution on a bunch of devices.

One of my friends worked on ARM specific cryptography for a while where they focused particularly on the opportunity for timing based attacks as well, and hand-crafted assembly trying to ensure that all possible branches had the same amount of work in them, and where possible that things were branchless etc.
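
The "same amount of work on every branch" idea can be sketched like this. Python itself gives no timing guarantees, so this only illustrates the structure that hand-written assembly would aim for; the function names are made up, and for real code the standard library's `hmac.compare_digest` is the thing to reach for.

```python
import hmac


def leaky_equal(a, b):
    """Early-exit compare: run time depends on where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def constant_time_equal(a, b):
    """Accumulate differences; every byte is touched wherever they differ."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0


def select(bit, a, b):
    """Branchless pick of a (bit=1) or b (bit=0) for 32-bit values."""
    mask = -bit & 0xFFFFFFFF
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)


print(constant_time_equal(b"secret", b"secret"))
print(select(1, 5, 9), select(0, 5, 9))
print(hmac.compare_digest(b"secret", b"secreT"))
```

`leaky_equal` returns sooner the earlier the inputs differ, which is the kind of data-dependent timing an attacker can measure; the other two do the same work regardless of the data.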

I've rambled a bit at this point but my TLDR is that there's a lot to it, I don't think there's one clear cut answer and a lot of it is "time will tell" because there's a lot of complexity inherent in the problem, and I think software and especially humans are a more critical attack vector in most cases than side channels. At least for the foreseeable future :p
 

Cmaier

Hm. I mean you can tackle that question in several ways. The ISA level and the chip level. Side channel attacks don't tend to be at an ISA level so if we're talking side channel specifically it really would never be an x86 vs. Apple Silicon (ARM) thing, but more of a "this specific chip architecture versus that one". Like how some Intel chips have various degrees of hardware mitigation to deal with Meltdown but not a new ISA. I don't necessarily know the extent to which SMT plays a role here.

The reason SMT side channels keep coming up is that the shared hardware opens up opportunities. Thread A performs some operations and leaves some register set to some value. Thread B comes along and, depending on what the value was in Thread A, may or may not have to clear the register, or may or may not have to update a memory page table, or may or may not have to write a value to memory before thread B can proceed. This changes the number of CPU clock ticks before something happens in thread B. By doing this millions of times, eventually thread B can figure out something that’s going on in Thread A.

The way around it is to always clear out all registers (every storage element) before switching threads, and to make sure that every thread change takes the same amount of time. This inherently slows things down and costs power - instead of doing things only as necessary to ensure correctness, you are doing it to prevent any conclusions from being drawn.

Of course, I am oversimplifying the attack a little bit, but that’s the general idea.
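
A toy simulation of that idea, to show why "millions of times" matters. All the numbers are hypothetical: thread B's latency is a fixed 100 cycles, plus 3 more if thread A left state behind, buried under measurement noise with a standard deviation of 20 cycles.

```python
import random

BASE_CYCLES = 100      # hypothetical fixed cost seen by thread B
EXTRA_CYCLES = 3       # hypothetical extra cost when A left state behind
NOISE_STDDEV = 20.0    # hypothetical measurement noise


def observed_latency(secret_bit, rng):
    """One timing sample taken by thread B."""
    return BASE_CYCLES + EXTRA_CYCLES * secret_bit + rng.gauss(0, NOISE_STDDEV)


def guess_bit(secret_bit, trials, rng):
    """Average many samples and compare against the midpoint threshold."""
    total = sum(observed_latency(secret_bit, rng) for _ in range(trials))
    mean = total / trials
    return 1 if mean > BASE_CYCLES + EXTRA_CYCLES / 2 else 0


rng = random.Random(1234)
secret = [0, 1, 1, 0]
recovered = [guess_bit(b, 20_000, rng) for b in secret]
print(recovered == secret)
```

A single sample tells B almost nothing, since the noise is far larger than the 3-cycle difference, but the average of many samples separates the two cases cleanly. The mitigation described above amounts to making `EXTRA_CYCLES` zero by always doing the extra work.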
 

casperes1996

Definitely. And there is, as I pointed out as well, that PortSmash attack that relies on SMT. But as we also both state, it's certainly not the only vector for side-channel attacks. And I feel like SMT is hard to work with as an attack opportunity since you generally don't control where you get scheduled relative to the process you're trying to obtain information from - Of course that's also a problem if you're trying to observe local caches, but if you're trying to time behaviour based on a chip-wide cache's speed or something, that at least seems easier to target an attack around.

Clearing everything out definitely sounds like something that will practically eliminate almost all benefit from SMT though

BTW. Just thought of another thing, if you don't mind another x86_64 design question being thrown at you :p
Why is it that addressing AL or AH, like mov $someImmByte, %al, will leave AH untouched, while mov $52, %eax will 0-out the upper 32-bits of %rax? Why was that design decision made? One could imagine using a split addressing scheme like %eax for lower 32-bit, %hax for upper 32-bits and %rax for the full thing with %eax not touching %hax and vice versa, which would also allow you to kinda double the number of 32-bit registers you could fool around with in some circumstances at least. Was this considered as a design choice or was the system we wound up with for some reason the "natural choice"?
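
To pin down the behaviour being asked about, here is a small model of the partial-register write rules in 64-bit mode. The helper names are invented for the sketch; the rules themselves (8-bit and 16-bit writes preserve the rest of the register, 32-bit writes zero the upper half) are the documented x86-64 behaviour.

```python
MASK64 = (1 << 64) - 1


def write_al(rax, value):
    """8-bit write to AL: bits 8..63 are preserved."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


def write_ah(rax, value):
    """8-bit write to AH: only bits 8..15 change."""
    return (rax & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)


def write_ax(rax, value):
    """16-bit write to AX: bits 16..63 are preserved."""
    return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_eax(rax, value):
    """32-bit write to EAX: the upper 32 bits of RAX are zeroed."""
    return value & 0xFFFFFFFF


rax = 0x1122334455667788
print(hex(write_al(rax, 0xAA)))    # 0x11223344556677aa
print(hex(write_ah(rax, 0xBB)))    # 0x112233445566bb88
print(hex(write_ax(rax, 0xCCDD)))  # 0x112233445566ccdd
print(hex(write_eax(rax, 52)))     # 0x34
```

Note that only `write_eax` ignores the old value of the register entirely; the other three need the previous contents as an input to produce the new ones.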
 

Cmaier

I remember it coming up and I don’t specifically recall the rationale but I think it was for simplicity. If you are in 64-bit mode, you are in 64-bit mode, and as cute as it is to think of 64 bits as two different 32 bit quantities, …ok as I type this stuff is coming back to me… Assuming you mean in the general case, not just with loading registers. So, one thing is that it’s not very performant. An optimized 64-bit adder is very different than 2 32-bit adders, and a 64-bit register file is not the same as two 32-bit register files (or at least not necessarily the same). You have flag logic, and 2’s complement stuff, and things that you would need to duplicate at two places, etc. If you want to have a register file that can load just the high or low 32 bits, you have to essentially build it like two register files.

It’s a very x86 thing to do, of course, but that was because x86 is, from the ground up, built on “clever short-sighted hacks.”

When I designed the multiplier that could handle two 64-bit quantities, it was probably one cycle faster (it was actually just as fast as the 32-bit multiplier in 32-bit athlon, in terms of clock cycles, is my vague recollection) because I didn’t have to worry about weird stuff between bits 31 and 32.

My recollection was that my boss, the CTO, considered what that could even be used for. In x86, a lot of things were done to compact the instruction stream, and that was not even one iota of a consideration for AMD64. If you just allow mov on half the register or the other half, then what? It seems a lot of the use cases would be better off with SIMD instructions or whatever.

Anyway, I’m absolutely positive I am forgetting a lot of the rationale, but I definitely remember talking to Fred Weber about it at one point because I needed some guidance on how that was going to work so I could figure out what integer instructions would be possible.
 

casperes1996

That makes sense; Especially the bit about considering the use case and that other tools may be a better fit for those situations regardless.
I can't picture how the 2's complement stuff plays in though, since my understanding is that one of the benefits of two's complement is that you can do logic the same for signed and unsigned and so it wouldn't matter if the sign bit was the 32nd or the 64th bit. Though I can see the problem with setting relevant flags and figuring out if the carry from the 32nd to the 33rd bit should carry into the 33rd bit or only set a flag and stop. So yeah, I can see the "edges" of the logic changing and increasing complexity there for minimal gain. The real thing I would want would be more register space to not have to go to memory operations even if it is as simple as push and pop, and that was also granted through the r8-r15 registers, so accounted for in a better way anyway. And then there's SIMD operations as you say in those cases where you want to pack stuff to do a bunch all at once on the same data pool, so yeah, makes sense that it was done as it was. In many ways I have always thought that the AMD portion of x86_64 was much simpler and nicer. I kinda would like to see the alternate reality where x86 never existed and x86_64 was the beginning. If you lot could've designed the whole thing without building on top of the x86 that already was.
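
The carry question can be made concrete with a small sketch (helper names invented here). Two's complement addition is indeed the same bit pattern for signed and unsigned; what differs between one 64-bit adder and two independent 32-bit halves is whether the carry out of bit 31 is allowed to propagate.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def add64(a, b):
    """One 64-bit adder: the carry ripples through all 64 bits."""
    return (a + b) & MASK64


def add_as_two_halves(a, b):
    """Two independent 32-bit adders: the carry out of bit 31 is dropped."""
    lo = ((a & MASK32) + (b & MASK32)) & MASK32
    hi = ((a >> 32) + (b >> 32)) & MASK32
    return (hi << 32) | lo


def to_signed64(x):
    """Reinterpret a 64-bit pattern as a signed two's complement value."""
    return x - (1 << 64) if x >> 63 else x


a, b = 0x00000000FFFFFFFF, 0x0000000000000001
print(hex(add64(a, b)))               # 0x100000000
print(hex(add_as_two_halves(a, b)))   # 0x0

minus_one = MASK64                    # bit pattern of -1
print(to_signed64(add64(minus_one, 5)))   # 4
```

Supporting both behaviours in one datapath means the hardware has to decide at bit 31 whether to pass the carry on or divert it into a flag, which is the "weird stuff between bits 31 and 32" mentioned above.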
Mainly a problem for the compiler optimisers and their graph colouring problems, but I also always disliked how div and mul instructions "steal registers". With most instructions I can say "I want to use these registers". But sometimes you wind up with a bunch of "unnecessary" moves just to put things in RDX:RAX and then putting them back again after you extract your result from the mul or div. This again feels like one of those things optimising for packing instructions together tight in memory so they didn't have to encode which registers were involved in a div or mul (other than one operand register, the rest being implicit). Though as you've also pointed out, tighter instruction encoding can have benefits too, since decoding and reordering can see more at once, and there are more cache hits and all. And I don't know other ISAs well enough to compare. I only properly know how to write (a subset of) x86_64 assembly (so many instructions if we include all the extensions and x87 and everything)
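
A small model of the implicit-register behaviour being complained about. The dictionary-of-registers representation and function names are made up for the sketch; the semantics follow the one-operand unsigned `mul` and `div` forms, where RAX and RDX are always involved whether you wanted them to be or not.

```python
MASK64 = (1 << 64) - 1


def mul_one_operand(regs, src):
    """Unsigned `mul src`: RDX:RAX = RAX * src. Old RDX is overwritten."""
    product = regs["rax"] * regs[src]
    regs["rax"] = product & MASK64
    regs["rdx"] = product >> 64


def div_one_operand(regs, src):
    """Unsigned `div src`: RDX:RAX / src, quotient to RAX, remainder to RDX."""
    dividend = (regs["rdx"] << 64) | regs["rax"]
    quotient, remainder = divmod(dividend, regs[src])
    if quotient > MASK64:
        raise OverflowError("quotient does not fit in RAX")
    regs["rax"], regs["rdx"] = quotient, remainder


regs = {"rax": 2**63, "rdx": 0xDEAD, "rbx": 4}
mul_one_operand(regs, "rbx")
print(hex(regs["rax"]), hex(regs["rdx"]))   # 0x0 0x2, the old RDX is gone

div_one_operand(regs, "rbx")
print(hex(regs["rax"]), hex(regs["rdx"]))   # 0x8000000000000000 0x0
```

Anything live in RDX before the `mul` has to be moved elsewhere first, which is where the extra moves come from. When only the low 64 bits of a product are needed, the two-operand `imul` form avoids the implicit registers.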
For fun I also tried writing my own fixed width ISA during my last uni break - Of course I don't have the chip engineering knowledge to optimise for any of the aspects going into actually producing hardware that can run it, but it was a fun exercise to think about how I would express certain things in the 4 bytes I gave myself per instruction. Still want to do more with it at some point cause it's incredibly basic right now, but also made an emulator for it and an assembler to take the mnemonic form into a raw binary form that the emulator can execute. I'm pretty sure it's incredibly inefficient though, haha. A lot of my instructions only need three bytes but I wanted fixed width and some things I couldn't think of a clean way of doing in less than 4 bytes within the constraints I set up for myself. Though thinking about it now I probably could actually get the current set of instructions down to 3 bytes though it's nice with headroom, knowing I want to eventually add more instructions too.
Anyways I'm just ranting now about pet projects and all, haha. I get carried away easily while writing
 