Apple: M1 vs. M2

Cmaier · Jun 8, 2022

Yoused said:
I made a little chart to look at some Apple Silicon performance metrics based on GeekBench 5, using iPad SoCs

year  chip  cores  GB5 score  GB5 multicore  GHz  score per GHz  multicore per core
2013  A7    2      278        526            1.4  198.6          94.6%
2014  A8X   3      378        1049           1.5  252            92.5%
2015  A9X   2      648        1195           2.2  294.5          92.2%
2016  A9X   2      643        1176           2.1  306.2          91.4%
2017  A10X  6      831        2264           2.3  361.3          45.4%
2018  A12X  8      1113       4607           2.5  445.2          51.7%
2019  A12Z  8      1116       4617           2.5  446.4          51.7%
2020  A14   8      1584       4124           3.0  528            32.5%
2021  M1    8      1708       7145           3.2  533.8          52.3%

What’s interesting to me is that the multicore per core for M1 Ultra, if I understand your methodology, would be 23870 (GB5 multicore score) divided by (1771 (GB5 single core score) x 20) (=35,420) which is 67.4%.

Unless I messed up with one of the numbers I am plugging in.
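
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the "multicore per core" ratio as I understand it from the chart: multicore score divided by (single-core score x core count). The function name is made up for illustration, and the inputs are just the scores quoted in this thread.

```python
def multicore_per_core(single: float, multi: float, cores: int) -> float:
    """Multicore score as a fraction of (single-core score x core count)."""
    return multi / (single * cores)


# M1 row from the chart: 8 cores, 1708 single-core, 7145 multicore.
print(f"M1:       {multicore_per_core(1708, 7145, 8):.1%}")

# M1 Ultra with the scores quoted above: 20 cores, 1771 single, 23870 multi.
print(f"M1 Ultra: {multicore_per_core(1771, 23870, 20):.1%}")
```

Geekbench results vary from run to run, so the last digit should not be taken too seriously.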

Yoused · Jun 8, 2022

Cmaier said:
What’s interesting to me is that the multicore per core for M1 Ultra, if I understand your methodology, would be 23870 (GB5 multicore score) divided by (1771 (GB5 single core score) x 20) (=35,420) which is 67.4%.

Unless I messed up with one of the numbers I am plugging in.

The scores I saw for M1 Ultra were 1754/23350 which gave me 66.6% – ballpark. IIRC, Ultra is 16+4, right?

Cmaier · Jun 8, 2022

Yoused said:
The scores I saw for M1 Ultra were 1754/23350 which gave me 66.6% – ballpark. IIRC, Ultra is 16+4, right?

Yep.

Of course Geekbench scores will vary slightly. That’s a pretty impressive number. And it may actually go up with M2 Ultra, given increased memory bandwidth.

Cmaier · Jun 8, 2022

As I was driving past Steve Jobs’ house in Old Palo Alto on the way back to the office from lunch in Whiskey Gulch, the whispers of the nerds on the street were that M2’s clock speed is 3.4GHz.

There was not much about the whisperers that made me think they were in a position to know anything, but I figured I’d pass it along.

Citysnaps · Jun 8, 2022

Cmaier said:
As I was driving past Steve Jobs’ house in Old Palo Alto on the way back to the office from lunch in Whiskey Gulch, the whispers of the nerds on the street were that M2’s clock speed is 3.4GHz.

There was not much about the whisperers that made me think they were in a position to know anything, but I figured I’d pass it along.

That's a beautiful home and not your typical tech CEO mansion. When I worked at the end of California Ave in PA years ago I'd walk through that neighborhood once in a while during my lunch hour.

Cmaier · Jun 8, 2022

citypix said:
That's a beautiful home and not your typical tech CEO mansion. When I worked at the end of California Ave in PA years ago I'd walk through that neighborhood once in a while during my lunch hour.

Yep. Very understated. You’d never know it was his if you weren’t a local.

Citysnaps · Jun 8, 2022

Cmaier said:
Yep. Very understated. You’d never know it was his if you weren’t a local.

Very much unlike his pal Larry Ellison, who built a 16th-century samurai village and emperor's palace, complete with lake, in Woodside. Apparently, to keep construction authentic, it was all built with wood pegs rather than nails.

Yoused · Jun 8, 2022

theorist9 · Jun 8, 2022

Yoused said:
The last column is kind of silly: if the multicore score reflected single-core times core count, it would be 100% – at 2017, it falls off a lot because the SoC is Big.little and the single core score is for the big core.

You can get a meaningful MT percentage with Big.little if you can estimate what the SC score is for the little cores. I recall someone estimating the little core in the M1 at 25% of a Big core. If so, 100% MT with the M1 would be (using your SC and MT numbers):
1708 x 4 + (1708/4) x 4 = 8540. Thus we have an average MT per core of 7145/8540 = 84%.

And we can also do this for the Ultra. Using the same assumption that the little cores give 25% the performance of the Big cores, we have:
23366/(1755 x 16 + 4 x 1755/4) = 78%

Yoused said:
As a side note, when Alder Lake is mapped into the last column, if you count 16 cores, the performance is a respectable 54.2%, but if you count the full capacity of 24 threads, it drops off to a sad 36.2%.

[N.B.: Corrections made based on mr_roboto's post, below.]

If you're going to do the calculation by threads, the performance cores each have two threads, so the 1990 SC score would be for two threads, not one. So the expectation by threads would be 995 per thread. Thus the percentage by thread would be 17285/(995 x 24) = 72%.

But rather than doing the calculation that way, I'd suggest taking the same approach with the i9-12900K Alder Lake as we did with the M1 and Ultra: If a little core is 25% of a Big core, then the i9-12900K's MT percentage is 87%. If the little is 50% of a Big, then we get 72% (mathematically, it's the same as we get when counting by thread, because counting by thread effectively counts each Big core as 2 x a little core):
17285/(8 x 1990 + 8 x 1990/4) = 87%
17285/(8 x 1990 + 8 x 1990/2) = 72%
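
The weighted calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function and parameter names are invented here, and `little_ratio` (how much of a Big core one little core is worth) is an assumed estimate, not a measured value.

```python
def mt_efficiency(single: float, multi: float,
                  big: int, little: int, little_ratio: float) -> float:
    """Multicore score divided by the ideal score, where each little core
    is assumed to deliver `little_ratio` of the Big-core single-core score."""
    ideal = single * (big + little * little_ratio)
    return multi / ideal


# Scores and core counts as quoted in this thread.
cases = [
    ("M1 (little = 25%)",        1708, 7145,  4,  4, 0.25),
    ("M1 Ultra (little = 25%)",  1755, 23366, 16, 4, 0.25),
    ("i9-12900K (little = 25%)", 1990, 17285, 8,  8, 0.25),
    ("i9-12900K (little = 50%)", 1990, 17285, 8,  8, 0.50),
]
for name, sc, mt, big, little, ratio in cases:
    print(f"{name:26s} {mt_efficiency(sc, mt, big, little, ratio):.1%}")
```

Changing `little_ratio` shows how sensitive the result is to the little-core estimate, which is the weakest input here.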

mr_roboto · Jun 9, 2022

theorist9 said:
Having said that, the bigger issue is that you don't want to think of hyperthreaded threads that way to start with, because hyperthreading doesn't allow a single core to do more than one process at once. It simply queues up the threads within a core for faster thread switching, so there is less idle time. [I believe it does that by "exposing two logical execution contexts per core"*, where each context has its own thread. Thus when one context would be waiting for more input, it can immediately switch to the other context. But only one context can run at once.]

What you're describing is a type of hardware multithreading support sometimes called Switch on Event Multi-Threading, or SoEMT. Usually the event causing a context switch in a SoEMT core is a memory stall - rather than waiting around for memory to come back with results, switch to another thread to keep the core busy.

However, Intel "hyperthreading" is true simultaneous multithreading (SMT) - instructions from both hardware threads coexist and make forward progress in the core's execution units at the same time. There is no context switch.

The purpose of SMT is not fast thread switching. It's basically a trick to extract more throughput from an out-of-order superscalar CPU core.

To understand how this works, consider a hypothetical OoO CPU. It has U execution units, each with P pipeline stages, so the core can have N = U * P instructions in progress at the same time. To maximize the number of instructions completed per cycle, and hence the total performance of the core, ideally you want all N of these execution slots occupied by an instruction in every cycle.

That turns out to be hard to accomplish. Say the execution units consist of two integer, one load/store, and one FP. Assume there's a front end capable of decoding and dispatching four instructions per cycle. If the running program doesn't stick to a rigid pattern of exactly 2 int, 1 L/S, and 1 FP instruction in each group of 4 instructions, there's simply no way to keep all the execution units busy. The core will have to issue (and therefore retire) less than 4 IPC.

When you measure this in the real world, it's rare for CPUs to run anywhere close to their theoretical maximum IPC. The usual reason cores end up in that place is that some programs benefit from having lots of a particular kind of execution unit while not using the others, so you end up sizing things for the peak requirements of each type of program you care about, but this inevitably leads to lots of wasted resources when running something else.

That's how SMT was born. You just try to fill the empty slots with instructions from one or more additional threads. In very rare cases you might see as much as a doubling of throughput, but the average won't be nearly that good - the threads are competing with each other for all the core's resources, including cache and physical registers. However, you typically do see more throughput than a single thread running in the same set of execution units.
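
To make the slot-filling idea concrete, here is a toy Python model of the example core above (two integer units, one load/store, one FP, four-wide dispatch). It is a deliberately crude sketch, not a description of any real core: it ignores pipeline depth, data dependencies, caches and register pressure, and `run`, `UNITS` and `WIDTH` are names made up for this illustration. Each thread issues in program order until it needs a unit type that is already taken in that cycle, and any other thread then gets the leftover slots.

```python
from collections import deque

UNITS = {"int": 2, "ls": 1, "fp": 1}  # execution units per type (assumed core)
WIDTH = 4                             # instructions dispatched per cycle


def run(threads):
    """Return the number of cycles needed to issue every instruction.

    `threads` is a list of instruction streams, each a list of unit names.
    """
    queues = [deque(t) for t in threads]
    cycles = 0
    while any(queues):
        free = dict(UNITS)
        slots = WIDTH
        # Rotate which thread gets first pick each cycle.
        start = cycles % len(queues)
        for q in queues[start:] + queues[:start]:
            while q and slots and free[q[0]] > 0:
                free[q.popleft()] -= 1
                slots -= 1
        cycles += 1
    return cycles


int_thread = ["int"] * 100  # integer-only work
fp_thread = ["fp"] * 100    # FP-only work

print(run([int_thread]))              # 50: two int units, nothing else used
print(run([fp_thread]))               # 100: one FP unit, nothing else used
print(run([int_thread, fp_thread]))   # 100: both fit together (150 back to back)
print(run([int_thread, int_thread]))  # 100: no gain, both want the int units
```

The last two lines are the point: two threads that want different units overlap almost for free, while two threads fighting over the same units gain nothing in this model.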

Yoused · Jun 9, 2022

sorry, incorrect info removed

Cmaier · Jun 9, 2022

Yoused said:
One of the big problems with Itanic was that it could issue 4 ops per cycle but only one FP op. Given its large reg file, one could easily imagine stretches of code where several FP ops might occupy one 4-op code line, but Intel just could not be arsed to add FP capacity. I think there were other major problems, but that was a big one.

The trade-off was probably not too unreasonable at the time. We were thinking of doing a brand new FP instruction set with a fancy dedicated unit, etc. at one point, right around that time frame, and we found that, frankly, in real-world software floating point was almost never used. When I owned the floating point on the PowerPC x704, it took up half the die area of the chip. We needed to have it, but hardly anybody used it.

Add in the fact that x87 floating point is so kludgy, and so Itanium’s changes would likely have given a boost to FP software anyway, and I can understand why Intel wasn’t keen to allocate more than 20% of the issue bandwidth to FP.

The other issue is that a few FP ops are not all that pipelineable (depending on how you implement them), so I wonder, in real use, if they were already saturating the FP execution hardware.

Colstan · Jun 9, 2022

Hey @Cmaier, a question about the missing Mac Pro. While Gurman has slowly morphed into Digitimes, his original sources for the M1-series were spot on. He never did specify what form the M1 "Extreme" would take. Do you think that Apple had originally planned for the Mac Pro to use a die with four UltraFusion interconnects, but then decided to push that out to the M2? Or do you think that Apple was working on a traditional SMP design for the Mac Pro, decided that the engineering effort wasn't worth it for a niche product, and scrapped those plans? The Mac Pro has been a riddle, wrapped in a mystery, inside an enigma. Perhaps we should hand out missing computer flyers on every Cupertino street corner.

Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.

Cmaier · Jun 9, 2022

Colstan said:
Hey @Cmaier, a question about the missing Mac Pro. While Gurman has slowly morphed into Digitimes, his original sources for the M1-series were spot on. He never did specify what form the M1 "Extreme" would take. Do you think that Apple had originally planned for the Mac Pro to use a die with four UltraFusion interconnects, but then decided to push that out to the M2? Or do you think that Apple was working on a traditional SMP design for the Mac Pro, decided that the engineering effort wasn't worth it for a niche product, and scrapped those plans? The Mac Pro has been a riddle, wrapped in a mystery, inside an enigma. Perhaps we should hand out missing computer flyers on every Cupertino street corner.

Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.

I am pretty sure Mac Pro was always going to be four UltraFusion die. I don’t think it was ever intended to be based on M1, though. There’s nothing about M1 Max that makes me think it would ever have supported two connections.

I doubt anybody at Apple would give that guy any info. I know, and have worked with, supervised, or worked for many folks over at Apple’s CPU design team. None of them will so much as say a word to me about any of it. The only reason I occasionally figure things out is because the closer I guess the more nervous they get. You hear more from folks who have already left Apple, but they know less, of course.

For what it’s worth, I don’t know anyone from the engineering side of my LinkedIn contacts list who isn’t incredibly impressed with what Apple has done.

theorist9 · Jun 9, 2022

mr_roboto said:
What you're describing is a type of hardware multithreading support sometimes called Switch on Event Multi-Threading, or SoEMT. Usually the event causing a context switch in a SoEMT core is a memory stall - rather than waiting around for memory to come back with results, switch to another thread to keep the core busy.

However, Intel "hyperthreading" is true simultaneous multithreading (SMT) - instructions from both hardware threads coexist and make forward progress in the core's execution units at the same time. There is no context switch.

The purpose of SMT is not fast thread switching. It's basically a trick to extract more throughput from an out-of-order superscalar CPU core.

To understand how this works, consider a hypothetical OoO CPU. It has U execution units, each with P pipeline stages, so the core can have N = U * P instructions in progress at the same time. To maximize the number of instructions completed per cycle, and hence the total performance of the core, ideally you want all N of these execution slots occupied by an instruction in every cycle.

That turns out to be hard to accomplish. Say the execution units consist of two integer, one load/store, and one FP. Assume there's a front end capable of decoding and dispatching four instructions per cycle. If the running program doesn't stick to a rigid pattern of exactly 2 int, 1 L/S, and 1 FP instruction in each group of 4 instructions, there's simply no way to keep all the execution units busy. The core will have to issue (and therefore retire) less than 4 IPC.

When you measure this in the real world, it's rare for CPUs to run anywhere close to their theoretical maximum IPC. The usual reason cores end up in that place is that some programs benefit from having lots of a particular kind of execution unit while not using the others, so you end up sizing things for the peak requirements of each type of program you care about, but this inevitably leads to lots of wasted resources when running something else.

That's how SMT was born. You just try to fill the empty slots with instructions from one or more additional threads. In very rare cases you might see as much as a doubling of throughput, but the average won't be nearly that good - the threads are competing with each other for all the core's resources, including cache and physical registers. However, you typically do see more throughput than a single thread running in the same set of execution units.

Thanks, I thought I finally understood HT, but apparently didn't. I always appreciate it when someone can correct my misconceptions, as you did, and thus improve my level of understanding.

I have edited my post.

Cmaier · Jun 9, 2022

theorist9 said:
Thanks, I thought I finally understood HT, but apparently didn't. I always appreciate it when someone can correct my misconceptions, as you did, and thus improve my level of understanding.

I will edit my post.

I think the key thing to understand is that unless you have ALUs that aren’t busy, it’s still sequential. So the value depends on the workload, the efficiency of the dispatcher, etc. On Arm it appears that ALU utilization is much higher than on x86, at least with Apple’s decoder/scheduler. In my estimation, x86-style multithreading on an M-series-type chip would achieve a modest speed improvement at the cost of hardware complexity (and power consumption and die area, though the power cost may be offset by completing tasks more quickly and then being able to reduce voltage). And, of course, you also have to be careful to guard against side-channel attacks, since multithreading is the type of thing where it’s very easy to end up creating vectors for such attacks.

The trade-off should also take into account the relative size of cores and the relative efficiency in scaling core count. If you can scale well, it may make more practical sense to add a core than to make a core hyperthread.

Taken to its extreme, you could imagine that instead of separate cores, you just have a sea of ALUs, and any thread can just be dispatched to the next available ALU. On paper, where we have massless frictionless pulleys and perfectly spherical ball bearings, that may very well be the most efficient architecture.

mr_roboto · Jun 9, 2022

Colstan said:
Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.

Sorry, but I'm not very impressed by that guy. In my opinion, he's just a smarmy bullshitter angling for clicks. Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook: toss out lots of semi-random predictions, memory-hole all the misses, use any hits for self-promotion, and always, always, always look and sound super confident. It's a simple confidence scam, old as time.

Cmaier · Jun 9, 2022

mr_roboto said:
Sorry, but I'm not very impressed by that guy. In my opinion, he's just a smarmy bullshitter angling for clicks. Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook: toss out lots of semi-random predictions, memory-hole all the misses, use any hits for self-promotion, and always, always, always look and sound super confident. It's a simple confidence scam, old as time.

Good to know. I won’t bother watching his vids. I prefer reading, anyway.

Colstan · Jun 9, 2022

mr_roboto said:
In my opinion, he's just a smarmy bullshitter angling for clicks.

This is a case of having to separate the personality from the information. Tom constantly blows his own trombone, and it makes him look like a jackass.

mr_roboto said:
It's a simple confidence scam, old as time.

I disagree. From my experience, his sources are solid, and performance expectations are reliable. He counterbalances RedGamingTech, which has a pleasant host who tosses out every figure he hears. I'd rather have a beer with Paul, but get my tech rumors from Tom.

Cmaier said:
Good to know. I won’t bother watching his vids. I prefer reading, anyway.

Speaking of which, according to the videos you won't be watching, Arrow Lake comes after Raptor Lake and Meteor Lake. That's the first Intel arch that Jim Keller evidently worked on. I realize that he's a brilliant man, but I think tech nerds have given him godlike status. I would note that he apparently left Intel earlier than expected, so perhaps everything wasn't so sunny during his tenure there.

Colstan · Jun 9, 2022

mr_roboto said:
Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook

Oh, one more note. Tom's information was bang-on for RDNA2. That's because he later revealed his source to be @Cmaier's old colleague, Rick Bergman, who just happens to be AMD's VP of Computing and Graphics. So yeah, Tom's an arrogant guy, but has quality sources.
