Apple: M1 vs. M2

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
Take it with a grain of salt though, as that same youtuber is known to have made up technical issues on the spot for clicks (i.e. the 'TLB is limited to 32MB due to lack of foresight and that's what's limiting GPU scaling on the M1 Ultra' BS).
For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.
Maybe my wording was a bit too harsh. I don't think he makes things up on purpose, but it sure is convenient for his business model that he thought he had found a fatal design flaw in Apple's SoC design that was the cause of the (then unexplained) apparently bad scaling of the M1 Ultra. Maybe he thought he had genuinely found a flaw, but at the very least I doubt he believed it to be as impactful as he implied in his videos/tweets. I know I would second-guess myself *many* times before claiming to have found a design flaw that Apple itself missed.

Could be much worse, though. I've read an editor, in the Spanish-speaking Apple-related blogosphere, who makes up all the info in their technical articles. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core as things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.

On another topic: I found this thread by Hector Martin about the M2 IRQ controller on Twitter interesting:
Maybe we'll know more about Apple's plans for the Mac Pro once the M2 Pro/Max Macs release.
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,794
Reaction score
3,785
Maybe my wording was a bit too harsh. I don't think he makes things up on purpose, but it sure is convenient for his business model that he thought he had found a fatal design flaw in Apple's SoC design that was the cause of the (then unexplained) apparently bad scaling of the M1 Ultra. Maybe he thought he had genuinely found a flaw, but at the very least I doubt he believed it to be as impactful as he implied in his videos/tweets. I know I would second-guess myself *many* times before claiming to have found a design flaw that Apple itself missed.

Could be much worse, though. I've read an editor, in the Spanish-speaking Apple-related blogosphere, who makes up all the info in their technical articles. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core as things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.

On another topic: I found this thread by Hector Martin about the M2 IRQ controller on Twitter interesting:
Maybe we'll know more about Apple's plans for the Mac Pro once the M2 Pro/Max Macs release.

Yeah, the M2 Max die will be what tells us their plans. Can’t tell much from this yet.
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
Could be much worse, though. I've read an editor, in the Spanish-speaking Apple-related blogosphere, who makes up all the info in their technical articles. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core as things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.
Tangentially related, Vulcan just froze over, because Linus Tech Tips actually released a video that mirrors everything we've been saying here.


Anthony is the only presenter on LTT worth watching, at this point, in my opinion. He lays out how the PC industry's reliance on ever-increasing power consumption is going to catch up with it, that building a PC may become a pastime, and that Apple's integrated approach is the future. Anthony specifically cites the Mac Studio and how it gets nearly the performance of a high-end PC at a fraction of the wattage. The entire Mac Studio with an M1 Ultra consumes as much as a 12900K alone, before adding in the other PC components. He also points out that x86 is an old, crufty architecture, and that the move to Arm would benefit the computer industry. Other than a small jab at Metal, he basically parrots everything we've been saying here for months.

I mention it because these problems that Apple has been trying to solve with their vertical integration strategy are eventually going to impact the rest of the PC industry. Anthony's perspective is refreshing, since I'm used to Linus harvesting clicks with anti-Apple video titles and pedantically harping on what he believes to be the Mac's drawbacks; or at least what his primary PC partisan audience perceives to be negatives. From spelunking into the video's comments section, his viewers were not happy about Anthony's logical reasoning.
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,794
Reaction score
3,785
Tangentially related, Vulcan just froze over, because Linus Tech Tips actually released a video that mirrors everything we've been saying here.


Anthony is the only presenter on LTT worth watching, at this point, in my opinion. He lays out how the PC industry's reliance on ever-increasing power consumption is going to catch up with it, that building a PC may become a pastime, and that Apple's integrated approach is the future. Anthony specifically cites the Mac Studio and how it gets nearly the performance of a high-end PC at a fraction of the wattage. The entire Mac Studio with an M1 Ultra consumes as much as a 12900K alone, before adding in the other PC components. He also points out that x86 is an old, crufty architecture, and that the move to Arm would benefit the computer industry. Other than a small jab at Metal, he basically parrots everything we've been saying here for months.

I mention it because these problems that Apple has been trying to solve with their vertical integration strategy are eventually going to impact the rest of the PC industry. Anthony's perspective is refreshing, since I'm used to Linus harvesting clicks with anti-Apple video titles and pedantically harping on what he believes to be the Mac's drawbacks; or at least what his primary PC partisan audience perceives to be negatives. From spelunking into the video's comments section, his viewers were not happy about Anthony's logical reasoning.

The comments are hilarious, both from their lack of perspective as to what consumers care about and from their lack of technical understanding.
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
The comments are hilarious, both from their lack of perspective as to what consumers care about and from their lack of technical understanding.
I get more enjoyment out of Linus' comments section than the videos, for these very reasons. Hardcore PC gamers are highly myopic in their viewpoints, rigid in their thought processes, and superbly resistant to change.

Buried within the miasma of condemnation, this comment stuck out to me:
There's a video with lead Ryzen designer Jim Keller titled "ARM vs X86 vs RISC, does it matter?". His answer was that yes you want a tiny instruction set if you're building a tiny low power processor, but for desktop class chips it makes basically no difference since the decode block is so small relative to the die.
This got heavily upvoted, here's the video in question, but all evidence suggests that RISC does matter on the desktop. I hear all the time from the PC crowd that Apple's advantage is solely a result of a more advanced process from TSMC, and that instruction set doesn't matter. I suppose Mark Twain was right; denial ain't just a river in Egypt.
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,794
Reaction score
3,785
I get more enjoyment out of Linus' comments section than the videos, for these very reasons. Hardcore PC gamers are highly myopic in their viewpoints, rigid in their thought processes, and superbly resistant to change.

Buried within the miasma of condemnation, this comment stuck out to me:

This got heavily upvoted, here's the video in question, but all evidence suggests that RISC does matter on the desktop. I hear all the time from the PC crowd that Apple's advantage is solely a result of a more advanced process from TSMC, and that instruction set doesn't matter. I suppose Mark Twain was right; denial ain't just a river in Egypt.

Jim, Jim, Jim. I don’t have time to watch the video, but if the context is accurate, that’s just silly. Sure, if all you care about is die area and total power dissipation, then it doesn’t matter. Doubling or tripling the size and watts of the instruction decoder won’t matter when you have a chip with 32 cores and tons of cache on it. But there are lots of things other than just die area to worry about. And he is selling the power issue short - needing a higher clock to keep up because your IPC is lower because you can’t reliably decode enough instructions to keep the pipelines full also causes much more power to be burned; it’s not just the power consumed by the instruction decoder itself that matters.
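
To put rough numbers on the IPC-versus-clock point, here is a minimal sketch using the textbook CMOS dynamic power approximation (P proportional to C * V^2 * f). The linear voltage/frequency curve, the IPC values, and the performance target are all made-up assumptions for illustration, not measurements of any real chip.

```python
# Illustrative only: every constant here is an assumption, not a measurement.

def dynamic_power(freq_ghz: float, volts: float, capacitance: float = 1.0) -> float:
    """Textbook CMOS dynamic power approximation: P ~ C * V^2 * f."""
    return capacitance * volts ** 2 * freq_ghz


def volts_for(freq_ghz: float, base_volts: float = 0.8, volts_per_ghz: float = 0.1) -> float:
    """Hypothetical linear voltage/frequency curve (assumed for this sketch)."""
    return base_volts + volts_per_ghz * freq_ghz


def power_for_target(perf_target: float, ipc: float) -> float:
    """Relative power needed to reach perf_target = ipc * freq on one core."""
    freq_ghz = perf_target / ipc
    return dynamic_power(freq_ghz, volts_for(freq_ghz))


# Same performance target, two hypothetical cores.
wide = power_for_target(perf_target=24.0, ipc=8.0)    # runs at 3.0 GHz
narrow = power_for_target(perf_target=24.0, ipc=5.0)  # must run at 4.8 GHz

print(f"wide core:   {wide:.2f} (arbitrary units)")
print(f"narrow core: {narrow:.2f} (arbitrary units)")
print(f"power ratio: {narrow / wide:.2f}x")
```

Under these assumptions the lower-IPC core pays more than the 1.6x clock increase alone would suggest, because the higher clock also needs a higher voltage and power scales with the square of it.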
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
I don’t have time to watch the video, but if the context is accurate, that’s just silly.
If you're short on time, just watch the first two minutes. Keller explains, in his opinion, why ISA "doesn't matter that much". Yes, the context is entirely accurate.
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,794
Reaction score
3,785
If you're short on time, just watch the first two minutes. Keller explains, in his opinion, why ISA "doesn't matter that much". Yes, the context is entirely accurate.
Keep in mind that Keller also thought it was just dandy to have a chip that could be either x86 or ARM and just have the instruction decoder take care of it. So he’s big on “who the hell cares whether the instruction decoder is big and inefficient?!?”

Imagine what it would take to have an M1-like chip, with so many parallel pipes, where you guarantee that the x86 personality can also keep that many pipes full? You’d have to build in all the alder-lake decoding just to maybe make efficient use of your pipes. And I’m still not convinced that would work.
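
A toy sketch of why the decode problem differs between the two styles of ISA. The variable-length encoding below is hypothetical (the first byte of each instruction simply stores its length) and is NOT real x86 encoding; it only illustrates the serial dependency when instruction boundaries are not known up front.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width ISA: every start offset is known without looking at the bytes,
    so any number of decoders can each grab an instruction in parallel."""
    return list(range(0, len(code), width))


def variable_width_starts(code: bytes) -> list[int]:
    """Toy variable-length ISA (hypothetical, not x86): the first byte of each
    instruction holds its total length. The start of instruction i+1 is unknown
    until instruction i has been at least partially decoded."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError(f"invalid zero-length instruction at offset {offset}")
        offset += length  # serial dependency: each step needs the previous one
    return starts


fixed_program = bytes(16)  # four 4-byte instructions
toy_program = bytes([2, 0xAA, 1, 3, 0xBB, 0xCC, 2, 0xDD])

print(fixed_width_starts(fixed_program))   # [0, 4, 8, 12]
print(variable_width_starts(toy_program))  # [0, 2, 3, 6]
```

Real hardware works around the serial scan with extra decode logic, which is the area and power cost being argued about here.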
 

casperes1996

Power User
Vaccinated
Posts
54
Reaction score
31
Take it with a grain of salt though, as that same youtuber is known to have made up technical issues on the spot for clicks (i.e. the 'TLB is limited to 32MB due to lack of foresight and that's what's limiting GPU scaling on the M1 Ultra' BS).

For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.
I definitely don't think they're trying to be misleading, but they do (by their own admission) pump out videos at such a high rate, relative to the time they have to fact-check, that they just don't do any fact-checking at all. And the TLB bullshit came from "an anonymous source familiar with the matter". And yeah, they don't have the technical knowledge to fact-check any of that themselves. Max Yuryev's content was at its best before MaxTech blew up, when it focused on his background as a photographer and video maker. He knows what he talks about in that space, and when he first started comparing computers he didn't try to be that technical. He tried to say "from the perspective of someone using them professionally for film/photography, this is the user experience".
But I've left too many comments on their videos when they mention the TLB going "What... Please, if there's actually any logic to this, tell me how the TLB is the problem here?". There's a long thread on Hackernet where they give them a massive benefit of the doubt saying "Maybe they were talking about TiLe Buffer and not Translation Lookaside Buffer? And it's about Tile memory in the GPU?" etc. but there was just no way to make it make sense as the big bottleneck they talk about. I mean, you can of course have a memory layout that will miss caches and such, but you want to pack data for good access patterns regardless of whether it's Apple Silicon or not. So yeah.
Jim, Jim, Jim. I don’t have time to watch the video, but if the context is accurate, that’s just silly. Sure, if all you care about is die area and total power dissipation, then it doesn’t matter. Doubling or tripling the size and watts of the instruction decoder won’t matter when you have a chip with 32 cores and tons of cache on it. But there are lots of things other than just die area to worry about. And he is selling the power issue short - needing a higher clock to keep up because your IPC is lower because you can’t reliably decode enough instructions to keep the pipelines full also causes much more power to be burned; it’s not just the power consumed by the instruction decoder itself that matters.
In fairness, I think the quote has some merit too. Like the people who go "ARM is just for phones. Can never be a proper desktop CPU!" - ISA doesn't matter there. And I've used the quote as well when talking to people who were saying that "any ARM will be better than any x86", using the quote to effectively say "The ISA doesn't matter (as much as the actual chip design)" - Apple's Firestorm is not the same as Qualcomm's Snapdragon cores. They may both be ARMv8, but the ISA doesn't make the chip. An Alder Lake is not a Pentium II. An Athlon is not a Ryzen. To me the quote says "Give credit to the chip design - it's better cause it's better. Not just cause an ISA is inherently better - there's still a lot of work that goes into it after that". Regardless of how Keller actually meant it, I think that's a good message.
Plus, he was trying to sell RISC-V for SiFive. With a fairly niche ISA like that compared to ARM and x86, you kinda need to be arguing "No no, it doesn't matter, I swear!" - I don't know how good RISC-V is in terms of making efficient hardware, but at least from a software support perspective, its niche status can make it a harder sell for some applications.
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
But I've left too many comments on their videos when they mention the TLB going "What... Please, if there's actually any logic to this, tell me how the TLB is the problem here?". There's a long thread on Hackernet where they give them a massive benefit of the doubt saying "Maybe they were talking about TiLe Buffer and not Translation Lookaside Buffer? And it's about Tile memory in the GPU?" etc. but there was just no way to make it make sense as the big bottleneck they talk about. I mean, you can of course have a memory layout that will miss caches and such, but you want to pack data for good access patterns regardless of whether it's Apple Silicon or not. So yeah.
Also if he meant tile buffer (instead of TLB) as the cause of the problem, it wouldn't explain the scaling issues that he was trying to explain in the first place.
 

theorist9

Power User
Posts
72
Reaction score
44
There's been a lot written about why Apple's CPUs are more efficient (performance : power) than Intel's. It seems it's essentially three things: A macroarchitecture that allows for more efficiency, a microarchitecture designed from the ground up with efficiency in mind, and a freedom from backwards compatibility requirements. Does that cover the essentials?

But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering: What are the essential differences that account for that? And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,794
Reaction score
3,785
There's been a lot written about why Apple's CPUs are more efficient (performance : power) than Intel's. It seems it's essentially three things: A macroarchitecture that allows for more efficiency, a microarchitecture designed from the ground up with efficiency in mind, and a freedom from backwards compatibility requirements. Does that cover the essentials?

But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering: What are the essential differences that account for that? And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?

Years ago I interviewed at Nvidia, and they had no idea how to design CPUs. They thought that the ASIC design methodology that they used for GPUs, where time-to-market was the most important thing, would work fine for CPUs, and that it would be impossible to use a custom methodology (like what AMD was using at the time) to achieve the time to market they needed. Their entire design team was composed of people who only knew how to design a chip by writing code (I can’t remember if it was Verilog or some sort of C-based language) and letting a tool like Synopsys come up with a netlist and then something like Cadence to auto place & route. I’m guessing not a lot has changed.

When I was handling the methodology at AMD, we’d often have representatives from different EDA vendors come in and try to sell us on their tools. Cadence, Apollo, Synopsys, Mentor, whatever. We’d typically give them some block that we needed for whatever chip we were working on, and say “go use your tools and do the best you can, and we’ll compare it to what we do by hand.” Every single time, they’d come up with something that took 20% more die area, burned 20% more power, and was 20% slower. (Or some slightly different allocation, but it was always a failure).

I’m sure that tile-based deferred rendering and unified memory architecture and all that is fantastic and has a lot to do with it, but another advantage Apple has is that they design chips the “right” way.
 

mr_roboto

Power User
Posts
120
Reaction score
128
But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering: What are the essential differences that account for that? And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?
Others have mentioned TBDR efficiency gains, and @Cmaier mentioned NVidia's design methodology (though FYI Cliff, from some die photos of their more recent GPUs, I suspect they've transitioned away from standard cell ASIC - a few generations ago everything other than memories was shapeless APR blobs, but in their recent stuff compute looks more orderly). There's also process node advantage - NVidia's been using Samsung as a foundry and apparently Samsung's 8nm process isn't too competitive with TSMC 5nm.

But I think most important of all is just a very basic design philosophy choice. Every GPU has to have lots of raw compute power. If you want to design a 10 TFLOPs GPU, do you get there by clocking lots of ALUs at a relatively slow speed, or fewer ALUs at much higher clocks?

The former is what Apple seems to be doing. It wastes die area, but increases power efficiency. The latter choice is roughly what Nvidia does - damn the power, we want to make as small a die as possible for a given performance level.

These choices are probably somewhat influenced by TBDR. A TBDR GPU can get away with fewer FLOPs for a given rasterization performance target, since it uses those FLOPs more efficiently (or can, with adequate application software optimization for TBDR). But I think it's far more important that Apple Silicon has a very strong focus on power efficiency, one which comes right from the top of their organization (probably even extending to the CEO).
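
The wide-and-slow versus narrow-and-fast trade-off can be sketched with back-of-the-envelope arithmetic. The clocks, voltages, and the two-FLOPs-per-cycle figure (one fused multiply-add per ALU) are assumptions picked for illustration; they are not Apple or Nvidia specifications.

```python
import math


def tflops(alus: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak throughput; 2 FLOPs per cycle assumes one fused multiply-add per ALU."""
    return alus * clock_ghz * flops_per_cycle / 1000.0


def alus_needed(target_tflops: float, clock_ghz: float, flops_per_cycle: int = 2) -> int:
    """How many ALUs it takes to reach the target at a given clock."""
    return math.ceil(target_tflops * 1000.0 / (clock_ghz * flops_per_cycle))


def relative_power(alus: int, clock_ghz: float, volts: float) -> float:
    """Dynamic power approximation per ALU (V^2 * f), summed over all ALUs."""
    return alus * volts ** 2 * clock_ghz


TARGET = 10.0  # TFLOPS, the figure used in the post above

# Hypothetical operating points: the slower design is assumed to run at lower voltage.
wide_alus = alus_needed(TARGET, clock_ghz=1.25)   # many ALUs, low clock
narrow_alus = alus_needed(TARGET, clock_ghz=2.0)  # fewer ALUs, high clock

wide_power = relative_power(wide_alus, clock_ghz=1.25, volts=0.75)
narrow_power = relative_power(narrow_alus, clock_ghz=2.0, volts=1.0)

print(f"wide:   {wide_alus} ALUs, relative power {wide_power:.1f}")
print(f"narrow: {narrow_alus} ALUs, relative power {narrow_power:.1f}")
```

Since ALU count times clock is the same for both designs, the power difference in this model comes entirely from the voltage term, while the ALU count (and so die area) is what the wide design pays for it.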
 