X86 vs. Arm

Joelist · Feb 9, 2022

Not a bad article, but you need to be clearer that Apple Silicon is not ARM in the strictest sense. It uses an ISA that has the ARM ISA in it but its microarchitecture differs radically from Cortex and also from all the other ARM processors.

jbailey · Feb 10, 2022

Joelist said:
Not a bad article, but you need to be clearer that Apple Silicon is not ARM in the strictest sense. It uses an ISA that has the ARM ISA in it but its microarchitecture differs radically from Cortex and also from all the other ARM processors.

Sorry for the confusion. It’s not written by me. I just reposted it. The article was written by Joel Hruska for ExtremeTech.

Yoused · Feb 10, 2022

Joelist said:
… has the ARM ISA in it but its microarchitecture differs radically …

I am not sure I would say radically. It does the same stuff and is capable of running the same object code, perhaps with an extra feature or two. It just does it significantly more efficiently than does anyone else's implementation. In a way, it is unfortunate that the license does not have a sort of GPL-like clause so that everyone would have to share their design principles with other license holders. That would really make Intel sweat.

Cmaier · Feb 10, 2022

Yoused said:
I am not sure I would say radically. It does the same stuff and is capable of running the same object code, perhaps with an extra feature or two. It just does it significantly more efficiently than does anyone else's implementation. In a way, it is unfortunate that the license does not have a sort of GPL-like clause so that everyone would have to share their design principles with other license holders. That would really make Intel sweat.

Based on my understanding of micro architecture, it’s radically different. Not to be confused with architecture.

Buntschwalbe · Feb 20, 2022

Hi everybody!

I'm wondering where we will get our good and detailed insights into the new M2 chips, since AndreiF doesn't work for Anandtech anymore. Any Ideas?

Cmaier · Feb 20, 2022

Buntschwalbe said:
Hi everybody!

I'm wondering where we will get our good and detailed insights into the new M2 chips, since AndreiF doesn't work for Anandtech anymore. Any Ideas?

I’m sure someone else will step up. If not, we can put together information from multiple sources. But pretty sure M2 single core will look a lot like A15.

Yoused · Feb 20, 2022

Cmaier said:
pretty sure M2 single core will look a lot like A15

How wide do you think they will be able to go?

Cmaier · Feb 20, 2022

Yoused said:
How wide do you think they will be able to go?

You mean issue width? I don’t know. I guess the question is, at what point does going wider not worth it? I suppose they could still go a bit wider, though I wonder if they get more bang for the buck doing things like improving branch prediction, increasing queue depths, improving cache hit rate, etc.

Yoused · Feb 20, 2022

What about transient tracking? Having a pecuiiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.

Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value? In other words, is there a more efficient way to use/discard rename registers (kind of an op-fusion scheme, as it were), or do they already do that?

mr_roboto · Feb 21, 2022

Yoused said:
What about transient tracking? Having a pecuiiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.

Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value?

I'm not entirely sure the question makes sense. Depends on what you mean by "commit". Let me pseudocode and label the instructions to make it easier to talk about...

1. load r17, 500; # r17 = 500
2. add r16, r17, r8; # r16 = r17 + r8
...
N. load r17, 501; # r17 = 501

When instruction #2 executes, the machine has already made the value 500 into architecturally visible state for r17. If it hasn't, instruction 2 must stall until its operands are visible in architectural state.

The place where the value 500 is stored might not be what you think of as r17, but at the moment instruction #2 grabs the value 500 from the register file, it is the value of r17.

And if something unusual happens - an exception - at any time between completion of #1 and #N, the machine needs to be able to make 500 the official value of r17. The exception handler has to save it, and later restore it. It has no clue that the value will never be useful again. Can't, it's not psychic! (And the place where the value might actually be useful anyways is in the exception handler. Think debuggers, for example.)

Yoused · Feb 21, 2022

mr_roboto said:
It has no clue that the value will never be useful again. Can't, it's not psychic!

AAUI, the reorder buffer on a Firestorm has something like 630 ops in flight, which suggests that the dispatcher has a pretty panoramic view of what is downstream. I could imagine that an op in the buffer could easily be tagged with a provisional writeback-bypass flag that would allow it to go directly to the retire stage, barring an exception. Compiling code to do most of its work in a small range of scratch registers could optimize this kind of behavior, the same way compilers have become smart enough to turn verbose source into compact object code.

Exception slicing in such a large buffer must give engineers nightmares, though.

Cmaier · Feb 21, 2022

Yoused said:
What about transient tracking? Having a pecuiiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.

Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value? In other words, is there a more efficient way to use/discard rename registers (kind of an op-fusion scheme, as it were), or do they already do that?

I guess I am not understanding the question. r17 is an architectural register, and you are doing something like:

1. load r17, 500; # r17 = 500
2. add r16, r17, r8; # r16 = r17 + r8
...
N. load r17, 501; # r17 = 501

?

So is your question whether you ever bother putting the 500 into r17?

Seems to me it’s sort of moot. It has to go into a design register in either case, so that it is stable for use by the adder. That design register has the 500 in it, plus a tag (17) identifying the architectural register (and/or you have a content-addressable memory that ties 17 to the design register). I suppose the load and add could be coalesced internally into an immediate add (if you knew that you weren’t going to need r17 for some other purpose), but the benefit seems like it would be pretty minimal.

theorist9 · Jun 5, 2022

Not sure if this has been discussed [when I try using the search function on this thread I always get "No Results], but this article [https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better/] says:

"The RISC ISA emphasizes software over hardware. The RISC instruction set requires one to write more efficient software (e.g., compilers or code) with fewer instructions. CISC ISAs use more transistors in the hardware to implement more instructions and more complex instructions as well."

I take that to mean software optimization is more critical for ARM (RISC) than x86 (CISC) in order to achieve optimum performance.

They mention this needed optimization refers to both conventional program code (i.e., what most developers write), and optimization of the assembly code generated by the compiler.

So two questions:

1) Does this mean there's a lot more optimization still to be had in programs written for ARM—or has the needed software optimization to which they're referring, for the most part, already been done?

2) It seems this also could refer to optimization of low-level libraries—like, for instance, the ARM equivalent of x86's Intel Math Kernel Library. I note this because it appears that Mathematica still isn't optimized for Apple Silicon. On the WolframMark Benchmark, my 2014 MBP gets 3.0. The M1 should be nearly twice as fast. Yet I've seen several WolframMark benchmarks posted for the M1, and they're never over 3.2. [My 2019 i9 iMac gets 4.5, but it's hard to tell how many cores the benchmark is using; at least I know core number doesn't put the 4+4 core M1 at a disadvantage in comparison to my 4-core MBP.] Some have opined this is partly because no one has yet written a ARM version of the MKL that is as highly optimized as it is. This is typically explained by the substantial time and expertise Intel has devoted to MKL. But could part of this also be (here I'm purely speculating) that it's harder to achieve high optimization of low-level libraries with RISC than CISC because RISC performance is more sensitive to software inefficiencies?

Cmaier · Jun 5, 2022

theorist9 said:
Not sure if this has been discussed [when I try using the search function on this thread I always get "No Results], but this article [https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better/] says:

"The RISC ISA emphasizes software over hardware. The RISC instruction set requires one to write more efficient software (e.g., compilers or code) with fewer instructions. CISC ISAs use more transistors in the hardware to implement more instructions and more complex instructions as well."

I take that to mean software optimization is more critical for ARM (RISC) than x86 (CISC) in order to achieve optimum performance.

They mention this needed optimization refers to both conventional program code (i.e., what most developers write), and optimization of the assembly code generated by the compiler.

Thus it seems this also could refer to optimization of low-level libraries—like, for instance, the ARM equivalent of x86's Intel Math Kernel Library.

I note this because it appears that Mathematica still isn't optimized for Apple Silicon. I've seen several WolframMark benchmarks posted for the M1, and they're never over 3.2. By contrast, my 2014 MBP gets 3.0 (my 2019 i9 iMac gets 4.5, but it's hard to tell how many cores the benchmark is using; at least I know core number doesn't put the 4+4 core M1 at a disadvantage in comparison to my 4-core MBP). Some have opined this is partly because no one has yet written a ARM version of the MKL that is as highly optimized as it is. This is typically explained by the substantial time and expertise Intel has devoted to MKL. But could part of this also be (here I'm purely speculating) that it's harder to achieve high optimization of low-level libraries with RISC than CISC because RISC performance is more sensitive to software inefficiencies?

Mathematica aside, I'm wondering whether the original quote means there's a lot more optimization still to be had in programs written for ARM—or if the needed software optimization they're referring to has, for the most part, already been done.

It’s just as easy to optimize RISC code as CISC code. In fact, it’s probably easier. Think of it as building a house using Legos. CISC gives you big bricks with lots of complex shapes. RISC gives you tiny 1x1 bricks, from which you can build anything you want.

CISC code is being broken up into microOps by the processor, anyway, at the instruction decoder stage. I‘d rather have a compiler, with lots of resources and the ability to understand the entirety of the code and the developer‘s intent, rather than an instruction decoder that sees only a window into maybe 100 instructions, figure out how to optimize things.

The issue with Mathematioca seems to simply be That people haven‘t yet optimized MKL for ARM. And since my understanding is that MKL comes from Intel, it is unlikely to be any time soon unless someone comes up with their own version.

Nycturne · Jun 5, 2022

It’s also not always clear what the problem with code is without analysis. There’s lots of different traps you can fall into porting code, and the first step is always “get it working”. But for “conventional” code like you’ll find in many apps, the compiler is the one doing the hard work and developers will do passes analyzing areas where things aren’t performing like they should. Low level libraries are another matter, because they can be optimized by hand to take into account quirks of the architecture they run on. They tend to be faster than more common code, but harder to port as a result. But I’d say in my career, these low level libraries are things you write because you need to, not because you just felt like it, so they are less common, but can lie at the heart of big pieces of software, especially legacy software.

And things are different when talking about things like SIMD. Tools like SSE2NEON make it possible to port code that uses Intel SIMD intrinsics to use ARM’s SIMD units quickly, but it may not lead to optimal code. At the end of the port, you are still coupled to Intel’s SIMD, meaning if there’s a better approach available to NEON for a given task, you aren’t necessarily taking advantage of it.

But consider the difference between particular specialized number crunching libraries, and displaying some complex UI that includes a lot of images (even a grid of image thumbnails in Apple Music or Photos). The latter places a very different set of demands across the system, from disk access, to the CPU, to the GPU, but it can still be a considerable load. It also needs to be able to complete a lot of work very quickly to handle 120fps like on the new MBPs. And Apple has done some impressive work on that front. I have to keep reminding myself to test on an Intel system just to make sure that the buttery smooth scrolling/etc I’m getting on the M1 is at least reasonably good on the Intel systems as well.

But I will close by saying: don’t assume that developers put equal effort into each platform they support. They don’t. A deficiency in performance can very well be down to time. Either the decades behind an x86 library, versus a brand new port. Or simply devoting 80% of your engineering time to Windows because that’s what pays the bills.

theorist9 · Jun 5, 2022

Cmaier said:
It’s just as easy to optimize RISC code as CISC code. In fact, it’s probably easier. Think of it as building a house using Legos. CISC gives you big bricks with lots of complex shapes. RISC gives you tiny 1x1 bricks, from which you can build anything you want.

CISC code is being broken up into microOps by the processor, anyway, at the instruction decoder stage. I‘d rather have a compiler, with lots of resources and the ability to understand the entirety of the code and the developer‘s intent, rather than an instruction decoder that sees only a window into maybe 100 instructions, figure out how to optimize things.

Got it. But is the article right that (even though optimizing code for RISC is not an issue), that code optimization is more critical for RISC than CISC?

Cmaier said:
The issue with Mathematioca seems to simply be That people haven‘t yet optimized MKL for ARM. And since my understanding is that MKL comes from Intel, it is unlikely to be any time soon unless someone comes up with their own version.

There is a version for ARM they can use, and I believe are using. It's just (probably) not as good. It seems it's challenging to write a fast math library. E.g., AMD produced its own version (which is now EOL), called ACML (AMD Core Math Library), and it was significantly slower than Intel's MKL, even when run on an AMD system:

Intel MKL vs. AMD Math Core Library

Does anybody have experience programming for both the Intel Math Kernel Library and the AMD Math Core Library? I'm building a personal computer for high performance statistical computations and am

stackoverflow.com

AMD subsequently replaced ACML with AOCL (AMD Optimizing CPU Libraries), which was mostly open-source-based, and faster than ACML, but still not as fast as MKL. Thus it has been SOP for Mathematica users (and others) who owned AMD systems and needed to optimize math performance to run the MKL library. To maintain a competitive advantage for its chips over AMD's, Intel blocked MKL from being used on AMD systems, but a workaround existed until 2019 that enabled AMD users to "fool" the MKL into thinking it was being run on an Intel chip (MKL_DEBUG_CPU_TYPE = 5). Intel responded by blocking this workaround in 2020:

Linking to MKL 2019 with AMD CPUs?

Question Does BinaryBuilder allow Ryzen/Threadripper owners to choose MKL 2019 as the binary to use for MKL.jl and MKLSparse.jl? Context Back in 2019 the MKL libraries had a “loophole” that allowed AMD CPUs to take advantage of all the nifty engineering done by Intel by setting the env variable...

discourse.julialang.org

Finally, I recently read a rumor that Intel, because of competition from ARM, might again allow MKL to run on AMD, in order to provide general support to the x86 ecosystem. No idea if it's true.

Cmaier · Jun 5, 2022

theorist9 said:
Got it. But is the article right that (even though optimizing code for RISC is not an issue), that code optimization is more critical for RISC than CISC?

I‘m not aware of a method of quantizing how “critical“ optimization is for a given architecture. I would tend to disagree with the premise, though. I *think* the premise in what you cited is wrong. It refers to “fewer” instructions in RISC - but that’s not what the “[R]educed” means in RISC, really. There can be just as many instructions in a RISC architecture as in a CISC architecture - each is just reduced in complexity.

It seems to me that *CISC* requires more optimization. For CISC you have to pick the right instruction, understand all of the side-effects of that instruction, and deal with fewer registers. RISC is more forgiving - you have fewer registers, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.

CISC made sense in the days where RAM was extremely limited, because you can encode more functionality in fewer instructions (going back to the Lego metaphor - you can use fewer bricks, even if each brick is more complicated). Nowadays that isn’t an issue, so there is absolutely no advantage to CISC.

theorist9 said:
There is a version for ARM they can use, and I believe are using. It's just (probably) not as good. It seems it's challenging to write a fast math library. E.g., AMD produced its own version (which is now EOL), called ACML (AMD Core Math Library), and it was significantly slower than Intel's, even when run on an AMD system:

Intel MKL vs. AMD Math Core Library

Does anybody have experience programming for both the Intel Math Kernel Library and the AMD Math Core Library? I'm building a personal computer for high performance statistical computations and am

stackoverflow.com

AMD is tiny compared to Intel. They just don’t have the resources that Intel has to get it done. And nobody has really dedicated the full court effort to doing so for Arm. Yet. When I worked there, we had maybe one or two people who worked on things like that.

Nycturne · Jun 5, 2022

theorist9 said:
Got it. But is the article right that (even though optimizing code for RISC is not an issue), that code optimization is more critical for RISC than CISC?

I agree it’s the wrong premise, with OoO execution, micro-ops and other techniques, the CPU has a lot of control no matter the ISA. Microarchitecture seems more important to the final result than the ISA. The ISA does place some restrictions on the microarchitecture, but that has become less relevant over the years. And when people write higher level code, and not assembler, the ISA itself is an implementation detail left to the compiler, but you could very well make optimizations based on the microarchitecture’s behaviors if you really need to wring out every drop of performance.

The article is written from the perspective of microcontrollers which are usually years if not decades behind desktop/laptop chips, and even smartphone chips. When PPC/Pentium was the latest thing in the early 2000s, the microcontrollers I worked with were similar to the Z80. These days, the microcontrollers are starting to adopt ARM, but may be running on simpler cores and reliant on Thumb. I’m not even sure OoO is supported on some of these newer microcontrollers.

Cmaier said:
RISC is more forgiving - you have fewer registers, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.

I assume you meant something else with the bolded bit? You describe both x86 and RISC having fewer registers.

Cmaier · Jun 5, 2022

Nycturne said:
I agree it’s the wrong premise, with OoO execution, micro-ops and other techniques, the CPU has a lot of control no matter the ISA. Microarchitecture seems more important to the final result than the ISA. The ISA does place some restrictions on the microarchitecture, but that has become less relevant over the years. And when people write higher level code, and not assembler, the ISA itself is an implementation detail left to the compiler, but you could very well make optimizations based on the microarchitecture’s behaviors if you really need to wring out every drop of performance.

The article is written from the perspective of microcontrollers which are usually years if not decades behind desktop/laptop chips, and even smartphone chips. When PPC/Pentium was the latest thing in the early 2000s, the microcontrollers I worked with were similar to the Z80. These days, the microcontrollers are starting to adopt ARM, but may be running on simpler cores and reliant on Thumb. I’m not even sure OoO is supported on some of these newer microcontrollers.

I assume you meant something else with the bolded bit? You describe both x86 and RISC having fewer registers.

LOL, right. RISC has more registers, CISC has fewer (as a rule of thumb).

Yoused · Jun 5, 2022

One difference between x86-64 (which is truly what "CISC" means, since there are no other common CISC processors these days, just a few niche ones) and most RISC achitectures is that x86 has at least 6 special-purpose registers out of 16, whereas most RISC designs emphasize general-use registers. You can do geeral work with the most of the specailized registers, but when you need one of the special operations, those registers become out-of-play. ARMv8+ has two special-purpose registers out of its 32 GPRs, meaning the large register file has 30 registers that can be freely used.

Apple's processors have really big reorder buffers that allow instructions to flow around each other so that instructions that may take longer get folded under as other instructions execute around them. This is facilitated by the "A + B = C" instruction design, as opposed to the "A + B = A" design of x86 (register to register move operations are much less common in most RISC processors).

The reorder logic is complex and the flexibility of RISC means that a large fraction of actual optimization takes place in the CPU, so code optimization is literally becoming less of an issue for Apple CPUs. From my perspective, it looks like optimization is largely a matter of spreading the work over as much of the register file as possible in order to minimize dependencies, and trying to keep conditional branches as far as practical from their predicating operations. The processor will take care of the rest.

The article theorist9 linked is pretty weak sauce. From what I understand, its claim that RISC programs "need more RAM" is just not correct. The size difference in most major programs is in the range of margin-of-error, and some of the housekeeping that x86 code has to do is not necessary on RISC. The evidence for that is that Apple's M1 series machines work quite well with smaller RAM complements.

X86 vs. Arm

Power User

Power User

up

Site Master

Active member

Site Master

up

Site Master

up

Site Champ

up

Site Master

Site Champ

Site Master

Elite Member

Site Champ

Site Master

Elite Member

Site Master

up

Similar threads