X86 vs. Arm

Yoused

up
Posts
5,620
Reaction score
8,937
Location
knee deep in the road apples of the 4 horsemen
I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.

The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.

Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?

Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.

Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,326
Reaction score
8,512
I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.

The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.

Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?

Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.

Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?
On x86 the fp is essentially treated like a coprocessor. Also ieee floating point is not the same as x87 floating point. I don’t know what spec does about that.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.

The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.

Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?
AnandTech did the SPECint and SPECfp benchmarking on M1 which most people cite - they're not official runs submitted to the database, but they use reasonably good and fair methodology. Wouldn't be surprised if the numbers you saw, and possibly even the chart, were from AT.

The guy who did AT's SPEC testing for a long time, Andrei F., thinks M1 Pro/Max SPECfp scores are explained by exceptionally high scores on several benchmarks in the suite which are usually bottlenecked by memory bandwidth rather than raw FLOPS. M1 SoCs are exceptionally good at allowing even individual CPU cores to use lots of bandwidth. In the benchmarks which aren't BW-limited, M1 scores fall back to earth a bit - respectable but not extraordinary.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
On x86 the fp is essentially treated like a coprocessor. Also ieee floating point is not the same as x87 floating point. I don’t know what spec does about that.
Everything in x87 is legal IEEE 754, just a bit weird compared to all other surviving commercially important IEEE 754 implementations. This is because they implemented a feature actually recommended by 754: "extended precision". In x87, this means that while numbers stored in RAM are in the standard 32-bit or 64-bit IEEE formats, on load they're expanded to an internal 80-bit format, and on store, these 80-bit values are rounded to 64-bit or 32-bit.

Extended precision isn't a bad thing, really. It makes chains of operations conducted without memory spills more precise. In practical terms, though, it can make porting code a little bit more exciting. IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.

But x87 oddness doesn't matter anymore. Back when Apple switched from PowerPC to x86, they did something which Windows and Linux eventually did too: they set up their ABIs and compilers to use SSE2 for scalar FP and never touch x87 at all. SSE2 doesn't implement extended precision, and treats its registers purely as registers rather than a weird stack-like structure. (which is what I think you're talking about when you mention the "coprocessor" thing?)
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,326
Reaction score
8,512

Everything in x87 is legal IEEE 754, just a bit weird compared to all other surviving commercially important IEEE 754 implementations. This is because they implemented a feature actually recommended by 754: "extended precision". In x87, this means that while numbers stored in RAM are in the standard 32-bit or 64-bit IEEE formats, on load they're expanded to an internal 80-bit format, and on store, these 80-bit values are rounded to 64-bit or 32-bit.

Extended precision isn't a bad thing, really. It makes chains of operations conducted without memory spills more precise. In practical terms, though, it can make porting code a little bit more exciting. IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.

But x87 oddness doesn't matter anymore. Back when Apple switched from PowerPC to x86, they did something which Windows and Linux eventually did too: they set up their ABIs and compilers to use SSE2 for scalar FP and never touch x87 at all. SSE2 doesn't implement extended precision, and treats its registers purely as registers rather than a weird stack-like structure. (which is what I think you're talking about when you mention the "coprocessor" thing?)

I designed an SSE unit, an IEEE floating point unit, and an x86 floating point unit. X86 floating point unit was weird. IEEE floating point unit was the hardest to design, because it was for powerpc and the bits were numbered in reverse (so bit 0 was the highest order bit), which caused great mental gymnastics converting from wire names to … math.
 

thekev

Elite Member
Posts
1,110
Reaction score
1,674
Extended precision isn't a bad thing, really. It makes chains of operations conducted without memory spills more precise. In practical terms, though, it can make porting code a little bit more exciting. IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.


Extended precision may not be a bad thing in itself, but making the exact solution to a problem involving floating point arithmetic dependent on the register allocator is just the worst kind of nonsense. It means that any spill to memory, even if compiler generated, impacts the answer to chained arithmetic.
 

Yoused

up
Posts
5,620
Reaction score
8,937
Location
knee deep in the road apples of the 4 horsemen
Wouldn't be surprised if the numbers you saw, and possibly even the chart, were from AT.
I went and looked at the chart and then went to AT to compare the logo (as I was not acquainted with it) and
squirrel!
got stuck reading a fascinating piece about IBM turning fat L2s into an interconnected L2/L3/L4 structure. Curse you for tricking me into getting ensnared by them. ;)
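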
 

leman

Site Champ
Posts
639
Reaction score
1,188
Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?

Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.

Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?

M1 simply has more FP units. It has four independent FP units, while x86 CPUs mostly have two full-function units (and maybe some more capable of only limited functionality). On general-purpose FP code, without much vectorization, M1 can on average execute more operations simultaneously, especially when you combine that with its humongous out-of-order window.

The situation is a bit different when looking at SIMD code. M1's FP units are 128 bits wide, so you basically get 512 bits worth of SIMD ops. The units on modern x86 are 256 or even 512 bits wide (on some Intel CPUs). The net result is about the same, but x86 - especially desktop x86 - often runs at a higher clock and has more cache bandwidth. So on high-throughput, tightly optimized SIMD code, M1 will often be slower than a desktop x86 running at a high base clock.

Overall, M1 is a more flexible architecture in this regard, while also focusing on power efficiency. Apple deliberately trades some of that SIMD throughput to deliver a CPU that will perform better on real world code while also consuming much less power.
 
Last edited:

mr_roboto

Site Champ
Posts
288
Reaction score
464
I designed an SSE unit, an IEEE floating point unit, and an x86 floating point unit. X86 floating point unit was weird. IEEE floating point unit was the hardest to design, because it was for powerpc and the bits were numbered in reverse (so bit 0 was the highest order bit), which caused great mental gymnastics converting from wire names to … math.
Sorry for the dumb lecture then, sometimes I have issues understanding where people are coming from.

I hate PPC bit numbering too. I encountered it not in processor design, but when designing a PPC single board computer a long time ago. IIRC the bitfield manipulation instructions also use it, which has got to be awful to deal with for programmers.
 

Andropov

Site Champ
Posts
617
Reaction score
776
Location
Spain
Extended precision may not be a bad thing in itself, but making the exact solution to a problem involving floating point arithmetic dependent on the register allocator is just the worst kind of nonsense. It means that any spill to memory, even if compiler generated, impacts the answer to chained arithmetic.
This. I don't know much about Intel's 80-bit precision numbers, but I think it's weird that compilers (by default) are so extremely careful about not rearranging fp operations to avoid weird or non-portable results and then have something like that on the CPU that can mess your fp expectations in a similar way.
 

leman

Site Champ
Posts
639
Reaction score
1,188
This. I don't know much about Intel's 80-bit precision numbers, but I think it's weird that compilers (by default) are so extremely careful about not rearranging fp operations to avoid weird or non-portable results and then have something like that on the CPU that can mess your fp expectations in a similar way.

That’s why nobody uses the x87 stuff anymore. It’s slow, it’s awkward, it has complex logic. Since we got streamlined SIMD units with full IEEE support and (mostly) sane behavior, x87 became obsolete. And even if you need more precision than what fp64 can give you, you are probably still better off using multiprecision algorithms.
 

throAU

Site Champ
Posts
257
Reaction score
274
Location
Perth, Western Australia
The other thing with comparing SPECfp numbers is that the M1 chips have a pretty competent GPU that does floating point onboard. They've also got other specialist engines for doing specific tasks at scale - and more importantly, they are running on a platform with libraries that will make use of them, at the same time as the CPU does something else.

Benchmarks claiming "oh look how fast Alder Lake is on this" kinda miss the point. Unfortunately there aren't really cross-platform benchmarks for Alder Lake and Apple Silicon, as the libraries to use all the coprocessors do not exist on Windows/Linux, and Alder Lake doesn't exist as a properly supported platform in macOS.

It's the performance of the platform at running applications that counts, and whilst benchmarks can give you a little bit of an idea on that, they're not the complete picture.

I mean, with the media engines, M1 Pro/Max can transcode a whole heap of video in the background whilst the CPU/GPU is idle. Try that on Alder Lake? Even if the CPU supported it, Windows doesn't. Sure, it's not an every-man use case, but there's a bunch of engines in the M1 SoCs that handle a bunch of things regular people actually use.
 

Nycturne

Elite Member
Posts
1,137
Reaction score
1,484
I mean, with the media engines, M1 Pro/Max can transcode a whole heap of video in the background whilst the CPU/GPU is idle. Try that on Alder Lake? Even if the CPU supported it, Windows doesn't. Sure, it's not an every-man use case, but there's a bunch of engines in the M1 SoCs that handle a bunch of things regular people actually use.

Alder Lake does support it. It isn’t a new feature either, with Apple using it for years as part of the Video Toolbox API. The catch is more that in general, the hardware encode blocks are fast, but not horribly flexible. If it doesn’t support the codec you want to use, you are SOL. So the main advantage here that Apple has is that they don’t have to wait on Intel for certain codecs (ProRes), and they can tune it for their cases more specifically, even if it isn’t as efficient on final size at the same quality as x265 for HEVC/H.265 video.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
Aha. When I posted earlier I was half remembering something, but wasn't sure about it so I didn't say. This bugged me, so I did some searching and confirmed that my memory isn't entirely swiss cheese (yet).

The principal architect of IEEE 754, William Kahan, was actually involved in the design of the 8087. In fact, IEEE 754 was derived from work Kahan did for 8087!


This interview segment from the Turing Award page discusses how Kahan got involved with 754, and its relationship with 8087:

See also: (many valuable insights into the traps which lie at the bottom of any attempt to approximate the continuum with a finite number of digits)


TLDR summary: an ex-student of Kahan's got hired by Intel, ended up in charge of FP, and came back to Kahan to ask for help in specifying how x86 would do floating point. Kahan was also getting involved with the 754 standards process, decided the 8087 work (its numerics, not the ISA) was the right thing to build 754 on, convinced Intel to let him show it to the standards committee, and then set to work convincing everyone it was not merely a good idea but possible to build economically (which he already knew, because he'd designed it for Intel's mass market chip).

Some of these interviews and so forth discuss the rationale for extended precision. You may or may not agree, but Kahan clearly thinks it's a good idea, which is why it's promoted as a desirable optional feature by IEEE 754 even if few modern 754 implementations have it.
 

Yoused

up
Posts
5,620
Reaction score
8,937
Location
knee deep in the road apples of the 4 horsemen
Some of these interviews and so forth discuss the rationale for extended precision. You may or may not agree, but Kahan clearly thinks it's a good idea, which is why it's promoted as a desirable optional feature by IEEE 754 even if few modern 754 implementations have it.

As I recall, M68000 had a 96-bit format, which was actually just an 80-bit format padded with 16 extra bits between the E and the M in order to make it fill three 32-bit words. And I believe PPC tacked 3 bits onto the tail of its numbers in the registers for accuracy's sake (which might explain their backward numbering scheme).
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,326
Reaction score
8,512
As I recall, M68000 had a 96-bit format, which was actually just an 80-bit format padded with 16 extra bits between the E and the M in order to make it fill three 32-bit words. And I believe PPC tacked 3 bits onto the tail of its numbers in the registers for accuracy's sake (which might explain their backward numbering scheme).

Nah, the backward numbering scheme also existed for integers, if I recall correctly. (I owned only the FPU on the x704. Though that took half the core real estate - https://en.wikichip.org/wiki/File:x704_floorplan.jpg). Interesting as I look at this floor plan that it isn’t quite right. There was an NP block (numerical processor) where the FPU is, so that’s right. But there was an NPI (numerical processor interface) block that I also owned which takes up space that is assigned to other block here.

I don’t remember for sure, but there’s a very good chance I drew this floor plan myself, since I was responsible for the JSSC paper on the chip. So I guess I fudged.
 

emagnuson

New member
Posts
1
Reaction score
3
Aha. When I posted earlier I was half remembering something, but wasn't sure about it so I didn't say. This bugged me, so I did some searching and confirmed that my memory isn't entirely swiss cheese (yet).

The principal architect of IEEE 754, William Kahan, was actually involved in the design of the 8087. In fact, IEEE 754 was derived from work Kahan did for 8087!
The above post prompted me to join up and join in.

I saw Prof. Kahan give a talk on numerical accuracy either in 1976 or 1977 at Cal. A good portion of the talk compared the way HP and TI calculators did arithmetic. At that time, TI advertised that you could take the logarithm of a number and then get the same number back when taking the exponential of the logarithm, whereas HP calculators would give a slightly different number. Kahan explained that TI used a total of 13 decimal digits for calculations while displaying 10 digits, and HP just used 10 digits and rounded after the calculation. He then went on to say that taking the logarithm and then the exponential of the rounded value would give back the rounded value, while if you did enough cycles with the TI calculator, you would start getting a different number.

Kind of surprised me that he would go for extended precision, but it wasn't surprising to hear that his intent with the 8087 arithmetic was to make it easy to use "pencil and paper arithmetic" and not have to worry about the finer points of numerical analysis.

One other aspect of the 8087 that puzzled me was the emphasis on partial tangents and partial arctangents, along with the reference to CORDIC. I finally got around to reading up on CORDIC circa 2010, and the partial tangent and partial arctangent made a lot of sense.
 

Yoused

up
Posts
5,620
Reaction score
8,937
Location
knee deep in the road apples of the 4 horsemen
Someone over at tOP repeated that bullshit line about how x86 has a "RISC-like processor core", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,326
Reaction score
8,512
Someone over at tOP repeated that bullshit line about how x86 has a "RISC-like processor core", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.
It’s a very popular line by people who know just enough but not really enough. Yeah, we get it, the bits of the instruction no longer go directly into mux inputs in an ALU to control the adder. They haven’t done that since the 80186, in fact, so not sure which iteration of x86 suddenly became “risc-like.”
 

jbailey

Power User
Posts
170
Reaction score
187
Someone over at tOP repeated that bullshit line about how x86 has a "RISC-like processor core", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.
I posted a nuanced article in reply to a reply. I think it does a good job describing the history and current status without making any real predictions about which side of the argument will ultimately win.

RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs
 