X86 vs. Arm

Nycturne

Elite Member
Posts
1,136
Reaction score
1,483
It is likely the clock won’t have to be reduced at all. Generally, cooling capacity is proportional to surface area of the die. Since they are likely mirroring what they already have to make these bigger “chips,” they will also add proportionally-identical cooling capacity.
That's what I expect as well. I just deleted it from my post before hitting submit. :p
 

throAU

Site Champ
Posts
256
Reaction score
273
Location
Perth, Western Australia
Two things:

1) What makes you think they haven't "actually tried hard" for performance so far?
2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro.

While more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.

Sorry, maybe my choice of words wasn't the best.

They've certainly tried hard, but their focus has been on efficiency, not outright performance.

If they run desktop cooling on these things and have the freedom to run at Intel power levels (or even half that), performance could well be significantly better.

Maybe I should have said "actually push the clocks" or something. Right now I suspect we're seeing the M-series processors running at about 2/3 throttle vs. what they could ramp up to with desktop cooling and power delivery.

I suspect we will see more than N× scaling vs. the M1 Pro/Max because they seem to be scaling fairly linearly with core count, but in the larger machines Apple will have the room for better cooling and more power.

Maybe they won't; maybe they'll just go for the green/small/quiet option, but there's certainly room there to clock the things harder, and plenty of room inside even a half-size Mac Pro for cooling and power. The M1 Pro in my 14" barely even spins the fan and the heatsink is tiny. Give it proper cooling in a desktop and there's so much headroom...
 
Last edited:

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
Sorry, maybe my choice of words wasn't the best.

They've certainly tried hard, but their focus has been on efficiency, not outright performance.

If they run desktop cooling on these things and have the freedom to run at Intel power levels (or even half that), performance could well be significantly better.

Maybe I should have said "actually push the clocks" or something. Right now I suspect we're seeing the M-series processors running at about 2/3 throttle vs. what they could ramp up to with desktop cooling and power delivery.

I suspect we will see more than N× scaling vs. the M1 Pro/Max because they seem to be scaling fairly linearly with core count, but in the larger machines Apple will have the room for better cooling and more power.

Maybe they won't; maybe they'll just go for the green/small/quiet option, but there's certainly room there to clock the things harder, and plenty of room inside even a half-size Mac Pro for cooling and power. The M1 Pro in my 14" barely even spins the fan and the heatsink is tiny. Give it proper cooling in a desktop and there's so much headroom...

I’ve found it a little surprising that the clock rates have been pretty much identical across the board from phone to Mac. Every time we spun a chip, it was pretty much a given that we increased the max clock rate. My first job at AMD, after coming from Sun, was to get the ALUs in the K6-II to be 20% faster (I may have the number wrong; it was a long time ago). It took months of hard effort: hand-sizing each gate, drawing wires and taking them away from the router, moving cells, reformulating logic, etc. Maybe because of the environment in which Apple competes, where they need new chip microarchitectures annually instead of every 18-24 months, they just don’t bother.

I guess the short takeaway is unless they do a lot of work, the clock frequency won’t go up (unless they’ve been underclocking, or they have process improvements which move the speed bin distribution northward). You can’t just turn up the voltage and the clock and expect it to work. There’s almost always some stray logic path which doesn’t scale with everything else (usually many dozens of them).
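
As a toy illustration of that last point (made-up numbers, nothing like a real timing flow): the cycle time is set by the single slowest register-to-register path, so one path that doesn’t speed up caps the whole chip.

```python
# Toy illustration (invented delays): Fmax is limited by the single worst
# register-to-register path, so a voltage/clock bump only helps if *every*
# path speeds up along with it.
path_delays_ns = {
    "alu_adder":       0.80,
    "bypass_mux":      0.85,
    "cache_index":     0.90,
    "stray_scan_path": 1.20,   # the one path nobody re-tuned
}

def fmax_ghz(delays_ns):
    # The cycle time can't be shorter than the slowest path.
    return 1.0 / max(delays_ns.values())

print(f"Fmax before: {fmax_ghz(path_delays_ns):.2f} GHz")

# Suppose a voltage bump speeds up normal logic by 15%, but the stray path
# is dominated by something that barely scales (a long wire, say).
faster = {name: d * (0.97 if name == "stray_scan_path" else 0.85)
          for name, d in path_delays_ns.items()}
print(f"Fmax after:  {fmax_ghz(faster):.2f} GHz")  # barely moves
```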

Random aside - some of the weird stuff we did to speed up the chip included things like clock borrowing, where you delay the clock to a flip flop so that the path coming into the flip flop has more time than the path leaving the flip flop (or vice versa, speeding up the clock to a flip flop). This is not a great idea. Sometimes we would manually route wires, and give them extra spacing by blocking the neighboring routing tracks with dummy metal that we would remove before tape-out. When we had latches, the things we did were even more kludgy. Gave the static timing tools fits.
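
Rough sketch of the arithmetic behind the borrowing trick (all numbers invented): delaying the capture clock at one flop buys the incoming path some slack, but the path launched from that same flop starts late and gives exactly the same amount back.

```python
# Toy setup-slack arithmetic for clock borrowing (useful skew).
# All numbers are invented; a real flow checks this on millions of paths.
T_CLK   = 1.00   # clock period, ns
T_SETUP = 0.05   # flop setup time, ns

path_in  = 1.02  # data delay into the borrowed flop (misses timing as-is)
path_out = 0.80  # data delay from that flop to the next stage

def setup_slack(data_delay, capture_skew=0.0, launch_skew=0.0):
    # Positive slack = path makes timing. Delaying the capture clock buys
    # the path time; delaying the launch clock costs it time.
    return (T_CLK + capture_skew - launch_skew) - data_delay - T_SETUP

print(f"no borrow: in={setup_slack(path_in):+.2f}  out={setup_slack(path_out):+.2f}")

# Borrow 0.1 ns by delaying this flop's clock: the incoming path gains
# 0.1 ns of slack, and the outgoing path loses the same 0.1 ns.
borrow = 0.10
print(f"borrowed : in={setup_slack(path_in, capture_skew=borrow):+.2f}  "
      f"out={setup_slack(path_out, launch_skew=borrow):+.2f}")
```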
 

throAU

Site Champ
Posts
256
Reaction score
273
Location
Perth, Western Australia
I guess it could be that it won't clock faster; my experience is based on what Intel/AMD desktop CPUs will do given the headroom. With that prior experience it seems nuts that there wouldn't be performance left on the table at higher power/heat if they were willing to push it further. But like you say, maybe it isn't possible.

Is it perhaps that Apple has optimised the design to perform well at the clocks it runs at, within a given thermal/power envelope, and trade-offs have been made for that? i.e., were the trade-offs made in the original Firestorm/Icestorm cores for mobile to keep the clocks down?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
I guess it could be that it won't clock faster; my experience is based on what Intel/AMD desktop CPUs will do given the headroom. With that prior experience it seems nuts that there wouldn't be performance left on the table at higher power/heat if they were willing to push it further. But like you say, maybe it isn't possible.

Is it perhaps that Apple has optimised the design to perform well at the clocks it runs at, within a given thermal/power envelope, and trade-offs have been made for that? i.e., were the trade-offs made in the original Firestorm/Icestorm cores for mobile to keep the clocks down?

I don’t think they really made that sort of tradeoff. I think they designed the cores for around 3GHz because any more than that and they would need to put in more months of engineering or add more pipe stages (which has its own trade-offs), and that’s what they’ll stick with. Instead of putting the engineers to work on a revision that is 15% faster, those engineers go to work on the next microarchitecture or the next variation (Max, Pro, whatever).
 

mr_roboto

Site Champ
Posts
282
Reaction score
453
Another way of looking at it: M1 Pro/Max sustains 3036 MHz in both P clusters with all 10 cores loaded. That's 94% of single-core Fmax, which is 3228 MHz (up very slightly from 3204 on M1).
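
Just to show where that 94% comes from:

```python
# The ratio quoted above: all-core sustained clock vs. single-core Fmax.
all_core_mhz, single_core_mhz = 3036, 3228
print(f"{all_core_mhz / single_core_mhz:.1%}")  # ~94.1%
```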

That small a dropoff from peak single core performance is unheard of in Intel and AMD x86 chips. I suspect this is a distinction Apple wants to maintain. Some of those tradeoffs @Cmaier alluded to would send Apple down the same path Intel chose years ago, where chasing large YoY ST performance wins by uncapping the single core power budget has a long term side effect of creating severe efficiency problems and harming MT performance.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
When you go to a smaller process node, how much work is involved in dealing with wire noise? I mean, just shrinking the mask is not enough, right?
Correct. When you go to a new node you shrink the transistors by x%. The wires typically do not shrink by x%. It depends - the metal process often proceeds on its own tick-tock, independent of the transistors. More importantly, think about a wire. Its cross-section is a rectangle. You form capacitors to whatever is below you, above you, and to your left and right. So the *height* of the wires is important, and the height scales by some other factor (or not at all). Of course that also affects the wire resistance (which depends on the cross-sectional area of the wire).

Another thing people forget is that the minimum spacing between polygons may not (and usually does not) scale proportionally to the minimum feature width.

Then your voltage probably doesn’t scale by an equal percentage, either. So you are reworking pretty much everything. At AMD, toward the end of my time there, we got around this by designing for both nodes simultaneously, and giving up some performance in the current node.
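
A crude way to see why the wires don’t come along for free - parallel-plate hand-waving with made-up dimensions, ignoring fringing and everything else a real extraction tool handles:

```python
# Crude parallel-plate estimates of a wire's R and C, in arbitrary units.
# Made-up dimensions; real extraction models fringing, multiple layers, etc.
RHO = 1.0   # resistivity (arbitrary units)
EPS = 1.0   # permittivity (arbitrary units)

def wire_rc(length, width, height, spacing, ild):
    # Resistance: length over cross-sectional area (width x height).
    r = RHO * length / (width * height)
    # Sidewall caps to the left/right neighbors scale with the wire *height*;
    # plate caps to the layers above/below scale with its *width*.
    c = 2 * EPS * height * length / spacing + 2 * EPS * width * length / ild
    return r, c

# "Old" node vs. a shrink where width and spacing scale by 0.7x but the
# metal height and inter-layer dielectric barely move.
r0, c0 = wire_rc(length=100, width=1.0, height=2.0, spacing=1.0, ild=1.0)
r1, c1 = wire_rc(length=70,  width=0.7, height=1.9, spacing=0.7, ild=0.95)

print(f"wire RC, old node: {r0 * c0:.0f}")
print(f"wire RC, new node: {r1 * c1:.0f}  # only ~15% better, while the gates got much faster")
```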
 

Yoused

up
Posts
5,593
Reaction score
8,879
Location
knee deep in the road apples of the 4 horsemen
Suppose you took an A57-type thing (E core) and gave it four context frames (register sets, etc – around 100 64-bit words each) and hardware logic that could rotate the frames in and out of memory in the background (e.g., the CPU is running in frame 3 so the logic would be swapping frame 1). Each thread swap would cost about half the length of the pipe with ordering barriers. Would you come out ahead on a setup like that?
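
Back-of-the-envelope on what that swap cost buys (all cycle counts invented), compared against a conventional software context switch:

```python
# Back-of-envelope: overhead of a hardware frame rotation (cost ~ half the
# pipe length) vs. a conventional OS context switch. All numbers invented.
PIPE_DEPTH     = 16                 # stages in the hypothetical E core
HW_SWAP_COST   = PIPE_DEPTH // 2    # drain half the pipe, plus barriers
SW_SWITCH_COST = 1500               # trap + save/restore ~100 regs + kernel work

def useful_fraction(swap_cost, cycles_between_swaps):
    return cycles_between_swaps / (cycles_between_swaps + swap_cost)

for quantum in (200, 2_000, 20_000):   # cycles of useful work per thread turn
    print(f"quantum {quantum:>6}: "
          f"hw frames {useful_fraction(HW_SWAP_COST, quantum):.2%}, "
          f"sw switch {useful_fraction(SW_SWITCH_COST, quantum):.2%}")
# The hardware frames mostly pay off when swaps are very frequent, e.g.
# switching on cache misses rather than on timer ticks.
```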
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
Suppose you took an A57-type thing (E core) and gave it four context frames (register sets, etc – around 100 64-bit words each) and hardware logic that could rotate the frames in and out of memory in the background (e.g., the CPU is running in frame 3 so the logic would be swapping frame 1). Each thread swap would cost about half the length of the pipe with ordering barriers. Would you come out ahead on a setup like that?

I don’t think so?

If you have separate register files that are essentially sharing one set of ALUs and Load/Store units like that, you still have to deal with all the bypass logic. You have all these instructions in various stages of flight through the ALU, many of them prepared to bypass the RF to feed inputs into the next instructions, and then you pull the rug out from all that. You’ve got multiply instructions on cycle 3 of 5 (or whatever), so do you sit around and do nothing until those clear? Do you try and fill the other ALUs with single clock instructions while you wait? What happens to load instructions that are in mid flight? I guess they load into the dedicated RF for that context, but then you have some added gate delays between the cache and RFs that could get hairy when you have one context doing ALU ops but potentially multiple doing load/store.

Sort of reminds me a bit of SPARC register windows, by the way, which was another way of trying to get at this problem.
 

Andropov

Site Champ
Posts
615
Reaction score
773
Location
Spain
Random aside - some of the weird stuff we did to speed up the chip included things like clock borrowing, where you delay the clock to a flip flop so that the path coming into the flip flop had more time than the path leaving the flip flop (or vice versa, speeding up a clock to a flip flop). This is not a great idea.
What makes it not a great idea?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
What makes it not a great idea?
Two main reasons. First, the clock network is physically very different from the logic paths. Different types of circuit, different metal dimensions and shapes, etc. As a result, they don’t vary in the same manner from wafer to wafer, or even die to die across a big wafer. So on die number 1 you may have a clock with delay X and a logic path with delay Y, but on die 2, X may decrease by 1 percent while Y decreases by 3. This can cause unpredicted shifts in your binning.

The second reason is that the methodology we used to predict performance was called static timing, and it didn’t account so well for this trick. In static timing we use a tool to model the RC network on all the wires in the logic paths, then use a timing tool to predict what those Rs and Cs will do to the delay on each wire and the input-to-output delay of each logic gate. We don’t model the clock - you generally just set constraints. You tell the tool that the clock arrives at each flip flop every 1ns (or whatever, depending on your clock rate).
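
A grossly simplified sketch of that flow (toy numbers; real STA also deals with slew, derating, on-chip variation, and a lot more): the Rs and Cs become a wire delay, the library supplies gate delays, and the clock exists only as a constraint to compare against.

```python
# Grossly simplified static timing check: turn the extracted Rs and Cs into
# a wire delay (Elmore approximation), add the library gate delays, and
# compare against the clock period constraint. The clock itself isn't modeled.
def elmore_delay(segments):
    # segments: (R, C) pairs from driver to receiver; each resistor sees all
    # of the capacitance downstream of it.
    delay = 0.0
    downstream_c = sum(c for _, c in segments)
    for r, c in segments:
        delay += r * downstream_c
        downstream_c -= c
    return delay

gate_delays_ns = [0.12, 0.15, 0.10]               # from the cell library
wire_segments  = [(50.0, 0.002), (80.0, 0.003)]   # ohms, nF (invented)

path_delay = sum(gate_delays_ns) + elmore_delay(wire_segments)   # ohm * nF = ns

CLOCK_PERIOD_NS = 1.00   # the constraint: "the clock arrives every 1 ns"
SETUP_NS        = 0.05
slack = CLOCK_PERIOD_NS - SETUP_NS - path_delay
print(f"path delay {path_delay:.2f} ns, slack {slack:+.2f} ns")
```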

To do this trick we had to, essentially, manually muck with these clock constraints. This involved “tricking” the EDA flow to treat portions of the clock network as logic paths. It was very easy to mess up, and not all that accurate since we were only modeling a small portion of that clock circuit, and thus we had to guess at things like the input ramp speed. If we messed up - for example if the RCs were grossly wrong or even missing - it wasn’t always apparent.

And since this was being done by each block owner independently (this was before Cheryl and I put together a universal flow for everyone to use), there was no way to know if everyone was using best practices - it would be very tempting for someone to say “I made my timing” by hacking things, and there would be no way for us to know if they did it right.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
By the way, at the other place, if you see rukia posting stuff, you should believe it. I can confirm he definitely worked at AMD (after I left), and also Apple. He certainly is more up-to-date than I am.

Last night he posted about the TLB bug in K10 - I hadn’t been aware of that story since I left before K10 got kicked off (though I famously predicted at the other place that K10 would not go well, based on the way AMD was swapping in new personnel and forcing talented people out, and the plans that the new people had to put in place a new design methodology that seemed like a bad idea).

Apparently this wasn’t public until he posted it, but the bug was a hold-time violation (I think I’ve mentioned that sort of issue here before - if you speed up gates and wires too much, which seems like a good thing, it can cause a horrible sort of bug where no matter what clock rate you set, you can’t get the chip to function. This is because the results of a calculation do not remain stable long enough to be captured by memory elements).
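
A toy version of the hold check (invented numbers) shows why it’s so nasty: the clock period never appears in it, so no clock setting gets you out.

```python
# Toy hold check. Note that the clock period appears nowhere, which is why
# a hold violation can't be fixed by running the chip slower.
T_HOLD     = 0.06   # ns: data must stay stable this long after the clock edge
CLOCK_SKEW = 0.10   # ns: capture flop's clock arrives later than the launch flop's
DATA_DELAY = 0.12   # ns: the path that turned out too fast

hold_slack = DATA_DELAY - CLOCK_SKEW - T_HOLD
print(f"hold slack: {hold_slack:+.2f} ns")   # negative => broken at any frequency
```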

Anyway, the details of why the bug occurred fascinate me. The issue, he says, is that AMD was designing its own standard cells, in this case a multiplexor cell. A cell has designated “pins,” which are areas you can connect wires to in order to connect to the cell. In this case they didn’t use the normal pin structure, and they connected directly to the source or drain of transistors. The flow which characterizes the timing behavior of the cells so that the timing can be fed as an input into the static timing tools apparently didn’t account for that behavior, and chaos ensued.

The reason this fascinates me is that my new manager, hired in the year or so before I left, introduced himself to me by saying “I hear you’re pretty indispensable around here.” I thanked him. In response he said “The graveyards of Europe are full of indispensable people.”

I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.

And this result is what you would expect to follow from that.
 

throAU

Site Champ
Posts
256
Reaction score
273
Location
Perth, Western Australia
I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.

Massive trap in any industry, but definitely in tech.

I'm a network architect and have experience/knowledge across VM and storage. I'm not what you would call a specialist in any single field of datacenter stuff, but I know enough about most of the moving parts to know what I don't know (and get specialists to help me with those bits). So many times I've dealt with contractors who are specialists in their single field, but they can't see the forest for the trees (and I end up diagnosing the issue when specialist A blames the part handled by specialist B and vice versa).

You NEED people who have a higher-level view of how the pieces fit together (and who also act as a specialist-spewed-BS detector). You can't just swap engineers out like components: much as we might like the theory that every individual component is abstracted away behind a nice clean interface to every other component, that's simply not how it works in reality.

Bugs, quirks, design compromises, performance hacks, environmental conditions - whatever you want to call them - mean that it just never works out that way.
 
Last edited:

Andropov

Site Champ
Posts
615
Reaction score
773
Location
Spain
I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.
Big companies don't like indispensable people. My current employer is switching from a small group of very talented people to a 'throw people at the problem' approach too. Mainly to increase the number of new features that can be developed each week, but also to avoid parts of the codebase being known by only one person. The obvious problem is that, just as in computers, real-life tasks also have limits to their parallelism.

And as @throAU says, ultimately you NEED some people to have a general view of the project. Good documentation also helps with decision-making, so you don't have to reverse-engineer someone else's code every time you make a decision that affects other parts of the codebase. Or you can have zero documentation and throw even more people at the problem.
 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,673
Reaction score
8,941
Main Camera
iPhone
I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.
Big companies don't like indispensable people. My current employer is switching from a small group of very talented people to a 'throw people at the problem' approach too.

There's not a lot I can contribute to this thread not being involved in cpu design. But damn, both of the above comments certainly resonated and brought back memories.

Way back I was part of a five-engineer startup (which grew to ten over time) that designed full-custom CMOS, hand-laid-out, high-speed, communications-oriented signal processing ASICs in the San Francisco Bay Area. We had a small family of chips fabbed at ES2 and Atmel: multi-channel wideband and narrowband digital downconverters (used in radios), digital upconverters (used in transmitters), digital filters, QAM demodulators, power amplifier pre-distortion linearisers, etc. We were pretty lean, but in a very good way - which our customers liked a lot. Our competitors, Analog Devices and Harris Semi, couldn't touch our tech.

As cellular telecom started picking up a lot of steam, our customer base shifted from defense/aerospace/scientific use to cellular infrastructure (which was realizing the benefits of digital radio over analog in base stations - especially beam-formed ones). And with that came increased attention from a couple of large semiconductor manufacturers. One eventually acquired us.

For me, the first year or two under new ownership was pretty good and interesting, being able to propose and develop a new device for a large cellular infrastructure company, plus a version for general use. After that the bureaucracy became heavy, and I eventually left once my four-year obligation was up, for the reasons in the comments above. Looking back, that was one of my best decisions.
 

Colstan

Site Champ
Posts
822
Reaction score
1,124
This is something that I haven't thought about in decades, but when I noticed this particular article on phoronix, I thought Cmaier might enjoy a trip down memory lane, depending on his involvement. What I found striking about the article is that I had always thought about extensions to the x86 ISA as being like a singularity; once something goes in, it never comes out. So much cruft and barnacles have built up over the decades, that it never really occurred to me that compatibility might be intentionally removed. Recently, we've contrasted this philosophy with how Apple handled the switch to Apple Silicon with the Mac and the resultant changes. Sure, there's been some pain from the removal of 32-bit support and other relatively minor issues with switching to ARM, while using the transition as a useful excuse to clean out the gutters, but it's a lot better than still booting into Real Mode, or whatever oddball revenants still lurk within modern x86 CPUs. I get that compatibility is king with Windows and x86, because you never know when you might need to pull up that proprietary spreadsheet application written in 1985, but I wonder why this house of cards hasn't been removed (or collapsed) long ago. The engineering teams that work on these chips must get tired of dealing with shoehorned kludges and would like a clean break. Perhaps that's some of the appeal of working for Apple and other modern RISC vendors: no longer having to worry about legacy garbage. Regardless, we may have lost 3DNow!, but at least there's still SSE4a.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,295
Reaction score
8,453
This is something that I haven't thought about in decades, but when I noticed this particular article on phoronix, I thought Cmaier might enjoy a trip down memory lane, depending on his involvement. What I found striking about the article is that I had always thought about extensions to the x86 ISA as being like a singularity; once something goes in, it never comes out. So much cruft and barnacles have built up over the decades, that it never really occurred to me that compatibility might be intentionally removed. Recently, we've contrasted this philosophy with how Apple handled the switch to Apple Silicon with the Mac and the resultant changes. Sure, there's been some pain from the removal of 32-bit support and other relatively minor issues with switching to ARM, while using the transition as a useful excuse to clean out the gutters, but it's a lot better than still booting into Real Mode, or whatever oddball revenants still lurk within modern x86 CPUs. I get that compatibility is king with Windows and x86, because you never know when you might need to pull up that proprietary spreadsheet application written in 1985, but I wonder why this house of cards hasn't been removed (or collapsed) long ago. The engineering teams that work on these chips must get tired of dealing with shoehorned kludges and would like a clean break. Perhaps that's some of the appeal of working for Apple and other modern RISC vendors: no longer having to worry about legacy garbage. Regardless, we may have lost 3DNow!, but at least there's still SSE4a.

LOL. Yeah, I was there for the 3dNow! stuff. Good riddance.
 