Intel Hyperthreading vs Apple M1

somerandomusername

Power User
Posts
96
Reaction score
41

Apparently Intel may be ditching Hyperthreading. I was under the impression ARM doesn’t do Hyperthreading because of something x86 can do or has that ARM doesn’t.

What exactly does the lack of Hyperthreading mean in a traditional x86 system? And how does that compare with the M1?


@Cmaier

Can you please speculate about what this means?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521

Apparently Intel may be ditching Hyperthreading. I was under the impression ARM doesn’t do Hyperthreading because of something x86 can do or has that ARM doesn’t.

What exactly does the lack of Hyperthreading mean in a traditional x86 system? And how does that compare with the M1?


@Cmaier

Can you please speculate about what this means?

As I’ve said around here and at the other place, there’s no particular reason that x86 is more or less suited to hyperthreading than Arm. However, as I’ve also said, when I see hyperthreading in a design, that tells me that there is likely inefficiency in the microarchitecture, and that hyperthreading is a band-aid to get fuller utilization out of the functional units (logic and math units) in the design.

I’ve previously pointed out that Apple’s chips, at least, seem to have a very high utilization for their ALUs. This can be due, in part, to careful choices about the number of pipelines and what each pipeline is capable of doing, and, more importantly, due to very efficient instruction scheduling aided by a very wide instruction issue and a very deep look into the instruction stream to determine what instructions are coming next. Wide issue is easier to do on RISC chips because it is far easier to decode the incoming instruction stream (because all instructions are the same length [or at least integer multiples of that length]).

Lately Intel has been going for wider issue, which is possible, but painful, in x86. It requires a lot of resources at the instruction fetch stage, and a lot of speculative buffering and stuff. If you make wrong guesses, you may have to rewind the pipeline. You can compensate for these things with yet more transistors. So the net effect is really that you end up burning more power than a RISC equivalent, but can still get wide issue in x86 (when I say x86, I am always including AMD64 or x86-64 or whatever you want to call it). Intel may have decided that the trade-off makes sense now, especially given that it’s easy to include lots of transistors now.
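
To make the decode point a bit more concrete, here is a toy software sketch (this is not any real ISA; the fixed 4-byte length and the made-up length rule are purely illustrative). With fixed-length instructions, every decoder slot can compute its own start offset independently; with variable-length instructions, each start offset depends on having worked out the length of the previous instruction first.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Toy illustration only; this is not any real ISA. */

/* Fixed-length case: every instruction is 4 bytes, so the start of the i-th
 * instruction in a fetch block is just i * 4. Each decoder slot can compute
 * its start offset independently of all the others. */
static void fixed_length_starts(size_t n, size_t starts[])
{
    for (size_t i = 0; i < n; i++)
        starts[i] = i * 4;
}

/* Variable-length case: the length is only known after inspecting the bytes
 * (here a made-up rule: the low 2 bits of the first byte encode length 1..4).
 * Each start offset depends on the previous instruction's length, so the
 * scan is inherently serial unless you add hardware to guess ahead. */
static size_t toy_length(uint8_t first_byte)
{
    return (size_t)(first_byte & 0x3) + 1;
}

static size_t variable_length_starts(const uint8_t *bytes, size_t nbytes,
                                     size_t starts[], size_t max_insns)
{
    size_t off = 0, count = 0;
    while (off < nbytes && count < max_insns) {
        starts[count++] = off;
        off += toy_length(bytes[off]);  /* must decode this one to find the next */
    }
    return count;
}

int main(void)
{
    uint8_t block[16] = { 0x02, 0xAA, 0xBB, 0x01, 0xCC, 0x00, 0x03, 0x11,
                          0x22, 0x33, 0x00, 0x02, 0x44, 0x55, 0x00, 0x00 };
    size_t fixed[4], var[16];

    fixed_length_starts(4, fixed);
    size_t n = variable_length_starts(block, sizeof block, var, 16);

    printf("fixed-length starts:    %zu %zu %zu %zu\n",
           fixed[0], fixed[1], fixed[2], fixed[3]);
    printf("variable-length starts: %zu %zu %zu ... (%zu instructions total)\n",
           var[0], var[1], var[2], n);
    return 0;
}
```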

A second factor is that some of what was being accomplished by multithreading is much more efficiently accomplished by just adding more real cores. In the 1990’s, the cores took up 80 or 90% of the chip area that wasn’t L2 cache. Now the CPU cores may be a small fraction of the die area. Adding more is relatively cheap, and, all things considered, if you can choose to run your thread on a separate core or run it in a timeslice on a shared core, you would be better off on a separate core most of the time.
 

Yoused

up
Posts
5,623
Reaction score
8,942
Location
knee deep in the road apples of the 4 horsemen
Currently, Intel's P cores have HT while the E cores do not. P cores take up about 5 times the area of E cores but are not all that much better in upper-end performance. It seems likely that Intel is considering eliminating the P core from most of its processors: you can get basically all of the performance from E cores, without any HT overhead, so you can cram a lot of E cores onto a chip and get better MT performance than with fewer than half as many P cores.

The irony with HT is that it works fairly well with heterogeneous code, whereas the high-load work that is supposed to be handled on HT cores tends to be less structurally complex and more homogeneous (e.g., flinging vectors around). In other words, HT is really great where it is not really needed, but yields sub-200% output on a split core where you do want it.

Curiously, IBM's POWER10 has eight-way SMT (four times what x86 offers) and seems to do well with that. POWER10 is a server processor, though, so its workload needs are very different. I think IBM's SMT is more about optimizing EU resources for steady workloads.
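
To put rough numbers on that trade-off, here is a quick back-of-the-envelope sketch in C. Only the roughly 5x area ratio comes from the paragraph above; the per-core performance and HT-uplift figures are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Rough back-of-the-envelope for the area vs. MT-throughput trade-off.
 * Only the ~5x P-core vs. E-core area ratio comes from the post above;
 * the per-core throughput numbers are made up purely for illustration. */
int main(void)
{
    double p_core_area = 5.0;   /* E-core areas per P core (from above)      */
    double p_core_perf = 1.4;   /* assumed: P core vs. E core, single thread */
    double ht_uplift   = 1.25;  /* assumed: HT gain on a P core (sub-2x)     */
    double budget      = 20.0;  /* area budget, expressed in E-core units    */

    double n_p = budget / p_core_area;           /* 4 P cores in that area   */
    double n_e = budget;                         /* 20 E cores in that area  */

    double mt_p = n_p * p_core_perf * ht_uplift; /* 4 * 1.4 * 1.25 = 7.0     */
    double mt_e = n_e * 1.0;                     /* 20 E cores = 20.0        */

    printf("Same area: %.0f P cores with HT -> %.1f, %.0f E cores -> %.1f\n",
           n_p, mt_p, n_e, mt_e);
    return 0;
}
```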
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,148
As I’ve said around here and at the other place, there’s no particular reason that x86 is more or less suited to hyperthreading than Arm. However, as I’ve also said, when I see hyperthreading in a design, that tells me that there is likely inefficiency in the microarchitecture, and that hyperthreading is a band-aid to get fuller utilization out of the functional units (logic and math units) in the design.

I’ve previously pointed out that Apple’s chips, at least, seem to have a very high utilization for their ALUs. This can be due, in part, to careful choices about the number of pipelines and what each pipeline is capable of doing, and, more importantly, due to very efficient instruction scheduling aided by a very wide instruction issue and a very deep look into the instruction stream to determine what instructions are coming next. Wide issue is easier to do on RISC chips because it is far easier to decode the incoming instruction stream (because all instructions are the same length [or at least integer multiples of that length]).

Lately Intel has been going for wider issue, which is possible, but painful, in x86. It requires a lot of resources at the instruction fetch stage, and a lot of speculative buffering and stuff. If you make wrong guesses, you may have to rewind the pipeline. You can compensate for these things with yet more transistors. So the net effect is really that you end up burning more power than a RISC equivalent, but can still get wide issue in x86 (when I say x86, I am always including AMD64 or x86-64 or whatever you want to call it). Intel may have decided that the trade-off makes sense now, especially given that it’s easy to include lots of transistors now.

A second factor is that some of what was being accomplished by multithreading is much more efficiently accomplished by just adding more real cores. In the 1990’s, the cores took up 80 or 90% of the chip area that wasn’t L2 cache. Now the CPU cores may be a small fraction of the die area. Adding more is relatively cheap, and, all things considered, if you can choose to run your thread on a separate core or run it in a timeslice on a shared core, you would be better off on a separate core most of the time.
What I find interesting … actually kind of odd … is that the reason given in the article for dropping it is that “Intel can’t get HT working” with Lion Cove P-cores, rather than it being unnecessary with wider execution (although the article linked in the original article does talk about that, and I’ve linked it below). It’s possible that’s a leak/telephone error, but if it’s true, I wonder what the issue might be. The article below gives more details, including something about future core plans and “rentable units”:

 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521
What I find interesting … actually kind of odd … is that the reason given in the article for dropping it is that “Intel can’t get HT working” with Lion Cove P-cores, rather than it being unnecessary with wider execution (although the article linked in the original article does talk about that, and I’ve linked it below). It’s possible that’s a leak/telephone error, but if it’s true, I wonder what the issue might be. The article below gives more details, including something about future core plans and “rentable units”:

Hard to know what that is supposed to mean, but obviously they could get it working if they wanted to. Of course there are other potential downsides, including susceptibility to certain classes of side-channel attacks, area penalty, etc.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
Wide issue is easier to do on RISC chips because it is far easier to decode the incoming instruction stream (because all instructions are the same length [or at least integer multiples of that length]).
Honestly I think x86 decode would be fine with a more sane encoding scheme where the first part of the instruction always contains enough information to determine how long the whole instruction is. But it's stuck with its insane prefix/postfix scheme, where you have to step through a variable number of prefix bytes one-by-one and get to the opcode before you know how many bytes there are in the instruction.

Not only does this make parallel decode hard, it also leads to bugs!
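
Roughly what that length-finding problem looks like if you write it out in software (grossly simplified, and nowhere near a real x86 decoder; there is no ModRM/SIB/displacement/immediate handling here, just the prefix-walking part):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Grossly simplified sketch of the length-finding problem. This is NOT a
 * real x86 decoder (real decode also involves ModRM, SIB, displacement and
 * immediate fields, escape opcodes, REX/VEX, etc.). The only point is that
 * you have to walk an unknown number of prefix bytes serially before you
 * even reach the opcode that tells you how long the instruction is. */

static bool is_prefix(uint8_t b)
{
    /* A handful of real legacy prefix byte values, just as examples. */
    return b == 0x66 || b == 0x67 || b == 0xF0 ||
           b == 0xF2 || b == 0xF3 || b == 0x2E || b == 0x3E;
}

/* Returns the offset of the (first) opcode byte within the instruction. */
static size_t find_opcode_offset(const uint8_t *insn, size_t max_len)
{
    size_t i = 0;
    while (i < max_len && is_prefix(insn[i]))
        i++;               /* serial: each step depends on the previous byte */
    return i;
}

int main(void)
{
    /* 0xF0 and 0x66 are prefixes here; 0x01 stands in for "some opcode". */
    uint8_t insn[] = { 0xF0, 0x66, 0x01, 0xC8 };
    printf("opcode found at byte %zu\n", find_opcode_offset(insn, sizeof insn));
    return 0;
}
```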

 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521
Honestly I think x86 decode would be fine with a more sane encoding scheme where the first part of the instruction always contains enough information to determine how long the whole instruction is. But it's stuck with its insane prefix/postfix scheme, where you have to step through a variable number of prefix bytes one-by-one and get to the opcode before you know how many bytes there are in the instruction.

Not only does this make parallel decode hard, it also leads to bugs!

That would certainly help a lot. But it would still be problematic to allow lengths that aren’t integer multiples of some fixed size, simply because when you read from memory you have to align things. So, let’s say you read 8 bytes at a time, and the next instruction, let’s say, starts 11 bytes in. Even if you could look at the first byte and figure out you need byte 11 for the next instruction, you’d still then have to either shift byte 11 into byte 0 of some structure (to be consumed in the succeeding logic), or make your logic able to read any byte out of the 8 (or however big your buffer is), which involves a bunch of multiplexing, which costs a bunch of gate delays. You could pretty easily end up adding a pipe stage just to deal with all that, and that is performance/power expensive any time you have to flush the pipelines.
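
Here is a software analogue of that byte-shifting step, purely for illustration; in hardware, the memcpy below corresponds to a wide byte multiplexer/shifter, and every extra level of muxing is gate delay.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Software analogue of the alignment problem described above: the next
 * instruction starts at some arbitrary byte offset inside the fetch buffer,
 * and before the downstream logic can consume it, its bytes have to be
 * shifted down so the first byte sits at position 0. The memcpy below is the
 * software stand-in for what is, in hardware, a wide byte multiplexer and
 * shifter; every extra level of muxing costs gate delay. */

#define FETCH_BYTES 16   /* assumed fetch-buffer width, for illustration */

static void align_instruction(const uint8_t fetch_buf[FETCH_BYTES],
                              size_t start, size_t len,
                              uint8_t aligned_out[FETCH_BYTES])
{
    if (start + len > FETCH_BYTES)      /* instruction spills into the next  */
        len = FETCH_BYTES - start;      /* fetch block: even more trouble    */
    memcpy(aligned_out, fetch_buf + start, len);   /* "rotate" down to byte 0 */
    memset(aligned_out + len, 0, FETCH_BYTES - len);
}

int main(void)
{
    uint8_t buf[FETCH_BYTES] = {0}, out[FETCH_BYTES];
    buf[11] = 0xAB;                  /* pretend an instruction starts at byte 11 */
    align_instruction(buf, 11, 4, out);
    printf("aligned first byte: 0x%02X\n", out[0]);   /* prints 0xAB */
    return 0;
}
```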
 

somerandomusername

Power User
Posts
96
Reaction score
41
Thanks to everyone for replying. So, realistically speaking, for a user, does this mean anything in terms of power consumption (more or less), performance (more or less), expense for the chip (more or less), etc.? I know it’s hard to reduce it down to just that, but are there any inherent drawbacks or benefits if Intel converts to a non-hyperthreaded x86 CPU?

Is this a good strategy for Intel? Again, hard to know for sure since there’s a lot going on, but I’m just curious whether there’s any inherent drawback or benefit compared to their current offerings, because it seems Intel has relied a lot on this technology, and I’ve read they’re pretty terrible at core design and that most of their advantage used to come from their manufacturing process.
 
Last edited:

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521
Thanks to everyone for replying. So, realistically speaking, for a user, does this mean anything in terms of power consumption (more or less), performance (more or less), expense for the chip (more or less), etc.? I know it’s hard to reduce it down to just that, but are there any inherent drawbacks or benefits if Intel converts to a non-hyperthreaded x86 CPU?

Is this a good strategy for Intel? Again, hard to know for sure since there’s a lot going on, but I’m just curious whether there’s any inherent drawback or benefit compared to their current offerings, because it seems Intel has relied a lot on this technology, and I’ve read they’re pretty terrible at core design and that most of their advantage used to come from their manufacturing process.

Short answer is “dunno.”

The goal is to increase IPC. If this gives higher IPC than their prior approach, then that is probably good, as the new approach probably doesn’t require more power than hyperthreading (given that they’ve presumably changed dozens of other things to compensate).
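
For anyone following along, IPC is just instructions retired divided by clock cycles. Here is a trivial sketch with made-up numbers (not measurements of any real core) to show the comparison being made:

```c
#include <stdio.h>

/* IPC = instructions retired / clock cycles. All numbers below are made up
 * purely to illustrate the comparison being discussed; they are not
 * measurements of any real core. */
int main(void)
{
    /* Hypothetical HT core: two threads sharing one core's execution units. */
    double ht_per_thread_ipc = 0.9;
    double ht_core_ipc       = 2 * ht_per_thread_ipc;   /* 1.8 total */

    /* Hypothetical wider non-HT core: one thread, higher IPC for that thread. */
    double wide_core_ipc     = 1.7;

    printf("HT core:   %.2f IPC total (%.2f per thread)\n",
           ht_core_ipc, ht_per_thread_ipc);
    printf("Wide core: %.2f IPC total, all of it for one thread\n",
           wide_core_ipc);
    return 0;
}
```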
 

somerandomusername

Power User
Posts
96
Reaction score
41
Thanks! That’s what I wanted to know when reading that article, because Intel really isn’t providing competition to anyone lol. I believe their newest top-of-the-line CPUs can go up to like 900 W or something insane when actually measured. And I’ve read their newest laptop CPUs are a regression in performance, so I didn’t know whether removing this was inherently a smart move or a dumb one.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,148
Thanks! That’s what I wanted to know when reading that article, because Intel really isn’t providing competition to anyone lol. I believe their newest top-of-the-line CPUs can go up to like 900 W or something insane when actually measured. And I’ve read their newest laptop CPUs are a regression in performance, so I didn’t know whether removing this was inherently a smart move or a dumb one.
If it works? The former. If it doesn’t? The latter. 🙃

But in all seriousness, this seems part of a concerted effort by Intel to catch up to ARM/Apple in both architecture and microarchitecture. There’s an earlier thread on these forums about Intel’s white paper proposing to modify x86 to be more ARM-like. The timeline for implementing this is of course unknown, but Intel understands the threat that ARM poses, and, like in the 90s with those RISC competitors, it’s attempting to respond. However, today it’s having to play catch-up in more than one field. In the 90s, its manufacturing advantage allowed it some headroom that it doesn’t currently enjoy, although it might be said that Intel has greater market entrenchment than it did back then. So we’ll see what happens.

Of course every chip maker has its own headwinds and nothing is certain but if I wanted an easy job I’d rather be at the helm of any of the others right now - not that I have the qualifications 🙃.
 
Last edited:

mr_roboto

Site Champ
Posts
288
Reaction score
464
However, today it’s having to play catch-up in more than one field. In the 90s, its manufacturing advantage allowed it some headroom that it doesn’t currently enjoy, although it might be said that Intel has greater market entrenchment than it did back then. So we’ll see what happens.
I don't think Intel had any particular process advantage back then. They may have been better at volume manufacturing (yields, costs, etc), but IIRC it wasn't till the 00s that they began pulling away from everyone on circuit performance numbers.

What they did have was the lion's share of the most expensive component in by far the most popular personal computer, during a time when the industry was growing by leaps and bounds. That money printer enabled them to hire great engineering teams, and during times when their management didn't screw things up too badly, those teams delivered results.

They still have a lot of the PC marketshare advantage left today, but they missed out on the two biggest growth areas for big expensive chips - smartphone SoCs and high performance GPUs. That means they no longer own nearly as huge a slice of the high-margin big-chip business as they once did.

Also, they aren't necessarily attracting the cream of the crop any more, in part because they're being penny pinchers (or so I've heard), in part because everyone who's been around a while knows someone (or at least a friend of a friend) who worked at Intel and got out because of how toxic the company's internal culture is. (If you do anything good, better be prepared for a rival team's manager to undermine your work to make themselves look better.)
 

leman

Site Champ
Posts
641
Reaction score
1,196

Apparently Intel may be ditching Hyperthreading. I was under the impression ARM doesn’t do Hyperthreading for reasons that X86 can’t do/have.

I am a bit surprised that this is being reported as news by notebookcheck? Roadmap leaks already claimed a while ago that Intel would be ditching SMT in order to pursue higher single-core performance and efficiency. If I remember correctly, they are supposed to be working on something called “rentable units”, and while the information was very incomplete, there was some speculation that it could mean cores that can borrow each other’s resources to adapt to the situation.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,148
I don't think Intel had any particular process advantage back then. They may have been better at volume manufacturing (yields, costs, etc), but IIRC it wasn't till the 00s that they began pulling away from everyone on circuit performance numbers.

What they did have was the lion's share of the most expensive component in by far the most popular personal computer, during a time when the industry was growing by leaps and bounds. That money printer enabled them to hire great engineering teams, and during times when their management didn't screw things up too badly, those teams delivered results.

They still have a lot of the PC marketshare advantage left today, but they missed out on the two biggest growth areas for big expensive chips - smartphone SoCs and high performance GPUs. That means they no longer own nearly as huge a slice of the high-margin big-chip business as they once did.

Also, they aren't necessarily attracting the cream of the crop any more, in part because they're being penny pinchers (or so I've heard), in part because everyone who's been around a while knows someone (or at least a friend of a friend) who worked at Intel and got out because of how toxic the company's internal culture is. (If you do anything good, better be prepared for a rival team's manager to undermine your work to make themselves look better.)
My memory is that they did have a process advantage, but I could be wrong. Perhaps it was, as you say, in volume and cost, which could still be a substantial advantage depending on the circumstances.

In terms of market entrenchment, yes, they missed out on smartphones and GPUs, but the PC and especially the server market grew on top of Intel chips. Those worlds running on Intel for so long is a massive built-in advantage. Server chips are where they make a huge proportion of their revenue and profits, and that market intrinsically turns over more slowly. Hyperscalers like Amazon, Google, and Microsoft may have more ability to switch to new processors if they want to, but smaller server operators are expected to be more resistant to change.
I am a bit surprised that this is being reported as news by notebookcheck? Roadmap leaks already claimed a while ago that Intel would be ditching SMT in order to pursue higher single-core performance and efficiency. If I remember correctly, they are supposed to be working on something called “rentable units”, and while the information was very incomplete, there was some speculation that it could mean cores that can borrow each other’s resources to adapt to the situation.
Notebookcheck did report on that earlier too, based on those leaks; they reference that article in the OP article, and I linked it in my later post. What was surprising in the OP article, and why they felt it merited a new article, is that the latest leaks say Intel is ditching HT earlier than expected, and that it is doing so because it’s having trouble getting HT working with the new cores. Since they were already planning on ditching it in later cores, it isn’t considered important enough to spend the engineering resources to get it working properly.
 
Last edited:

Yoused

up
Posts
5,623
Reaction score
8,942
Location
knee deep in the road apples of the 4 horsemen
How about a bullet point summary comparing these processors using terms that non-technical people can understand? :)
Really dumbing it down:
Picture a hose. You want to get water through it quickly. It works well up to a point, but if you try to push too hard, it gets all frothy, which reduces the volume of liquid that flows through. Intel looked at this and thought, if we try to push two streams side by side, they can fill each other's bubbles and thus be more efficient. It did work fairly well, but on heavy workloads, the advantage is not that great because there are fewer bubbles to fill.

PowerPC and 64-bit ARM instead have tried to make the hose bigger in order to move more liquid volume, which is a rather more difficult thing to do with the Intel design. The bigger hose lets them push it through slower for the same volume, which is one reason Apple and ARM chips tend to be more efficient (pushing slower takes less power and creates less froth).

Obviously, the design issues are much more complex. Making the hose bigger involves a pretty elaborate design, but so does feeding two streams side by side. One advantage of the latter is that the equivalent of two processor cores can fit into only a little more than the space that one core takes up. But it does not normally perform like two cores; usually more like 1.7 or so. And there are security concerns about a program on one half of the core being able to spy on the other half.

Intel seems to be getting pretty good performance out of their single-thread cores, so they are looking at ways to make better use of those. There are some complex logic units that take up a lot of space, so if two cores are put next to each other, with the big logic unit between them where either core can use it, the size advantage of dual-stream cores is gained without the logic that has to balance the two streams. As long as both cores are not hitting up the shared logic at the same time, it should be a net gain, at least in area (ARM is already doing this with the small cores).
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521
My memory is that they did have a process advantage but I could be wrong. Perhaps it was as you say in volume and cost which could still be a substantial advantage depending.

As someone who worked at AMD in the 1990s, when we had our own fabs (in Austin and Dresden), I can definitely say both are true. Our fab yields were much worse and our processes were less predictable than Intel’s, and that was probably Intel’s biggest advantage. That said, while our processes were generally in the ballpark of what Intel could do (ignoring yield), they were generally inferior. We would leapfrog them from time to time, but any advantage we had didn’t last very long.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
As someone who worked at AMD in the 1990s, when we had our own fabs (in Austin and Dresden), I can definitely say both are true. Our fab yields were much worse and our processes were less predictable than Intel’s, and that was probably Intel’s biggest advantage. That said, while our processes were generally in the ballpark of what Intel could do (ignoring yield), they were generally inferior. We would leapfrog them from time to time, but any advantage we had didn’t last very long.
From the outside, AMD's big bet on SOI seemed problematic - hard to swim against the rest of the industry's current. Curious whether you think that played a role.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,329
Reaction score
8,521
From the outside, AMD's big bet on SOI seemed problematic - hard to swim against the rest of the industry's current. Curious whether you think that played a role.
The reverse - SOI saved our bacon, and allowed us to get decent performance for K8 despite our transistor architecture and interconnect not being as good as Intel’s. It was a pleasure designing Opteron on SOI, which made my life a lot easier. And yields were pretty good, too.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,148
The reverse - SOI saved our bacon, and allowed us to get decent performance for K8 despite our transistor architecture and interconnect not being as good as Intel’s. It was a pleasure designing Opteron on SOI, which made my life a lot easier. And yields were pretty good, too.
I had to look up SOI; it sounds interesting. What was nice about it, and why don’t more fabs use it (or still use it)?
 