Nuvia: don’t hold your breath

Yoused

up
Posts
5,621
Reaction score
8,938
Location
knee deep in the road apples of the 4 horsemen
But you want to optimize memory bus saturation, based on the workload, just like you want EU saturation inside a core. There should be a unit that specifically assesses throughput efficiency and adjusts the clocks to minimize stalls while keeping everyone that has work to do busy. Where I used to work, we ran our machines much slower than top speed, because every fault stop was wasted productivity: often, you can get more work done at a slower pace by running steadily, just like you can get through town more efficiently by driving slower so that you are not stopping for every red light.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,328
Reaction score
8,520
Slowing the clock wouldn’t do much, for a few reasons. It’s better to run at normal speed and if you take a stall you take a stall - the core burns zero dynamic power if it really has nothing to do (and, in modern designs, almost zero static power, because not only do you shut off the clocks, but you locally raise VSS to VDD and shut off power to circuits that have nothing to do).

You can only slow the clock so far before you run into hold-time violations and start producing wrong answers. And slowing the clock is only a linear effect, so you want to reduce V, too (squared effect). But reducing V increases slew times on the wires, which can result in noise injection errors from neighboring wires. Which is a long-winded way of saying that you can slow to whatever your minimum safe frequency is, and that’s about it.

So then the question is, if you know you’re going to have nothing to do in 2 out of 10 cycles, is it better to run full speed for 8 and then do nothing for 2, or is it better to slow the clock so as to spread out the 8 to take up the time of 10 cycles? Probably the former, because you can’t easily figure out what effect the bandwidth starvation is having on the user. You never know when an interrupt can come along and moot your bandwidth pattern, or some interaction between processes will change and moot it. It would be a lot of guesswork. And you have to burn current for the circuitry to figure all that out. Plus it smells to me like it would introduce the possibility of all sorts of side-channel attacks. And the gain seems pretty minimal.
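
A rough way to compare the two options is to count dynamic energy under the simple P = ½CfV^2 model. This is only a minimal sketch, not a model of any real core: the capacitance and voltages are made-up illustrative values, idle cycles are assumed to cost nothing (clock-gated and power-gated), and static power and the cost of any monitoring logic are ignored.

```python
def dynamic_energy(c_farads, v_volts, active_cycles):
    """Dynamic energy spent over a number of active cycles.

    Power is 0.5*C*f*V^2 and each cycle lasts 1/f, so the energy per
    cycle is 0.5*C*V^2 no matter how fast the clock runs.
    """
    return 0.5 * c_farads * v_volts**2 * active_cycles

C = 1e-9       # hypothetical switched capacitance (F)
V_FULL = 1.0   # hypothetical nominal voltage (V)
V_LOW = 0.9    # hypothetical reduced voltage usable at the slower clock (V)

# Option A: run 8 cycles at full speed, then idle for 2 (idle assumed free)
race_to_idle = dynamic_energy(C, V_FULL, 8)
# Option B: slow the clock so the same 8 cycles fill the whole window, same V
stretched_same_v = dynamic_energy(C, V_FULL, 8)
# Option C: slow the clock and also lower the voltage
stretched_low_v = dynamic_energy(C, V_LOW, 8)

print(race_to_idle, stretched_same_v, stretched_low_v)
```

Under these assumptions, slowing the clock alone saves no dynamic energy for the same amount of work; any saving has to come from lowering V, which is limited by the minimum safe operating point described above.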

That said, it would be an interesting thing to simulate to see what the effect might be with real workloads.
 

leman

Site Champ
Posts
641
Reaction score
1,196
I'm having trouble getting power metrics to display the old format (cluster, CPU, DRAM, package). I guess it looks like this now?

View attachment 29165

@leman did they change the format? I looked at the man page but couldn't figure out how to access the previous data, tried --unhide-info <samplers> comma separated list of samplers to unhide (backwards compatibility) with various "dram_power" or "package_power" to no avail. EDIT: it seems they have removed some of the old sensors?

Yeah, they removed the DRAM counters a while ago. I also don't see a way to query this information in their various frameworks.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
You can only slow the clock so far before you run into hold-time violations and start producing wrong answers.

I would have thought hold-time violations are frequency independent. What's the mechanism behind this? Clock tree effects? If clock edges arrive at two flops involved in a potential hold-time violation with the same skew across the whole frequency range, it seems to me like it shouldn't matter what the frequency is.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Over at Reddit, Andrei F seemingly posted this:
“The table is misinterpreted and wrong as how it's portrayed - it's not per-SKU power variance, you should just wait for actual products. The workload is also not something realistic.”

I’m not sure it clears things up.
 

leman

I don't think it clears up things at all.

Of course, it should be obvious that the AndroidAuthority article is a pile of poop, they got basic specs wrong, their language is confusing, and there is no discussion of methodology or what the numbers actually mean. Could be that this is some sort of dumb stress test, which would be worthless.

We need to wait for the final products.
 

Jimmyjames

Agreed to all of this. One thing I find amusing is, every site reporting on this agrees that the only tests performed are those approved by Qualcomm and only on their machines. So if these are not realistic tests, why are Qualcomm using them?
 

Cmaier

I would have thought hold-time violations are frequency independent. What's the mechanism behind this? Clock tree effects? If clock edges arrive at two flops involved in a potential hold-time violation with the same skew across the whole frequency range, it seems to me like it shouldn't matter what the frequency is.

Skew is always frequency dependent. And wires and gates have capacitance. At low frequency, you allow more time for things to discharge.

We used to have to run timing analysis at an assumed minimum clock speed (“min time”) to make sure we didn’t break anything when the clock wasn’t running at full speed (which is the corner where we spent most of our design effort).
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
Power is proportional to frequency x voltage^2, yes. (There’s a C and a ½ in there, too). But not sure I understand the rest of your post. Voltage and frequency are, in a sense, independent. You can, in theory, increase frequency without increasing voltage, and vice versa. (Though, to achieve more than a little frequency gain you likely need to increase voltage, because higher voltage causes transistors to switch faster).

That relation describes independently switching circuits. When amalgamated into a large chip, we usually see the kind of curve I drew above. As the curve gets more and more horizontal (there’s a horizontal asymptote), a small increase in performance can require a huge increase in power.
While the numbers from Android Authority seem suspect, for my own edification, I was wondering if I could plumb this direction further. If I understand correctly, not only is P ~= F x V^2 a feature of simple circuits, but the blurb I found earlier that made it seem like voltage and frequency were collinear was also an oversimplification for their toy example. In reality, full chips have a more complex relationship between the two, and even for simple circuits an increase in frequency may necessitate a range of possible voltage increases, including no increase at all or potentially a greater percentage increase in voltage than in frequency?

For example assuming the numbers from Android Authority were correct and assuming a simple circuit, then where the observed ratio of power of the top tier to the second tier chip is 2x (80W/40W) and the increase in all core frequency was 3.8/3.4, the needed increase in voltage to explain the power increase is about 1.34.

(P1/P0) = (F1/F0)*(V1/V0)^2 => V1/V0 = sqrt(2*3.4/3.8) = 1.34

So to explain the apparent 2x power draw (which, as @Jimmyjames wrote, Andrei is seemingly disputing) and the stated clock increase, they would have needed to increase the voltage through the chip by 34%, presumably to cause the transistors to switch fast enough to keep up with the clocks (again, assuming a simple circuit, which it is not).
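
The arithmetic can be checked with a few lines of Python. This is only a sketch under the same simple-circuit assumption (equal C on both chips, P proportional to F x V^2); the function name is made up for illustration, and the inputs are the Android Authority figures quoted above, which may not be reliable.

```python
import math

def voltage_ratio(power_ratio, freq_ratio):
    """Solve P1/P0 = (F1/F0) * (V1/V0)^2 for V1/V0, assuming equal C."""
    return math.sqrt(power_ratio / freq_ratio)

power_ratio = 80 / 40     # reported power, top tier vs. second tier
freq_ratio = 3.8 / 3.4    # reported all-core clock ratio

v_ratio = voltage_ratio(power_ratio, freq_ratio)
print(f"V1/V0 = {v_ratio:.2f}")
```

Dividing the power ratio by the frequency ratio before taking the square root is the key step: whatever part of the power increase the clock does not account for is attributed to voltage.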

Do I have that right? or am I still not understanding something?

I have to admit I am a little frustrated. While Apple can be annoyingly vague in their product announcements, at least you generally only have to wait a few weeks tops to see results in the wild. Qualcomm announced in October, and I get why they did that, but they are trying to have their cake and eat it too with really early product announcements while simultaneously being Apple-like (or worse actually, Apple is more forthcoming, which is saying a lot). They then have to send Andrei (presumably he got clearance to say things, generally companies don't like engineers just spouting off the cuff) to clean up their own communications with "well actually that's not correct but I can't tell you what's correct, wait for final product release". That's a little aggravating. I mean launch is not *that* far away now but ... still ...
 

Jimmyjames

100%. They are talking a lot and saying very little. It’s an endless charade of “this score, but also not this score….except maybe it is”. Scores that don’t make sense. High Geekbench scores, and then mediocre Cinebench scores. It feels like they are obfuscating until they can get stuff out there.

I wouldn’t mind much, but I found their initial “we’re setting the standard for performance” claim, which then turned out to be wrong, rather aggravating. Then there is the nonsense with power usage. In October there was talk of “we can match the M2 at 30% less power”. Now it uses 100 watts. It’s possible it all makes sense when these devices come out, but for now it’s a little much to take.
 

Cmaier

While the numbers from Android Authority seem suspect, for my own edification, I was wondering if I could plumb this direction further. If I understand correctly, not only is P ~= F x V^2 a feature of simple circuits, but the blurb I found earlier that made it seem like voltage and frequency were collinear was also an oversimplification for their toy example. In reality, full chips have a more complex relationship between the two, and even for simple circuits an increase in frequency may necessitate a range of possible voltage increases, including no increase at all or potentially a greater percentage increase in voltage than in frequency?

Right. The issue is that P=½CfV^2 holds, but V and f are not independent variables at the chip level. At a given voltage, there is a range of frequencies that works, but if you want to increase the frequency beyond that range, you need to increase V. So if you zoom in close on that curve I drew, it would be made up of lots of tiny f=2P/CV^2 sections, where different parts of the curve have different V’s. Because as you move to the right V has to get higher and higher, and you square it to get P, the curve flattens out toward an asymptote as you move to the right.
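
To get a feel for why the curve flattens, here is a small sketch that strings together a few operating points, each using P = ½CfV^2 with its own V. The voltage/frequency pairs are invented purely to illustrate the shape; they are not measurements from any real chip, and C is in arbitrary units.

```python
# Hypothetical operating points: (voltage in V, highest stable frequency in GHz).
# These numbers are invented for illustration, not taken from any real part.
OPERATING_POINTS = [(0.70, 2.0), (0.80, 2.8), (0.95, 3.4), (1.15, 3.8), (1.40, 4.0)]
C = 1.0  # arbitrary capacitance units; only relative power matters here

def power(freq_ghz, volts, c=C):
    """Dynamic power from P = 1/2 * C * f * V^2 (arbitrary units)."""
    return 0.5 * c * freq_ghz * volts**2

points = [(f, power(f, v)) for v, f in OPERATING_POINTS]

# Extra power needed per extra GHz between neighbouring operating points
marginal = [
    (p1 - p0) / (f1 - f0)
    for (f0, p0), (f1, p1) in zip(points, points[1:])
]

for (f, p), m in zip(points[1:], marginal):
    print(f"{f:.1f} GHz: power {p:.2f}, marginal cost {m:.2f} per GHz")
```

With these made-up numbers, each step up in frequency costs more power per GHz than the step before it, which is the same flattening you see when performance is plotted against power.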



For example assuming the numbers from Android Authority were correct and assuming a simple circuit, then where the observed ratio of power of the top tier to the second tier chip is 2x (80W/40W) and the increase in all core frequency was 3.8/3.4, the needed increase in voltage to explain the power increase is about 1.34.

(P1/P0) = (F1/F0)*(V1/V0)^2 => V1/V0 = sqrt(2*3.4/3.8) = 1.34

So to explain the apparent 2x power draw (which, as @Jimmyjames wrote, Andrei is seemingly disputing) and the stated clock increase, they would have needed to increase the voltage through the chip by 34%, presumably to cause the transistors to switch fast enough to keep up with the clocks (again, assuming a simple circuit, which it is not).

Do I have that right? or am I still not understanding something?

I’ve sort of lost track of all the benchmark numbers, so I’ll go by my understanding of what you just said. Increasing the clock 12% ((3.8-3.4)/3.4) would, ceteris paribus, cause a 12% increase in power dissipation. The “top tier” chip would be expected to have smaller C than the second tier chip - that’s one of the things that makes it bin faster. But if you assume C is the same, then the rest must be voltage. So to achieve 12% faster clock, they had to raise the voltage by 9 or 10% or so? (9^2 + 12 approximating 100% power increase?)
 