AMD's 7000 series GPUs - RDNA 3

leman

Site Champ
Posts: 643
Reaction score: 1,197
I am a bit confused by their claimed 2x ALU throughput. From Anandtech:

The biggest impact is how AMD is organizing their ALUs. In short, AMD has doubled the number of ALUs (Stream Processors) within a CU, going from 64 ALUs in a single Dual Compute Unit to 128 inside the same unit. AMD is accomplishing this not by doubling up on the Dual Compute Units, but instead by giving the Dual Compute Units the ability to dual-issue instructions. In short, each SIMD lane can now execute up to two instructions per cycle.

How is that supposed to work? Don't you actually need to double the execution units to execute twice as many instructions per cycle? Can someone with a better understanding of the matter explain to me what exactly is happening here?
 

Nycturne

Elite Member
Posts: 1,141
Reaction score: 1,492
It is rather confusingly worded.

I think the key here is this bit "AMD has doubled the number of ALUs (Stream Processors) within a CU". So there are double the ALUs to execute the operations, based on what AMD was claiming during the presentation.
 

dada_dave

Elite Member
Posts: 2,175
Reaction score: 2,171
I am a bit confused by their claimed 2x ALU throughput. From Anandtech:



How is that supposed to work? Don't you actually need to double the execution units to execute twice as many instructions per cycle? Can someone with a better understanding of the matter explain to me what exactly is happening here?
It is rather confusingly worded.

I think the key here is this bit "AMD has doubled the number of ALUs (Stream Processors) within a CU". So there are double the ALUs to execute the operations, based on what AMD was claiming during the presentation.

I think the key bit is what they say next:

But, as with all dual-issue configurations, there is a trade-off involved. The SIMDs can only issue a second instruction when AMD’s hardware and software can extract a second instruction from the current wavefront. This means that RDNA 3 is now explicitly reliant on extracting Instruction Level Parallelism (ILP) from wavefronts in order to hit maximum utilization. If the next instruction in a wavefront cannot be executed in parallel with the current instruction, then those additional ALUs will go unfilled.

So there are two execution units, but they are tied together in a single ALU such that if there isn't any ILP you don't get the benefit of the second execution resource. The article goes on to say that GCN tried this and RDNA 1 moved away from it because AMD had trouble actually getting a significant advantage out of ILP in GCN. Being able to take advantage of high ILP is of course one of the ways Apple's ARM CPUs make their mark, but those are CPUs, and even there, going beyond 8-wide is thought to bring diminishing returns. This RDNA 3 design is apparently 2-wide. I have no idea how wide GCN was, nor do I know how much ILP the typical GPU algorithm has on average (EDIT: I think I read that you can expect about 6 parallel instructions on average in a CPU algorithm, though I'd imagine the distribution matters too for determining whether going wide is beneficial, not just the average. I doubt that figure is typical for GPU algorithms anyway, and I don't know the original source for that number of parallel instructions in flight, or how or when it was calculated).
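
To make the ILP requirement concrete, here is a minimal C sketch (my own illustration, nothing AMD-specific) contrasting a dependency chain, which can only ever fill one of the two issue slots per cycle, with a regrouped form that exposes a pair of independent operations:

Code:
#include <stdio.h>

/* Hypothetical illustration of why dual-issue needs ILP.
   sum_serial: every add depends on the previous one, so a 2-wide
   unit can still only start one add per cycle (chain depth 3).
   sum_paired: the first two adds are independent and could be
   issued together, shortening the chain to depth 2. */

float sum_serial(float a, float b, float c, float d) {
    float t = a + b;   /* cycle 1: slot A busy, slot B idle */
    t = t + c;         /* cycle 2: waits on t               */
    return t + d;      /* cycle 3: waits on t               */
}

float sum_paired(float a, float b, float c, float d) {
    float t1 = a + b;  /* cycle 1, slot A                    */
    float t2 = c + d;  /* cycle 1, slot B: independent of t1 */
    return t1 + t2;    /* cycle 2                            */
}

int main(void) {
    printf("%f %f\n", sum_serial(1, 2, 3, 4), sum_paired(1, 2, 3, 4));
    return 0;
}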
 
Last edited:

leman

Site Champ
Posts: 643
Reaction score: 1,197
I think the key bit is what they say next:



So there are two execution units, but they are tied together in a single ALU such that if there isn't any ILP you don't get the benefit of the second execution resource. The article goes on to say that GCN tried this and RDNA 1 moved away from it because AMD had trouble actually getting a significant advantage out of ILP in GCN. Being able to take advantage of high ILP is of course one of the ways Apple's ARM CPUs make their mark, but those are CPUs, and even there, going beyond 8-wide is thought to bring diminishing returns. This RDNA 3 design is apparently 2-wide. I have no idea how wide GCN was, nor do I know how much ILP the typical GPU algorithm has on average (EDIT: I think I read that you can expect about 6 parallel instructions on average in a CPU algorithm, though I'd imagine the distribution matters too for determining whether going wide is beneficial, not just the average. I doubt that figure is typical for GPU algorithms anyway, and I don't know the original source for that number of parallel instructions in flight, or how or when it was calculated).

Yeah, you see, that’s what I find confusing. Modern AMD ALUs were already 32-wide from what I understand (just like Apple’s): a single instruction operates on 32 lanes in parallel. So when you write “RDNA3 is 2-wide”… how exactly is that supposed to work? ALU “width” usually refers to SIMD, the number of data elements a single ALU can process, but a single ALU is limited to a single operation. What they are talking about, however, is the ability to execute two instructions per cycle, so there must be twice as many independently scheduled ALUs. This is essentially superscalar execution, just like Nvidia has been doing for a while.

What I find a bit surprising is that Anandtech writes that this is a “cheap” way to increase compute performance. I would expect superscalar execution to come with additional scheduler complexity and state overhead. Unless we are really talking about a very limited form of superscalar execution with no speculation or hardware dependency tracking, where the code essentially consists of two instruction streams (each associated with ALU set A or B) and dependency tracking is done by the compiler. E.g., if I want to compute something like x = a + b + c + d, the program could look something like this (provided unsafe math transforms are enabled, of course):

Code:
A: temp1 = a + b               B: temp2 = c + d
   x = temp1 + temp2              NOP


AMD usually publishes low-level details of their GPUs, so I hope the exact organization of their CUs and the execution model will become clearer in the future.
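
As an aside on the unsafe math caveat: float addition is not associative, so the compiler is not allowed to regroup the sum into the ILP-friendly form unless you explicitly permit it (e.g. with GCC/Clang's -ffast-math or -fassociative-math). A quick C demonstration (my own, not AMD-specific):

Code:
#include <stdio.h>

/* Float addition is not associative, so regrouping a + b + c + d
   into (a + b) + (c + d) can change the result. The exact sum of
   the values below is 2, but neither grouping produces it. */
int main(void) {
    float a = 1e8f, b = 1.0f, c = -1e8f, d = 1.0f;

    float serial = ((a + b) + c) + d;  /* left-to-right, as C specifies      */
    float paired = (a + b) + (c + d);  /* regrouped into two independent adds */

    printf("serial: %f\n", serial);    /* prints 1.000000 */
    printf("paired: %f\n", paired);    /* prints 0.000000 */
    return 0;
}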
 

dada_dave

Elite Member
Posts
2,175
Reaction score
2,171
Yeah, you see, that’s what I find confusing. Modern AMD ALUs were already 32-wide from what I understand (just like Apple’s): a single instruction operates on 32 lanes in parallel. So when you write “RDNA3 is 2-wide”… how exactly is that supposed to work? ALU “width” usually refers to SIMD, the number of data elements a single ALU can process, but a single ALU is limited to a single operation. What they are talking about, however, is the ability to execute two instructions per cycle, so there must be twice as many independently scheduled ALUs. This is essentially superscalar execution, just like Nvidia has been doing for a while.

What I find a bit surprising is that Anandtech writes that this is a “cheap” way to increase compute performance. I would expect superscalar execution to come with additional scheduler complexity and state overhead. Unless we are really talking about a very limited form of superscalar execution with no speculation or hardware dependency tracking, where the code essentially consists of two instruction streams (each associated with ALU set A or B) and dependency tracking is done by the compiler. E.g., if I want to compute something like x = a + b + c + d, the program could look something like this:

Code:
A: temp1 = a + b               B: temp2 = c + d
   x = temp1 + temp2              NOP


AMD usually publishes low-level details of their GPUs, so I hope the exact organization of their CUs and the execution model will become clearer in the future.

Yeah, I think that's exactly it: each thread can now perform two independent instructions per cycle. Thus the 2-wide is separate from the width of the SIMD, and it's probably a very simple check: if these two instructions right after each other are independent, do them both at the same time. It is confusingly worded, but I think that's what they meant. So you can theoretically perform 2x the instructions per cycle per ALU compared to before, but the actual width of the SIMD, the number of threads, hasn't increased.
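
To illustrate, here is a toy C sketch of the sort of adjacent-pair independence check a compiler (or a very simple issue stage) might perform. The register encoding and the pairing rules are my own simplification for the sake of the example, not AMD's actual constraints:

Code:
#include <stdbool.h>
#include <stdio.h>

/* Toy model: an instruction writes one register and reads two. */
typedef struct {
    int dst;   /* destination register   */
    int src0;  /* first source register  */
    int src1;  /* second source register */
} Instr;

/* Two adjacent instructions can be issued together on a 2-wide unit
   if the second neither reads nor overwrites the first's result. */
bool can_dual_issue(Instr first, Instr second) {
    bool raw = second.src0 == first.dst || second.src1 == first.dst; /* read-after-write  */
    bool waw = second.dst == first.dst;                              /* write-after-write */
    return !raw && !waw;
}

int main(void) {
    Instr add1 = { .dst = 2, .src0 = 0, .src1 = 1 }; /* r2 = r0 + r1                      */
    Instr add2 = { .dst = 5, .src0 = 3, .src1 = 4 }; /* r5 = r3 + r4, independent of add1 */
    Instr add3 = { .dst = 6, .src0 = 2, .src1 = 5 }; /* r6 = r2 + r5, depends on both     */

    printf("add1 + add2 pairable: %d\n", can_dual_issue(add1, add2)); /* 1 */
    printf("add2 + add3 pairable: %d\n", can_dual_issue(add2, add3)); /* 0 */
    return 0;
}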
 

diamond.g

Site Champ
Posts: 254
Reaction score: 89

dada_dave

Elite Member
Posts: 2,175
Reaction score: 2,171
Folks think N31 is broken hardware-wise. They are saying that because the chips are A0 silicon, it must be broken. Apparently most GPUs are A0, so that isn't it. Really, they have driver issues, and possibly VBIOS issues.
I was out of the loop too so I didn’t understand the context of Ian’s tweet either. Thanks!
 