AMD's 7000 series GPUs - RDNA 3

leman

Site Champ
Posts: 643
Reaction score: 1,197
I am a bit confused by their claimed 2x ALU throughput. From Anandtech:

The biggest impact is how AMD is organizing their ALUs. In short, AMD has doubled the number of ALUs (Stream Processors) within a CU, going from 64 ALUs in a single Dual Compute Unit to 128 inside the same unit. AMD is accomplishing this not by doubling up on the Dual Compute Units, but instead by giving the Dual Compute Units the ability to dual-issue instructions. In short, each SIMD lane can now execute up to two instructions per cycle.

How is that supposed to work? Don't you actually need to double the execution units to execute twice as many instructions per cycle? Can someone with a better understanding of the matter explain to me what exactly is happening here?
 

Nycturne

Elite Member
Posts: 1,141
Reaction score: 1,492
It is rather confusingly worded.

I think the key here is this bit "AMD has doubled the number of ALUs (Stream Processors) within a CU". So there are double the ALUs to execute the operations, based on what AMD was claiming during the presentation.
 

dada_dave

Elite Member
Posts: 2,175
Reaction score: 2,171
I am a bit confused by their claimed 2x ALU throughput. From Anandtech:



How is that supposed to work? Don't you actually need to double the execution units to execute twice as many instructions per cycle? Can someone with a better understanding of the matter explain to me what exactly is happening here?
It is rather confusingly worded.

I think the key here is this bit "AMD has doubled the number of ALUs (Stream Processors) within a CU". So there are double the ALUs to execute the operations, based on what AMD was claiming during the presentation.

I think the key bit is what they say next:

But, as with all dual-issue configurations, there is a trade-off involved. The SIMDs can only issue a second instruction when AMD’s hardware and software can extract a second instruction from the current wavefront. This means that RDNA 3 is now explicitly reliant on extracting Instruction Level Parallelism (ILP) from wavefronts in order to hit maximum utilization. If the next instruction in a wavefront cannot be executed in parallel with the current instruction, then those additional ALUs will go unfilled.

So there are two execution units, but they are tied together in a single ALU such that if there isn't any ILP you don't get the benefit of the second execution resource. The article goes on to say that GCN tried this and RDNA 1 moved away from it because AMD had trouble actually getting a significant advantage out of ILP in GCN. Being able to take advantage of high ILP is of course one of the ways Apple's ARM CPUs make their mark, but those are CPUs, and even there, going beyond 8-wide is thought to bring diminishing returns. This RDNA 3 design is apparently 2-wide. I have no idea how wide GCN was, nor do I know how much ILP the typical GPU algorithm has on average (EDIT: I think I read that you can expect about 6 parallel instructions on average in a CPU algorithm, though I'd imagine the distribution matters too for determining whether going wide is beneficial, not just the average. I doubt that figure is typical for GPU algorithms anyway, and I don't know the original source for that number of parallel instructions in flight, or how or when it was calculated).
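
To make the ILP requirement concrete, here is a minimal C sketch (my own illustration, nothing AMD-specific) contrasting a dependency chain, which can only ever fill one of the two issue slots per cycle, with a regrouped form that exposes a pair of independent operations:

Code:
#include <stdio.h>

/* Hypothetical illustration of why dual-issue needs ILP.
   sum_serial: every add depends on the previous one, so a 2-wide
   unit can still only start one add per cycle (chain depth 3).
   sum_paired: the first two adds are independent and could be
   issued together, shortening the chain to depth 2. */

float sum_serial(float a, float b, float c, float d) {
    float t = a + b;   /* cycle 1: slot A busy, slot B idle */
    t = t + c;         /* cycle 2: waits on t               */
    return t + d;      /* cycle 3: waits on t               */
}

float sum_paired(float a, float b, float c, float d) {
    float t1 = a + b;  /* cycle 1, slot A                    */
    float t2 = c + d;  /* cycle 1, slot B: independent of t1 */
    return t1 + t2;    /* cycle 2                            */
}

int main(void) {
    printf("%f %f\n", sum_serial(1, 2, 3, 4), sum_paired(1, 2, 3, 4));
    return 0;
}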
 
Last edited:

leman

Site Champ
Posts: 643
Reaction score: 1,197
I think the key bit is what they say next:



So there are two execution units, but they are tied together in a single ALU such that if there isn't any ILP you don't get the benefit of the second execution resource. The article goes on to say that GCN tried this and RDNA 1 moved away from it because AMD had trouble actually getting a significant advantage out of ILP in GCN. Being able to take advantage of high ILP is of course one of the ways Apple's ARM CPUs make their mark, but those are CPUs, and even there, going beyond 8-wide is thought to bring diminishing returns. This RDNA 3 design is apparently 2-wide. I have no idea how wide GCN was, nor do I know how much ILP the typical GPU algorithm has on average (EDIT: I think I read that you can expect about 6 parallel instructions on average in a CPU algorithm, though I'd imagine the distribution matters too for determining whether going wide is beneficial, not just the average. I doubt that figure is typical for GPU algorithms anyway, and I don't know the original source for that number of parallel instructions in flight, or how or when it was calculated).

Yeah, you see, that’s what I find confusing. Modern AMD ALUs were already 32-wide from what I understand (just like Apple’s): a single instruction operates on 32 lanes in parallel. So when you write “RDNA3 is 2-wide”… how exactly is that supposed to work? ALU “width” usually refers to SIMD, the number of data elements a single ALU can process, but a single ALU is limited to a single operation. What they are talking about, however, is the ability to execute two instructions per cycle, so there must be twice as many independently scheduled ALUs. This is essentially superscalar execution, just like Nvidia has been doing for a while.

What I find a bit surprising is that Anandtech writes that this is a “cheap” way to increase compute performance. I would expect superscalar execution to come with additional scheduler complexity and state overhead. Unless we are really talking about a very limited form of superscalar execution with no speculation or hardware dependency tracking, where the code essentially consists of two instruction streams (each associated with ALU set A or B) and dependency tracking is done by the compiler. E.g., if I want to compute something like x = a + b + c + d, the program could look something like this (provided unsafe math transforms are enabled, of course):

Code:
A: temp1 = a + b               B: temp2 = c + d
   x = temp1 + temp2              NOP


AMD usually publishes low-level details of their GPUs, so I hope the exact organization of their CUs and the execution model will become clearer in the future.
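
As an aside on the unsafe math caveat: float addition is not associative, so the compiler is not allowed to regroup the sum into the ILP-friendly form unless you explicitly permit it (e.g. with GCC/Clang's -ffast-math or -fassociative-math). A quick C demonstration (my own, not AMD-specific):

Code:
#include <stdio.h>

/* Float addition is not associative, so regrouping a + b + c + d
   into (a + b) + (c + d) can change the result. The exact sum of
   the values below is 2, but neither grouping produces it. */
int main(void) {
    float a = 1e8f, b = 1.0f, c = -1e8f, d = 1.0f;

    float serial = ((a + b) + c) + d;  /* left-to-right, as C specifies      */
    float paired = (a + b) + (c + d);  /* regrouped into two independent adds */

    printf("serial: %f\n", serial);    /* prints 1.000000 */
    printf("paired: %f\n", paired);    /* prints 0.000000 */
    return 0;
}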
 

dada_dave

Elite Member
Posts
2,175
Reaction score
2,171
Yeah, you see, that’s what I find confusing. Modern AMD ALUs were already 32-wide from what I understand (just like Apple’s): a single instruction operates on 32 lanes in parallel. So when you write “RDNA3 is 2-wide”… how exactly is that supposed to work? ALU “width” usually refers to SIMD, the number of data elements a single ALU can process, but a single ALU is limited to a single operation. What they are talking about, however, is the ability to execute two instructions per cycle, so there must be twice as many independently scheduled ALUs. This is essentially superscalar execution, just like Nvidia has been doing for a while.

What I find a bit surprising is that Anandtech writes that this is a “cheap” way to increase compute performance. I would expect superscalar execution to come with additional scheduler complexity and state overhead. Unless we are really talking about a very limited form of superscalar execution with no speculation or hardware dependency tracking, where the code essentially consists of two instruction streams (each associated with ALU set A or B) and dependency tracking is done by the compiler. E.g., if I want to compute something like x = a + b + c + d, the program could look something like this:

Code:
A: temp1 = a + b               B: temp2 = c + d
   x = temp1 + temp2              NOP


AMD usually publishes low-level details of their GPUs, so I hope the exact organization of their CUs and the execution model will become clearer in the future.

Yeah, I think that's exactly it: each thread can now perform two independent instructions per cycle. Thus the 2-wide is separate from the width of the SIMD, and it's probably a very simple check: if these two instructions right after each other are independent, do them both at the same time. It is confusingly worded, but I think that's what they meant. So you can theoretically perform 2x the instructions per cycle per ALU compared to before, but the actual width of the SIMD, the number of threads, hasn't increased.
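
To illustrate, here is a toy C sketch of the sort of adjacent-pair independence check a compiler (or a very simple issue stage) might perform. The register encoding and the pairing rules are my own simplification for the sake of the example, not AMD's actual constraints:

Code:
#include <stdbool.h>
#include <stdio.h>

/* Toy model: an instruction writes one register and reads two. */
typedef struct {
    int dst;   /* destination register   */
    int src0;  /* first source register  */
    int src1;  /* second source register */
} Instr;

/* Two adjacent instructions can be issued together on a 2-wide unit
   if the second neither reads nor overwrites the first's result. */
bool can_dual_issue(Instr first, Instr second) {
    bool raw = second.src0 == first.dst || second.src1 == first.dst; /* read-after-write  */
    bool waw = second.dst == first.dst;                              /* write-after-write */
    return !raw && !waw;
}

int main(void) {
    Instr add1 = { .dst = 2, .src0 = 0, .src1 = 1 }; /* r2 = r0 + r1                      */
    Instr add2 = { .dst = 5, .src0 = 3, .src1 = 4 }; /* r5 = r3 + r4, independent of add1 */
    Instr add3 = { .dst = 6, .src0 = 2, .src1 = 5 }; /* r6 = r2 + r5, depends on both     */

    printf("add1 + add2 pairable: %d\n", can_dual_issue(add1, add2)); /* 1 */
    printf("add2 + add3 pairable: %d\n", can_dual_issue(add2, add3)); /* 0 */
    return 0;
}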
 

diamond.g

Site Champ
Posts: 254
Reaction score: 89

dada_dave

Elite Member
Posts: 2,175
Reaction score: 2,171
Folks think N31 is broken hardware-wise. They are saying that because the chips are A0 silicon, it must be broken. Apparently most GPUs are A0, so that isn't it. Really, they have driver issues, and possibly VBIOS issues.
I was out of the loop too so I didn’t understand the context of Ian’s tweet either. Thanks!
 