What does Apple need to do to catch Nvidia?

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
They work exactly the same way as the Vulkan extension on the CUDA warp matrix functions. You will find the info in the Metal Shading Language specification, section 6.7.
So I see the programming interface, but I guess what I'm trying to understand is what's happening at the hardware level. The CUDA warp matrix functions are accelerated by the tensor units, and the warp vote function is accelerated by the lanes in a warp being able to communicate (shuffle, broadcast) across their registers. How does matrix multiplication get accelerated on Apple hardware? You mentioned something about the index permutations being accelerated at the point of register fetch, to the point of being essentially free. That sounds very similar to the non-tensor warp-level intrinsics (shuffle, vote, broadcast, etc.) that I'm used to from Nvidia hardware, and the description of "permutations" in the Metal Shading Language reference mentions the same warp functions. I'm just not quite sure why using those would be as efficient/performant as a dedicated unit.
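To be concrete about what I mean by lane communication, here's a minimal CUDA sketch of the shuffle/vote intrinsics I'm referring to (nothing Apple-specific; the kernel and buffer names are just made up for illustration):

```cpp
// Warp-level communication with no tensor hardware involved: lanes exchange
// register values directly via shuffle and vote intrinsics.
__global__ void warp_comm_demo(const float* in, float* out, int* any_negative)
{
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x & 31;                     // lane index within the warp

    float v = in[threadIdx.x];

    float v0   = __shfl_sync(FULL_MASK, v, 0);       // broadcast lane 0's register to all lanes
    float next = __shfl_down_sync(FULL_MASK, v, 1);  // read the next lane's register
    int   neg  = __any_sync(FULL_MASK, v < 0.0f);    // warp vote: does any lane hold a negative?

    out[threadIdx.x] = v0 + next;                    // use the exchanged values
    if (lane == 0)
        atomicOr(any_negative, neg);
}
```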

Basically, if the matrix function is wrapping the same process as the Nvidia warp functions, then I'm curious why Nvidia went with tensor cores if they could accelerate matrix multiplication without the additional silicon*, and whether Nvidia could gain even more acceleration by utilizing their regular pipelines on top of their tensor cores. And if what you're describing is different, then I really don't understand what's going on.

*Edit: I'm not suggesting that you should have inside knowledge of Nvidia's thought process here. 🙃 I'm just trying to wrap my head around all of this.
 
Last edited:

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
So I see the programming interface, but I guess what I'm trying to understand is what's happening at the hardware level. The CUDA warp matrix functions are accelerated by the tensor units, and the warp vote function is accelerated by the lanes in a warp being able to communicate (shuffle, broadcast) across their registers. How does matrix multiplication get accelerated on Apple hardware? You mentioned something about the index permutations being accelerated at the point of register fetch, to the point of being essentially free. That sounds very similar to the non-tensor warp-level intrinsics (shuffle, vote, broadcast, etc.) that I'm used to from Nvidia hardware, and the description of "permutations" in the Metal Shading Language reference mentions the same warp functions. I'm just not quite sure why using those would be as efficient/performant as a dedicated unit.

Basically, if the matrix function is wrapping the same process as the Nvidia warp functions, then I'm curious why Nvidia went with tensor cores if they could accelerate matrix multiplication without the additional silicon*, and whether Nvidia could gain even more acceleration by utilizing their regular pipelines on top of their tensor cores. And if what you're describing is different, then I really don't understand what's going on.

*Edit: I'm not suggesting that you should have inside knowledge of Nvidia's thought process here. 🙃 I'm just trying to wrap my head around all of this.

Okay mystery solved I think:




If I've got this right (and someone please correct me if I don't): basically, Nvidia has the same (or similar enough) warp-level matrix instructions as Apple does currently, but also has asynchronous warp-group-level matrix instructions. The former appear to rely on the shuffle and broadcast instructions, which Apple calls data permutation, just as @leman described. The latter are issued in-order with the commands for the CUDA cores but are sent to the tensor cores and executed asynchronously alongside any CUDA core commands (just as I believe Nvidia's RT core commands are likewise asynchronous). This means Nvidia GPUs can do matmul across a warp using the normal FP pipelines and can also do it on the tensor cores, where it can be executed simultaneously with normal FP operations.
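For reference, this is roughly what the synchronous warp-level matrix path looks like on the CUDA side, a minimal sketch using the public nvcuda::wmma API (the kernel name and the 16x16x16 tile shape are just chosen for illustration):

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 tile C = A*B. A and B are half
// precision, the accumulator is float; every lane holds a slice of each tile.
__global__ void wmma_tile(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);            // warp-wide load of the A tile
    wmma::load_matrix_sync(b_frag, B, 16);            // ...and the B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // warp-wide matrix multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```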

I agree with @leman that Apple could go either way, continue with the current system or add dedicated matmul cores, depending on their silicon priorities and design, and that patent @leman found is certainly tantalizing. But I lean towards adding dedicated matmul cores on their own pipelines. Even though it costs extra silicon, and maybe the performance advantage of simultaneous execution is not that high, I would imagine that dedicated cores would be much more energy efficient, which we know Apple appreciates in a design. It also seems like that patent is leaning into the parallel nature of the data paths and the efficient doling out of different execution paths to threads, which again might support adding extra execution path types in an efficient manner. But maybe, like with forward progress guarantees, I'm reading too much into the patent!
 
Last edited:

leman

Site Champ
Posts
641
Reaction score
1,196
Okay mystery solved I think:




If I've got this right (and someone please correct me if I don't): basically, Nvidia has the same (or similar enough) warp-level matrix instructions as Apple does currently, but also has asynchronous warp-group-level matrix instructions. The former appear to rely on the shuffle and broadcast instructions, which Apple calls data permutation, just as @leman described. The latter are issued in-order with the commands for the CUDA cores but are sent to the tensor cores and executed asynchronously alongside any CUDA core commands (just as I believe Nvidia's RT core commands are likewise asynchronous). This means Nvidia GPUs can do matmul across a warp using the normal FP pipelines and can also do it on the tensor cores, where it can be executed simultaneously with normal FP operations.

I agree with @leman that Apple could go either way, continue with the current system or add dedicated matmul cores, depending on their silicon priorities and design, and that patent @leman found is certainly tantalizing. But I lean towards adding dedicated matmul cores on their own pipelines. Even though it costs extra silicon, and maybe the performance advantage of simultaneous execution is not that high, I would imagine that dedicated cores would be much more energy efficient, which we know Apple appreciates in a design. It also seems like that patent is leaning into the parallel nature of the data paths and the efficient doling out of different execution paths to threads, which again might support adding extra execution path types in an efficient manner. But maybe, like with forward progress guarantees, I'm reading too much into the patent!

From what I understand (and I very well might be wrong), the asynchronous instructions are there because access to shared memory is not simultaneous for all data elements. So it's not that they are executed asynchronously with other commands; what needs to be synchronized is the data access, not the instruction stream. All GPUs I am aware of are in-order: when an instruction starts, the program stalls until that instruction is done. It doesn't really matter where the instruction is executed, on the main general-purpose SIMD ALU or on the texture unit. However, even if the instruction has executed successfully, it does not mean that the results of the instruction are immediately usable. This is what "asynchronous" execution means in this context. You need a barrier or you might get garbage.
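As a rough analogy for what "asynchronous" means here, consider CUDA's asynchronous copy into shared memory (a sketch only; this is the copy path, not the matrix instructions, and the buffer size is arbitrary):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// The async copy is issued and the instruction stream moves on, but the data in
// shared memory is only safe to read after the wait, which acts as the barrier.
__global__ void async_copy_demo(const float* gmem, float* out)
{
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    cg::memcpy_async(block, tile, gmem, sizeof(float) * 256); // kicked off, not necessarily done
    // Reading `tile` at this point could return garbage.
    cg::wait(block);                                          // barrier: now the data is usable

    out[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```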

What is particularly confusing about this entire topic is the notion of concurrent execution. Yes, GPUs do execute instructions concurrently, but these are instructions from different threads. I am not aware of any GPU that can execute multiple instructions from one thread concurrently, unless you count VLIW and its variants (which would be concurrent in-thread execution fully controlled and scheduled by the compiler). When a thread stalls (because it has to wait for data, for example), the GPU scheduler will pick some other thread to execute.

Going back to matmul, I think the main difference between Nvidia and Apple is that Nvidia has some dedicated matmul hardware, in the form of a processing pipeline that can do more matrix operations per clock than the usual SIMD ALU. Apple does not have any hardware of this kind; however, they do have some hardware switches that allow them to route data to the SIMD ALU so that they can do a matmul with zero-cost data swizzles. I don't know if this makes much sense. Anyway, here is the patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US321831830&_cid=P21-LUB0EI-83832-1
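Purely to illustrate the idea of doing matmul-style work on the regular FP pipeline with data routing carrying the load, here is a sketch in CUDA terms (Apple's mechanism per the patent is routing at register fetch, not an explicit shuffle instruction, so treat the shuffles below as a stand-in):

```cpp
// A 32x32 matrix-vector product done entirely on the regular FP pipeline.
// Each lane owns one row of A and one element of x; the only "special" part
// is routing x[k] to every lane each step, here done with a broadcast shuffle.
__global__ void warp_matvec(const float* A, const float* x, float* y)
{
    const unsigned FULL_MASK = 0xffffffffu;
    int row = threadIdx.x & 31;               // lane index = row index

    float a_row[32];
    for (int k = 0; k < 32; ++k)
        a_row[k] = A[row * 32 + k];           // each lane keeps its row of A in registers

    float x_lane = x[row];                    // lane k holds x[k]

    float acc = 0.0f;
    for (int k = 0; k < 32; ++k) {
        float xk = __shfl_sync(FULL_MASK, x_lane, k);  // broadcast x[k] to the whole warp
        acc = fmaf(a_row[k], xk, acc);                 // ordinary fused multiply-add, no tensor unit
    }
    y[row] = acc;
}
```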
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
From what I understand (and I very well might be wrong), the asynchronous instructions are there because access to shared memory is not simultaneous for all data elements. So it's not that they are executed asynchronously with other commands; what needs to be synchronized is the data access, not the instruction stream. All GPUs I am aware of are in-order: when an instruction starts, the program stalls until that instruction is done. It doesn't really matter where the instruction is executed, on the main general-purpose SIMD ALU or on the texture unit. However, even if the instruction has executed successfully, it does not mean that the results of the instruction are immediately usable. This is what "asynchronous" execution means in this context. You need a barrier or you might get garbage.

What is particularly confusing about this entire topic is the notion of concurrent execution. Yes, GPUs do execute instructions concurrently, but these are instructions from different threads. I am not aware of any GPU that can execute multiple instructions from one thread concurrently, unless you count VLIW and its variants (which would be concurrent in-thread execution fully controlled and scheduled by the compiler). When a thread stalls (because it has to wait for data, for example), the GPU scheduler will pick some other thread to execute.

Robert Crovella talks about this in his posts. Instructions to a warp are issued back-to-back (with the caveat that there are four warp schedulers), but the execution of those instructions on different execution units can be simultaneous.



I believe this does apply to RT cores as well, and indeed different threads in different warps might be using different execution units.


Going back to matmul, I think the main difference between Nvidia and Apple is that Nvidia has some dedicated matmul hardware, in the form of a processing pipeline that can do more matrix operations per clock than the usual SIMD ALU. Apple does not have any hardware of this kind; however, they do have some hardware switches that allow them to route data to the SIMD ALU so that they can do a matmul with zero-cost data swizzles. I don't know if this makes much sense. Anyway, here is the patent: https://patentscope.wipo.int/search/en/detail.jsf?docId=US321831830&_cid=P21-LUB0EI-83832-1

I think this makes sense: they've made an intrinsic out of the shuffle and multiply-add steps of matrix multiplication. I know Nvidia has a similar operation: wmm (as opposed to wmma, wmm async, which targets the tensor cores). However, there is a catch here: another very similar operation is reduce, and while many Nvidia GPUs can technically perform that operation across a warp-sized cooperative group, only the most recent ones accelerate the calculation across the warp in hardware. This Apple patent would appear to describe the equivalent hardware acceleration for the matmul operation across the threads in a warp on an Apple GPU. It's possible that while an Nvidia GPU is technically able to perform said operation, it isn't currently accelerated. If that accelerated matmul operation across the threads in a warp performs as well and as efficiently as a tensor core (or close enough), Apple may indeed choose to save themselves the silicon and double down on that approach. Either way, I'm hoping they'll increase matmul performance in the next generation(s), and it'll be fascinating to see what approach they take.
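To make the reduce comparison concrete, here's a CUDA sketch of the two paths (the hardware-accelerated __reduce_add_sync only exists on compute capability 8.0 and newer, and only for integer types, while the shuffle loop runs on basically anything; the function names are mine):

```cpp
// Two ways to sum a value across a warp.
__device__ int warp_sum_shuffle(int v)
{
    // Portable path: five shuffle+add steps on the regular ALU.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;                      // lane 0 ends up with the full sum
}

__device__ int warp_sum_hw(int v)
{
#if __CUDA_ARCH__ >= 800
    // Hardware-accelerated warp reduce (redux.sync), Ampere and newer, integers only.
    return __reduce_add_sync(0xffffffffu, v);
#else
    return warp_sum_shuffle(v);    // fall back to the shuffle version elsewhere
#endif
}
```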
 
Last edited:

leman

Site Champ
Posts
641
Reaction score
1,196
Robert Crovella talks about this in his posts. Instructions to a warp are issued back-to-back (with the caveat that there are four warp schedulers), but the execution of those instructions on different execution units can be simultaneous.

If I remember correctly, Nvidia instructions contain scheduling information, so I can absolutely see two instructions in a single warp being issued back to back if the second does not depend on the first. Generally, though, the scheduler will interleave different warps to hide execution latency.

I know Nvidia has a similar operation: wmm (as opposed to wmma, wmm async, which targets the tensor cores).

Are you sure these target different hardware units? I couldn't find this kind of info in the documentation, but then again I only lightly skimmed it.
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
If I remember correctly, Nvidia instructions contain scheduling information, so I can absolutely see two instructions in a single warp being issued back to back if the second does not depend on the first. Generally, though, the scheduler will interleave different warps to hide execution latency.
Yup. So we're on the same page that different execution units can simultaneously execute independent commands from the same/different warp. However, I was very wrong on the next point.
Are you sure these target different hardware units? I couldn't find this kind of info in the documentation, but then again I only lightly skimmed it.

I think I'm wrong on two counts here. The trivial first mistake is that the commands are wmma (sync) vs wgmma (async). The more important second mistake is that, rereading their descriptions, no, I don't think they actually target different hardware units. With the lengthy description of wmma I got confused between distributing the registers and calculations across the threads of a warp using the floating-point units versus sending the calculations to the tensor units, and thought it was describing the former rather than the latter. Combined with RCrovella's description of tensor cores operating asynchronously, I assumed that meant wgmma was the tensor core version and wmma was the floating-point unit version.

The first clue should've been that wgmma has a comparatively shorter description simply because it is the same as wmma but with memory fences that allow it to execute asynchronously even with dependent commands (added in Hopper); the second clue is that wmma requires GPUs with tensor cores, and the different commands correspond to when those features were added in successive generations of tensor cores. The thing I have to keep in mind is that the registers for the matrices are still held by the threads in each warp regardless of which execution unit is operating on them. Even though I knew it wasn't the case, I was still thinking of the tensor core more like an accelerator within the GPU (à la AMX), which, again, it isn't.
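A small illustration of that last point, as I understand it (a sketch only; the mapping of fragment elements to lanes is opaque and architecture-specific, and the kernel is made up):

```cpp
#include <mma.h>
using namespace nvcuda;

// The accumulator tile is just an opaque fragment living in each thread's
// registers; every lane of the warp holds a few of its elements, regardless
// of which execution unit ends up doing the matrix math.
__global__ void scale_accumulator(float* C)
{
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

    for (int i = 0; i < c_frag.num_elements; ++i)   // per-lane view: plain registers
        c_frag.x[i] *= 0.5f;

    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```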
 
Last edited:

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
It looks like I can't go back and strike out wrong information from posts that are too old. So, for everyone keeping score at home, this is the current state of play as far as the growth of my limited understanding goes, mucking my way through the topic:

1. Yes, Nvidia implemented the VK_NV_cooperative_matrix extension, which can work regardless of tensor cores (TC) but is accelerated when tensor cores are present. That means Nvidia GPUs can theoretically do matmul using the floating-point (FP) execution units, but it isn't going to be "accelerated", and on modern Nvidia GPUs it is likely to run on the TC units.

2. Point #1 has nothing to do with the differences between the wmma and wgmma instructions as posited in post #82, and post #82 is flawed as a result. Both of those instructions are for the TC unit.

3. By splitting the tensor core off as a separate execution unit, Nvidia can execute independent instructions on the FP and TC execution units (wmma), and Hopper can even execute dependent FP/TC instructions with the appropriate memory fences (wgmma).

4. Apple has a patent for doing the matrix calculations on their normal FP pipelines while accelerating the routing of data, but they note that their routing circuitry is execution-unit independent. Nvidia has similar data routing capabilities but added an execution unit specifically designed for matrix calculations. Whether Apple adds one to the GPU remains to be seen; they may or may not feel the need to, based on the capabilities already in place or yet to be added.

Let's see how much of the above is still wrong. :)
 
Last edited:

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
@theorist9, @mr_roboto you’ve probably already figured this out and I definitely don’t want to sound condescending, but, in the Macrumors version of this thread, sunny5 is just a troll. I used to think they were just naive and inquisitive but that notion dissipated quickly. He makes m7chy look good (at least m7chy had some technical knowledge, even if they abused what little they had).

On the other hand you (and others) have posted valuable and interesting information for other people to read in response to his inane trolling, but I just don’t want you guys spending sanity on such a grating individual. If you’re not, feel free to ignore me - again not trying to mother you, just giving a probably unneeded heads up. I have precious little sanity left so that’s why I don’t post there much anymore and, even when I do, I try not to engage with posters like sunny5. But that’s just me.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
@theorist9, @mr_roboto you’ve probably already figured this out and I definitely don’t want to sound condescending, but, in the Macrumors version of this thread, sunny5 is just a troll. I used to think they were just naive and inquisitive but that notion dissipated quickly. He makes m7chy look good (at least m7chy had some technical knowledge, even if they abused what little they had).

On the other hand you (and others) have posted valuable and interesting information for other people to read in response to his inane trolling, but I just don’t want you guys spending sanity on such a grating individual. If you’re not, feel free to ignore me - again not trying to mother you, just giving a probably unneeded heads up. I have precious little sanity left so that’s why I don’t post there much anymore and, even when I do, I try not to engage with posters like sunny5. But that’s just me.
I've seen that and it's absolute trolling. Goalpost moving, avoiding evidence. Sunny5 is emblematic of everything wrong with MacRumors.
 

mr_roboto

Site Champ
Posts
288
Reaction score
464
@theorist9, @mr_roboto you’ve probably already figured this out and I definitely don’t want to sound condescending, but, in the Macrumors version of this thread, sunny5 is just a troll. I used to think they were just naive and inquisitive but that notion dissipated quickly. He makes m7chy look good (at least m7chy had some technical knowledge, even if they abused what little they had).

On the other hand you (and others) have posted valuable and interesting information for other people to read in response to his inane trolling, but I just don’t want you guys spending sanity on such a grating individual. If you’re not, feel free to ignore me - again not trying to mother you, just giving a probably unneeded heads up. I have precious little sanity left so that’s why I don’t post there much anymore and, even when I do, I try not to engage with posters like sunny5. But that’s just me.
Oh, believe me, I know. I had a much more sharply worded post in that thread, calling him out in a more direct way. As often happens, the moderation staff there decided to protect the troll and deleted my post.
 

leman

Site Champ
Posts
641
Reaction score
1,196
Oh, believe me, I know. I had a much more sharply worded post in that thread, calling him out in a more direct way. As often happens, the moderation staff there decided to protect the troll and deleted my post.

I got banned twice because of sunny-boy :D or was it m7chy-kun? I forget. They are like an unholy duality to me. Great training for my anger management issues, though.
 

jbailey

Power User
Posts
170
Reaction score
187
I got banned twice because of sunny-boy :D or was it m7chy-kun? I forget. They are like an unholy duality to me. Great training for my anger management issues, though.
Use the ignore user function liberally. There is a new one over there called high iq person or something like that. Immediate ignore.
 

B01L

SlackMaster
Posts
175
Reaction score
131
Location
Diagonally parked in a parallel universe...
Rumblings of the next Ultra SoC being a stand-alone die removing the redundant subsystems, rather than two Mn Max chips stitched together...

Maybe the next Ultra SoC has an UltraFusion of some sort, and pairs said Ultra chip with a GPU-specific chip, so the new Extreme SoC is equal to two top end Nvidia GPUs...?



I feel kinda dirty linking Max Tech in here...
 

casperes1996

Power User
Posts
185
Reaction score
171
Rumblings of the next Ultra SoC being a stand-alone die removing the redundant subsystems, rather than two Mn Max chips stitched together...

Maybe the next Ultra SoC has an UltraFusion of some sort, and pairs said Ultra chip with a GPU-specific chip, so the new Extreme SoC is equal to two top end Nvidia GPUs...?



I feel kinda dirty linking Max Tech in here...

That whole video is very unlikely to hold any truth at all, IMO. Maybe the Ultra will be a new monolithic die, but I'll bet on there not being an M4 Ultra reveal at WWDC. An M3 Ultra, sure. But I highly doubt we skip straight to an M4 Ultra.
 

B01L

SlackMaster
Posts
175
Reaction score
131
Location
Diagonally parked in a parallel universe...
Maybe the next Ultra SoC has an UltraFusion of some sort, and pairs said Ultra chip with a GPU-specific chip, so the new Extreme SoC is equal to two top end Nvidia GPUs...?
...the Ultra will be a new monolithic die... ...skipping straight into 4 Ultra...

"Presenting the all-new M4 Extreme, a monolithic M4 Ultra paired with a monolithic GPU-specific die; we think you're going to love it!" - Tim Cook, probably...
 