The mystery of Apple M3 on-chip shared memory

leman

So, as I've been sick with Covid lately, my feverish brain wanted to finally do some Apple GPU microbenchmarks (yay!). One particular topic of interest is shared memory (threadgroup memory, as Apple calls it). Why is this interesting? Well, it has long been known in GPGPU that shared memory is banked, and different access patterns can have very different performance. This is extensively documented for CUDA and is part of any Nvidia optimization guide. A nice description of the phenomenon is here: http://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
So when designing high-performance algorithms that use cooperative kernels, it is important to keep this in mind and try to avoid bank conflicts.
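
To make the access pattern concrete: in a Metal compute kernel, a threadgroup-memory index of the form thread_index * stride is exactly the kind of pattern that produces bank conflicts once the stride maps many lanes of a SIMD-group onto the same bank. Here is a minimal, hypothetical sketch of such a strided store kernel in Metal Shading Language (a C++ dialect); the names, loop count, and function-constant stride are mine for illustration, not necessarily how the benchmarks below are written:

```cpp
#include <metal_stdlib>
using namespace metal;

// Access stride under test, supplied by the host as a function constant.
constant uint STRIDE [[function_constant(0)]];

kernel void tg_store_stride(device float* out          [[buffer(0)]],
                            threadgroup float* scratch [[threadgroup(0)]],
                            uint tid                    [[thread_index_in_threadgroup]],
                            uint tg_size                [[threads_per_threadgroup]])
{
    float v = float(tid);
    // Hammer threadgroup memory with a strided index; the wrap keeps it in range.
    for (uint i = 0; i < 1024; ++i) {
        uint idx = (tid * STRIDE + i) % tg_size;
        scratch[idx] = v + i;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // Read something back so the compiler cannot eliminate the loop.
    out[tid] = scratch[tid];
}
```

On a classically banked design, running this with a stride of 1 versus 32 is the difference between every lane hitting its own bank and all lanes of a SIMD-group serializing on a single one.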

How does this look on Apple hardware, and what are the best practices? I wrote a series of kernels that hammer the threadgroup memory in three different scenarios (store only, load only, copy). As usual, treat this with a grain of salt, as these are fairly artificial scenarios and do not reflect real-world usage. First, the results (M1 is G13, M3 is G15):

[Attached charts: benchmark results (performance vs. access stride) for the store-only, load-accumulate, and copy kernels on M1 and M3]

M1 is easy: this is a classical shared memory with 32 independent banks. Using a stride of 2 (i.e. indexing with thread_index*2) means you are hitting only every second bank, so your performance goes down. A stride of 4 means you are hitting only every fourth bank, and so on; the pattern is pretty much the same up to a stride of 32, where all threads are accessing the same memory bank. So it behaves exactly like your mainstream Nvidia/AMD GPU.
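
To make the arithmetic behind that explanation explicit, here is a tiny standalone C++ calculation of the conflict degree per stride, assuming the usual model of 32 banks, 4-byte bank words, and 32 lanes per SIMD-group (an assumption about G13, not something Apple documents):

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

int main() {
    constexpr int kLanes = 32;  // threads per SIMD-group
    constexpr int kBanks = 32;  // assumed bank count, each one 4-byte word wide

    for (int stride : {1, 2, 4, 8, 16, 32}) {
        std::array<int, kBanks> hits{};
        for (int lane = 0; lane < kLanes; ++lane)
            ++hits[(lane * stride) % kBanks];  // word index -> bank index
        // The busiest bank determines how many cycles the access serializes into.
        int conflict = *std::max_element(hits.begin(), hits.end());
        std::printf("stride %2d -> %2d-way conflict\n", stride, conflict);
    }
}
```

A stride of 2 gives a 2-way conflict, a stride of 4 a 4-way conflict, and a stride of 32 funnels the entire SIMD-group into a single bank, matching the worst case described above.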

M3, on the other hand, is a hot mess. I don't understand anything. Looking at the store-only kernel performance, we can see some bank-conflict-like effects (strides 8, 16, 24, and 32 are slower), but there is no general penalty for other even strides, and there is a clear preference for coalesced stores (strides up to 4), plus some additional semi-cyclical effects(?). We can see similar behaviour in the copy (load+store) kernel, just much more subtle. The load-accumulate kernel is just confusing: there is an obvious penalty for even strides, but we also have this lower-performance region in the middle, and a stride of 32 is really fast. This is all consistent across multiple runs, btw. I have no idea how this memory is organised under the hood. Maybe someone here with an actual background in caches and memory can see something in these graphs.
 

dada_dave

Would you mind sharing your code for probing the cache and especially the float/integer throughput? I have the same machine so I’m not expecting anything new but just thought I’d like to play around with it. Thanks!
 

leman

dada_dave said:
Would you mind sharing your code for probing the cache and especially the float/integer throughput? I have the same machine so I’m not expecting anything new but just thought I’d like to play around with it. Thanks!

Wanted to let you know that I have not forgotten about this. It's just that with the three suddenly super urgent project deadlines, the extremely stressful job negotiations in the US, and the fact that we are trying to sell the family assets in a country currently at war, I don't really get much spare time. So I don't currently have an ETA.
 

dada_dave

leman said:
Wanted to let you know that I have not forgotten about this. It's just that with the three suddenly super urgent project deadlines, the extremely stressful job negotiations in the US, and the fact that we are trying to sell the family assets in a country currently at war, I don't really get much spare time. So I don't currently have an ETA.
Take your time; I appreciate the update. And of course, my best wishes for all of that. That sounds like a rough time.
 

Jimmyjames

leman said:
Wanted to let you know that I have not forgotten about this. It's just that with the three suddenly super urgent project deadlines, the extremely stressful job negotiations in the US, and the fact that we are trying to sell the family assets in a country currently at war, I don't really get much spare time. So I don't currently have an ETA.
Sounds very stressful. Hope it improves quickly for you and your family.
 