Metal 3

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
Don't know if a lot of people here are interested in the new Metal stuff. For me, the Metal 3 announcement has been super exciting. The SOTU didn't go into much depth about Metal 3, but the most interesting aspects seem to be:
  • MetalFX Upscaling
  • Mesh shaders
  • Faster loading I/O for models
  • More talks about bindless rendering
MetalFX Upscaling seems particularly interesting. Is it something akin to NVIDIA's DLSS? Apple said that it renders at a lower resolution while maintaining the *same* rendering quality. That can't be true, but maybe it does come very close.

Also, the last Metal talk is about GPU scaling on Apple GPUs. I bet we'll get a peek at why the M1 Ultra GPU didn't scale linearly in some applications.
 

diamond.g

Power User
Posts
109
Reaction score
42
Don't know if a lot of people here are interested in the new Metal stuff. For me, the Metal 3 announcement has been super exciting. The SOTU didn't go into much depth about Metal 3, but the most interesting aspects seem to be:
  • MetalFX Upscaling
  • Mesh shaders
  • Faster loading I/O for models
  • More talks about bindless rendering
MetalFX Upscaling seems particularly interesting. Is it something akin to NVIDIA's DLSS? Apple said that it renders at a lower resolution while maintaining the *same* rendering quality. That can't be true, but maybe it does come very close.

Also, the last Metal talk is about GPU scaling on Apple GPUs. I bet we'll get a peek at why the M1 Ultra GPU didn't scale linearly in some applications.
I think it is closer to FSR 2.0 (in that it doesn't need dedicated ML hardware) than DLSS, but yeah, same concept. Wonder if they will "Open Source" it like AMD did to get better adoption (on cross-platform titles).
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
I think it is closer to FSR 2.0 (in that it doesn't need dedicated ML hardware) than DLSS, but yeah, same concept. Wonder if they will "Open Source" it like AMD did to get better adoption (on cross-platform titles).
Makes sense, since we haven't heard of any dedicated ML hardware on the GPU that could be used for this. I didn't know AMD (and Intel, apparently) had a technology for this until just now. Interesting.

I hope they give more info about how the upscaling tech works on tomorrow's Metal talks. I think I'll be able to use it for the app I'm developing.
 

diamond.g

Power User
Posts
109
Reaction score
42
Makes sense, since we haven't heard of any dedicated ML hardware on the GPU that could be used for this. I didn't know AMD (and Intel, apparently) had a technology for this until just now. Interesting.

I hope they give more info about how the upscaling tech works on tomorrow's Metal talks. I think I'll be able to use it for the app I'm developing.
I wonder how low of a resolution they are targeting for the upscaling, like render at 1600p and upscale to 4k?
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
I wonder how low of a resolution they are targeting for the upscaling, like render at 1600p and upscale to 4k?
They mentioned the MacBook Air running at 1080p-like resolution for one of the games, so I guess 720p -> 1080p on the low(er) end machines?
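If it helps to put numbers on that guess, here's a quick Python check of how many pixels a 1.5x-per-axis upscale saves. The 720p -> 1080p pairing is the one guessed above; 1440p -> 4K is added only as a comparable ratio, not something Apple has confirmed:

```python
# Pixel counts for two plausible upscaling ratios (illustrative only).
def pixels(w, h):
    return w * h

# 720p -> 1080p: the render resolution is ~44% of the output pixels.
ratio_720_1080 = pixels(1280, 720) / pixels(1920, 1080)

# 1440p -> 4K (2160p): same ~44%, since both are a 1.5x upscale per axis.
ratio_1440_2160 = pixels(2560, 1440) / pixels(3840, 2160)

print(round(ratio_720_1080, 3), round(ratio_1440_2160, 3))
```

So either way the GPU would shade well under half the output pixels, which is where the headroom for the upscaler comes from.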
 

tomO2013

Active member
Vaccinated
Posts
26
Reaction score
43
There have been rumours that Apple may be considering a gaming-oriented Apple TV. It seems plausible when you consider the lack of any Apple TV mention today and capabilities such as upscaling in Metal 3.

I’m also interested in watching the ray tracing talks to see what’s new.
 

theorist9

Power User
Posts
72
Reaction score
44
Some basic questions, since I don't know anything about this:

1) Games (and scientific visualizations): What's the story with TBDR? From what I've gathered (which may be wrong), (a) if you're writing a AAA-class game and want it to be maximally optimized to run on AS, you need to make use of this; and (b) other platforms don't use this, so writing games in a way that takes advantage of TBDR is a barrier in terms of developer expertise. Or is it the case that, if you're writing in Metal, using TBDR is simply a natural part of that? When Capcom made the new RE game for macOS, did they likely make extensive use of TBDR?

2) ML: IIUC, NVIDIA is the dominant player in high-end ML for three reasons: performance (which means both the software tools [CUDA, including cuDNN] and the hardware), ease of use (a lot of ML users are scientists rather than programmers, and most find CUDA pretty accessible), and the size of the community/knowledge base. That doesn't seem to be Apple's market (e.g., their GPUs are still single-precision). But how much of a place is there for AS in ML at the lower end? It seems that, even at the lower end, for someone who wants to do anything more than occasional ML, the path of least resistance is to get an NVIDIA box. How accessible is Metal for doing ML as compared with coding for CUDA?

3) Is Metal 3 fully supported on every Mac that can run Ventura, or are certain features available only on AS?
 
Last edited:

leman

Power User
Posts
76
Reaction score
176
Some basic questions, since I don't know anything about this:

1) Games (and scientific visualizations): What's the story with TBDR? From what I've gathered (which may be wrong), (a) if you're writing a AAA-class game and want it to be maximally optimized to run on AS, you need to make use of this; and (b) other platforms don't use this, so writing games in a way that takes advantage of TBDR is a barrier in terms of developer expertise. Or is it the case that, if you're writing in Metal, using TBDR is simply a natural part of that? When Capcom made the new RE game for macOS, did they likely make extensive use of TBDR?

What follows is an attempt at a detailed breakdown, but first a short TL;DR: TBDR is always on and all applications benefit from it; applications running on a TBDR GPU can be further optimised by opting in to some TBDR-specific APIs, some of which are trivial to use and can offer big wins, some of which require significant redesigns while potentially offering much bigger wins. You choose how far to go down this rabbit hole as a developer. Does Capcom utilise all the features of Apple GPUs? Only they and Apple know :) And now in more detail:

TBDR is a specific way to do primitive rasterization and shading. It consists of two parts: tile-based (TB) rasterization and deferred rendering/shading (DR). In more detail:

Tile-based (TB) rasterization: the screen is split into small tiles (usually 32x32 pixels), the rendered primitives are sorted by the tile they intersect and the processing is done per tile rather than per whole primitive. TB is a memory bandwidth optimisation technique. Since you know that the processing is limited to a specific small area of the screen, you can do all the work using fast on-chip memory, which saves you roundtrips to the video memory. You only need to save the final result (the drawn tile) to the video memory, which is much more efficient than sending back and forth individual pixel data, especially if you have many overlapping primitives etc. And since pixels you are processing are guaranteed to be close to each other, it is also more likely that the texture data you have to fetch will share cache lines, further improving memory bandwidth efficiency.
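To make the binning idea concrete, here's a toy Python sketch. The 32x32 tile size matches the description above, but the bounding-box test and all names are mine; real hardware bins far more precisely than this:

```python
# Toy tile binning: assign each triangle to every 32x32 screen tile its
# bounding box overlaps. Real rasterisers use exact edge tests; this
# only illustrates the "sort primitives by tile" idea.
TILE = 32

def tiles_for_triangle(tri, screen_w, screen_h):
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0, x1 = max(0, min(xs)), min(screen_w - 1, max(xs))
    y0, y1 = max(0, min(ys)), min(screen_h - 1, max(ys))
    return {(tx, ty)
            for ty in range(y0 // TILE, y1 // TILE + 1)
            for tx in range(x0 // TILE, x1 // TILE + 1)}

def bin_triangles(triangles, screen_w, screen_h):
    bins = {}  # tile coordinate -> list of triangle indices to process
    for i, tri in enumerate(triangles):
        for tile in tiles_for_triangle(tri, screen_w, screen_h):
            bins.setdefault(tile, []).append(i)
    return bins

# A small triangle lands in a single tile; a large one spans several.
bins = bin_triangles([[(2, 2), (10, 2), (2, 10)],
                      [(0, 0), (90, 0), (0, 90)]], 128, 128)
print(sorted(bins))
```

Once the bins are built, each tile's triangle list can be rasterised entirely in on-chip memory, which is the bandwidth win described above.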

Deferred rendering/shading (DR): the pixel shader is only run after all the primitives have been rasterised. This improves shader core utilisation. First, you only have to shade visible pixels — shading is done after the pixel visibility has been fully determined (this is why they say that TBDR has perfect hidden surface removal). Second, with DR you shade the entire tile at once, that is, the shader is run on the full tile of 32x32 pixels, which is much more efficient than dispatching shader work for pixels of individual primitives, especially if the primitives are small (you lose SIMD efficiency around the primitive edges).
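A toy model of the hidden-surface-removal win for a single pixel — depth values and the worst-case submission order here are made up purely for illustration:

```python
# Overdraw comparison for one pixel covered by N opaque layers.
# An immediate-mode GPU may shade each layer as it arrives (worst case:
# back-to-front submission); a deferred renderer resolves visibility
# first and shades the pixel exactly once.
def imr_shades(depths_in_submission_order):
    """Worst-case IMR: shade whenever the incoming fragment passes the
    depth test against what has been drawn so far."""
    shaded, nearest = 0, float("inf")
    for d in depths_in_submission_order:
        if d < nearest:          # closer than anything drawn so far
            shaded += 1          # fragment shader runs, result later hidden
            nearest = d
    return shaded

def tbdr_shades(depths):
    # Perfect HSR: only the closest fragment is ever shaded.
    return 1 if depths else 0

layers = [9.0, 7.0, 5.0, 3.0, 1.0]   # back-to-front: pathological for IMR
print(imr_shades(layers), tbdr_shades(layers))
```

With five overlapping opaque layers submitted back-to-front, the IMR worst case shades the pixel five times; the deferred renderer shades it once regardless of submission order.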

Virtually all mobile rasterisers do TB rasterization, since it's easy to do and brings very obvious benefits on devices with constrained memory bandwidth. What's more, since TB approaches massively improve cache utilisation, modern desktop GPUs have also adopted some tile-based methods — that is the secret behind the improved power efficiency of Nvidia's Maxwell and later architectures, for example. Almost nobody does the deferred shading part, however, since that is the notoriously difficult part of the story. The only company that managed to successfully develop DR technology is Imagination; Apple got it from them and further refined it. That's why Apple is in the rather unique position that their GPUs are "true" TBDR while other mobile GPUs are just "TB".

So, now that we have established the terminology, we can get to your question :)

Any application that does GPU-driven rasterization, no matter the API they use, is going to benefit from TBDR on Apple GPUs, simply because that's how Apple GPUs operate. But it is possible to get more performance/efficiency by explicitly taking advantage of the architecture of these GPUs.

The first, low-hanging fruit is to provide the GPU with some hints as to how the data is used. Recall that a TB GPU operates on small tiles. The data for these tiles has to be fetched from the frame buffer (in the video RAM) and then written back, which obviously requires memory bandwidth. But quite often, you can skip these steps. For example, if you start your drawing from a clean slate, you don't have to fetch the tile contents from the frame buffer. Or, if you are not using the depth buffer for any further post-processing (like ambient occlusion), you don't have to save its data to system memory at all. By telling the API which components of the frame buffer you will use — and how — a TB GPU can dramatically reduce memory transfers and thus improve efficiency. Such APIs are not exclusive to Apple — although Apple was the first to come up with them — they benefit any modern mobile GPU, and that's why they are supported everywhere from DX12 to Vulkan (and obviously, in Metal). The only problem is that developers are often not too familiar with them, as they are used to desktop GPUs where such hints are meaningless, so they often don't set them up properly. Regardless, doing this is very simple and any bugs can be fixed quickly, so it's the super easy part of making things run better on tile-based renderers.
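A back-of-the-envelope model of why those hints matter. The byte counts are for one RGBA8 tile, and the action names are only loosely modelled on the real load/store hints, not an actual API:

```python
# Rough model of tile <-> video memory traffic for one render target on
# a tile-based GPU. Numbers and action names are made up to illustrate
# the effect of load/store hints.
TILE_BYTES = 32 * 32 * 4      # one 32x32 tile of RGBA8 pixels

def tile_traffic(load, store):
    """Bytes moved between tile memory and video memory for one tile.
    load:  'load' fetches the old contents; 'clear'/'dont_care' do not.
    store: 'store' writes the result back; 'dont_care' discards it."""
    traffic = 0
    if load == "load":
        traffic += TILE_BYTES
    if store == "store":
        traffic += TILE_BYTES
    return traffic

# Starting from a clean slate, or discarding a depth buffer that has no
# further use, halves or eliminates the per-tile traffic.
print(tile_traffic("load", "store"),       # naive: fetch and write back
      tile_traffic("clear", "store"),      # e.g. a colour target
      tile_traffic("clear", "dont_care"))  # e.g. a depth-only pass
```

Multiply that saving by every tile of every render target of every frame and it becomes clear why mis-set hints are such a common, and cheap to fix, porting bug.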

The other optimisation step, which is much more intricate and limited to Apple GPUs only, is to use the special capabilities Apple exposes in Metal. These capabilities fully expose the nature of the TBDR GPU and allow you to do very complex per-tile processing without ever touching the slow video memory. Apple allows you to freely mix pixel and compute shaders and store complex data structures in the on-chip tile memory, which can allow you to implement many complex rendering algorithms in a much simpler and far more effective fashion. Examples include advanced per-pixel shading, which has traditionally been an expensive technique to use. With a "regular renderer", you have to consider every light for every pixel. With Apple's TBDR extensions, you can use a compute shader to gather all the lights that can affect pixels in a tile (this is much cheaper to do for a tile than for individual pixels) and then light all pixels in the tile at once. These are really cool features and easily my favourite part of Metal and Apple Silicon, but they require you to rethink your algorithms and approaches and are therefore a less likely target for a straightforward port.
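The per-tile light gathering can be sketched like this — a 2D toy version where the light positions, radii, and the conservative intersection test are all illustrative, not how a real culling shader is written:

```python
# Toy tiled light culling: instead of testing every light against every
# pixel, gather the lights whose radius reaches a tile once per tile,
# then shade all pixels in the tile against that short list.
import math

def lights_for_tile(tile_x, tile_y, tile_size, lights):
    # Conservative test: light circle vs. tile bounding square (2D here).
    cx = tile_x * tile_size + tile_size / 2
    cy = tile_y * tile_size + tile_size / 2
    half = tile_size / 2
    hits = []
    for i, (lx, ly, radius) in enumerate(lights):
        dx = max(abs(lx - cx) - half, 0)  # distance outside the square
        dy = max(abs(ly - cy) - half, 0)
        if math.hypot(dx, dy) <= radius:
            hits.append(i)
    return hits

lights = [(16, 16, 10), (200, 200, 10), (40, 16, 30)]
# Tile (0, 0) covers pixels 0..31: the distant light gets culled.
print(lights_for_tile(0, 0, 32, lights))
```

One intersection test per light per tile replaces one per light per pixel; the surviving short list is what each pixel in the tile is then lit against.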




2) ML: IIUC, NVIDIA is the dominant player in high-end ML for three reasons: performance (which means both the software tools [CUDA, including cuDNN] and the hardware), ease of use (a lot of ML users are scientists rather than programmers, and most find CUDA pretty accessible), and the size of the community/knowledge base. That doesn't seem to be Apple's market (e.g., their GPUs are still single-precision). But how much of a place is there for AS in ML at the lower end? It seems that, even at the lower end, for someone who wants to do anything more than occasional ML, the path of least resistance is to get an NVIDIA box. How accessible is Metal for doing ML as compared with coding for CUDA?

I think there are two parts to this. First, almost nobody uses GPU APIs like CUDA or Metal directly to do ML. People use an ML framework such as TensorFlow or PyTorch. That's the beauty of it all. You can develop and test your ML model on your ultracompact Mac laptop and then push the same model to a supercomputer for the real training. These frameworks abstract away the APIs and the GPUs they run on. Nvidia GPUs are still faster of course. The goal is not to beat Nvidia here; the goal is to make working with these frameworks on a Mac just fast and convenient enough that the other benefits of the Mac (efficiency, weight, ergonomics, convenience) will push data folks to choose a Mac laptop. And who knows what the future will bring. Maybe next-gen Apple prosumer chips will come with a much more capable ML accelerator and you'll get a huge speedup for your existing code.

The second part is using the GPU APIs to implement ML directly. As I said, this is rarely done — mostly when you have some very exotic needs, or you already know what your model is and just want to integrate it into your application and optimize the hell out of it. For the latter, Apple does have an efficient NPU that's optimised for low-power ML inference in applications. But that is not something data scientists are interested in.

3) Is Metal 3 fully supported on every Mac that can run Ventura, or are certain features available only on AS?

It's supported on all newer Intel and AMD GPUs (Iris Plus and Vega and up) — so you get things like raytracing etc. on all of them. Some features (like all the TBDR extensions etc.) are obviously Apple Silicon only.
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
Nice to see you around, @leman :)

Tile-based (TB) rasterization: the screen is split into small tiles (usually 32x32 pixels), the rendered primitives are sorted by the tile they intersect and the processing is done per tile rather than per whole primitive. TB is a memory bandwidth optimisation technique. Since you know that the processing is limited to a specific small area of the screen, you can do all the work using fast on-chip memory, which saves you roundtrips to the video memory. You only need to save the final result (the drawn tile) to the video memory, which is much more efficient than sending back and forth individual pixel data, especially if you have many overlapping primitives etc. And since pixels you are processing are guaranteed to be close to each other, it is also more likely that the texture data you have to fetch will share cache lines, further improving memory bandwidth efficiency.

Deferred rendering/shading (DR): the pixel shader is only run after all the primitives have been rasterised. This improves shader core utilisation. First, you only have to shade visible pixels — shading is done after the pixel visibility has been fully determined (this is why they say that TBDR has perfect hidden surface removal). Second, with DR you shade the entire tile at once, that is, the shader is run on the full tile of 32x32 pixels, which is much more efficient than dispatching shader work for pixels of individual primitives, especially if the primitives are small (you lose SIMD efficiency around the primitive edges).

To build a bit on top of this explanation, just before this stage (the tile-based rasterization) vertices have to be sent to each of the tiles they intersect with. This diagram, from Asahi Lina's channel, was very informative to me (the whole start of the stream is well worth watching, really):

[Screenshot: tiler pipeline diagram from Asahi Lina's stream]


This is also the reason Multi-Sample Antialiasing (MSAA) is cheaper on Apple Silicon. On the rasterization + HSR stage, you can use a supersampled tile texture (say, for 4x MSAA, you'd render to a 64x64 texture for a 32x32 tile). Since you know which pixels are going to be visible (thanks to HSR), you'd just run your fragment shader for those 64x64 pixels (at most, assuming the entire tile is covered by primitives). On immediate-mode renderer (IMR) GPUs, you'd shade every triangle individually, and you could end up with a lot more than 64x64 fragment shader calls for a 32x32 section of the image. And since everything is already in tile memory, there's no high bandwidth cost associated with it: you'd be fetching data from tile memory anyway. You don't even have increased bandwidth costs at the end of the render pass, since Metal has the .multisampleResolve store action to downscale the shaded texture back to the original resolution at the end of tile shading. So the multisampled texture doesn't even 'exist' outside the tile.
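Putting rough numbers on that comparison — the overdraw factor is an arbitrary example, not a measurement:

```python
# Fragment-shader work for 4x MSAA in one 32x32 tile. With HSR the tile
# is shaded at most once per sample; an IMR GPU additionally pays
# overdraw, re-shading covered samples for each triangle that passed the
# depth test at the time. Purely illustrative arithmetic.
TILE = 32
MSAA = 4                                      # 4 samples per pixel

samples_per_tile = TILE * TILE * MSAA         # TBDR upper bound
overdraw = 3                                  # say, 3 overlapping layers
imr_worst_case = samples_per_tile * overdraw  # IMR with that overdraw

print(samples_per_tile, imr_worst_case)
```

So the TBDR cost is capped by the tile's sample count, while the IMR cost grows with scene overdraw on top of it.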

And yet, it's not uncommon for Mac games to implement FXAA antialiasing (Borderlands 2 comes to mind), which is a screen-space effect that doesn't leverage any Apple Silicon strengths. But FXAA works on any kind of GPU, and Mac games are often multiplatform.

Also, there are a lot of things that are allowed in Metal but disable HSR: write masking, device buffer writes or depth texture writes from fragment shaders, or semi-transparent geometry, for example. Some of those things don't matter much on IMR GPUs, since you'd be shading every triangle anyway. So renderers may carelessly do any of the above (even if there are alternative ways of doing things that avoid disabling HSR), because those things don't have a higher associated cost on IMR GPUs — but they may substantially underutilize the GPU's capabilities when that code is ported to TBDR GPUs.

The other optimisation step, which is much more intricate and limited to Apple GPUs only, is to use the special capabilities Apple exposes in Metal. These capabilities fully expose the nature of the TBDR GPU and allow you to do very complex per-tile processing without ever touching the slow video memory. Apple allows you to freely mix pixel and compute shaders and store complex data structures in the on-chip tile memory, which can allow you to implement many complex rendering algorithms in a much simpler and far more effective fashion. Examples include advanced per-pixel shading, which has traditionally been an expensive technique to use. With a "regular renderer", you have to consider every light for every pixel. With Apple's TBDR extensions, you can use a compute shader to gather all the lights that can affect pixels in a tile (this is much cheaper to do for a tile than for individual pixels) and then light all pixels in the tile at once. These are really cool features and easily my favourite part of Metal and Apple Silicon, but they require you to rethink your algorithms and approaches and are therefore a less likely target for a straightforward port.
This. Apple emphasizes a lot that opportunities to merge distinct render passes into a single pass should be leveraged, since merged passes don't have to go through system memory (other than at the start/end of the pass). Sadly, it's difficult to do without reorganizing your renderer with Apple Silicon in mind.

It's supported on all newer Intel and AMD GPUs (Iris Plus and Vega and up) — so you get things like raytracing etc. on all of them. Some features (like all the TBDR extensions etc.) are obviously Apple Silicon only.
I'm surprised that they dropped A12/A12X, though. My 12.9" iPad Pro won't support Metal 3, and neither will iPhones X, XS, XR.

[Screenshot: Metal 3 supported devices chart]
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
According to a game dev posting on Andrew Tsai's video about Metal 3:
The issue with DirectX 12 is not the lack of geometry shaders or stream output. The issue is the binding model and that has not changed with Metal 3. D3D12 works by binding subsets of descriptor heaps. D3D12 guarantees that descriptor heaps can contain up to a million entries. Metal only supports up to 500k.
Apparently, this is the biggest issue that CodeWeavers has to deal with in supporting DX12 games with CrossOver.
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
Can someone explain to me what that means?
This article from CodeWeavers should be helpful. They're working on DirectX 12 compatibility with CrossOver, but it's a substantial undertaking.

In general, Metal does tessellation differently, and is missing geometry shaders and transform feedback. Specific to DirectX 12 and Metal, there is an issue with limits on resources. Generally, games need access to at least one million shader resource views (SRVs). Access to that many SRVs requires resource binding at the Tier 2 level. Metal only supports about 500,000 resources per argument buffer, so Tier 2 resource binding isn’t possible. Metal’s limit of half a million is sufficient for Vulkan descriptor indexing, but not for D3D12. This limitation means CrossOver Mac can't support Tier 2 binding and therefore a lot of DirectX 12 games will not run.
I recommend the full article. It's short, and lists the challenges of translating DX12 to Metal. Keep in mind that they wrote this in December, so Metal 3 may have changed the equation. CodeWeavers and Feral have said that Apple listens to their feedback, so there may be some changes made specifically at their request.
 

theorist9

Power User
Posts
72
Reaction score
44
I'm surprised that they dropped A12/A12X, though. My 12.9" iPad Pro won't support Metal 3, and neither will iPhones X, XS, XR.

View attachment 14754
Interesting. It was opined on another thread that the reason Ventura limited support to Macs made 2017 and after was that Apple only wanted to support Macs with T2 chips. But that's not quite the case, since Ventura supports iMacs from 2017 forward, and they didn't get the T2 chip until 2020. So I'm wondering if the reason behind the cutoffs was instead that Ventura uses Metal 3 for its GUI, and thus requires a GPU that can support Metal 3.

Though that chart seems to have an omission or typo, since they're saying AMD GPUs must be at least Radeon Vega or 5000 series, but the Vega didn't appear on the iMac until 2019 (and even then only as the top-end option), and the 5000s didn't appear until 2020. The 2017 iMacs have the 500 series, and the 2019s mostly have the 500X series.
 
Last edited:

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
Interesting. It was opined on another thread that the reason Ventura limited support to Macs made 2017 and after was that Apple only wanted to support Macs with T2 chips. But that's not quite the case, since Ventura supports iMacs from 2017 forward, and they didn't get the T2 chip until 2020. So I'm wondering if the reason behind the cutoffs was instead that Ventura uses Metal 3 for its GUI, and thus requires a GPU that can support Metal 3.
My thoughts were that maybe the reasons behind that decision were not technical at all. If they're planning on dropping Intel support altogether soon, it makes sense to start shortening the support of the older Macs to avoid people complaining because their 2015 MacBook and their 2020 MacBook both lost support on the same release, despite being 5 years apart.
 

Colstan

Power User
Vaccinated
Posts
216
Reaction score
272
So I'm wondering if the reason behind the cutoffs was instead that Ventura uses Metal 3 for its GUI, and thus requires a GPU that can support Metal 3.
Sometimes Apple's system requirements are inscrutable or arbitrary. Tiger dropped support for Macs without FireWire. That was hardly a vital feature, but was an easy cutoff point.
 

Andropov

Power User
Vaccinated
Posts
223
Reaction score
160
Location
Spain
BTW, watched the MetalFX Upscaling talk, and the temporally antialiased version looks awesome.
 

leman

Power User
Posts
76
Reaction score
176
Can someone explain to me what that means?

How detailed do you want it? I could try to do a write-up :)

According to a game dev posting on Andrew Tsai's video about Metal 3:

In general, Metal does tessellation differently, and is missing geometry shaders and transform feedback. Specific to DirectX 12 and Metal, there is an issue with limits on resources. Generally, games need access to at least one million shader resource views (SRVs). Access to that many SRVs requires resource binding at the Tier 2 level. Metal only supports about 500,000 resources per argument buffer, so Tier 2 resource binding isn’t possible. Metal’s limit of half a million is sufficient for Vulkan descriptor indexing, but not for D3D12. This limitation means CrossOver Mac can't support Tier 2 binding and therefore a lot of DirectX 12 games will not run.

Apparently, this is the biggest issue that CodeWeavers has to deal with in supporting DX12 games with CrossOver.

I got very curious about this and did some testing... I have zero problem creating, compiling and correctly using a shader with up to five million resource attachment points on my M1 Max — I just didn't test with more than that, but it will probably work. I have a suspicion that the 500,000 object limit has been misunderstood (not that Apple's vague documentation is of any help anyway). It seems that 500,000 resources is the maximum an application can have in use at any time. It does not refer to the number of slots/bindings per argument buffer, which only appears to be restricted by the available memory. This is further corroborated by the fact that my app blocks when I try to create hundreds of thousands of very small GPU buffers, but works without issues if I create a smaller number of larger buffers and use offsets to bind their subsets to the attachment points.

I will need to run some tests with textures as well, but yeah, at least on my M1 Max hardware there is no hard limit. Of course, I don't know whether it is intended behaviour or not.
 

Cmaier

Elite Member
Staff Member
Vaccinated
Site Donor
Posts
2,797
Reaction score
3,788
How detailed do you want it? I could try to do a write-up :)



I got very curious about this and did some testing... I have zero problem creating, compiling and correctly using a shader with up to five million resource attachment points on my M1 Max — I just didn't test with more than that, but it will probably work. I have a suspicion that the 500,000 object limit has been misunderstood (not that Apple's vague documentation is of any help anyway). It seems that 500,000 resources is the maximum an application can have in use at any time. It does not refer to the number of slots/bindings per argument buffer, which only appears to be restricted by the available memory. This is further corroborated by the fact that my app blocks when I try to create hundreds of thousands of very small GPU buffers, but works without issues if I create a smaller number of larger buffers and use offsets to bind their subsets to the attachment points.

I will need to run some tests with textures as well, but yeah, at least on my M1 Max hardware there is no hard limit. Of course, I don't know whether it is intended behaviour or not.
What is a resource attachment point?
 