Will AMX disappear?

Jimmyjames · Feb 16, 2024

So everyone’s favourite Linux dev…well nearly everyone, just posted on the macgaming subreddit. The topic is Asahi Linux OpenGL 4.6 conformance. Among the posts about how it’s going to revolutionise Mac gaming, someone wondered if games that use AVX instructions will be able to run. Someone suggested the devs should somehow map AVX instructions to AMX. It was then suggested that:

The key part for me being the “Apple chips will almost certainly drop support for AMX in favour of SVE..”.

This would be surprising to me. I can’t imagine they didn’t know that SVE would come eventually. Would they really leave AMX for SVE? Thoughts.

mr_roboto · Feb 16, 2024

I dunno, I think Hector missed the mark on this prediction. SVE and AMX don't do quite the same things.

dada_dave · Feb 16, 2024

I have to admit I'm a little confused ...

1) It is entirely reasonable that Marcan and the Asahi devs would not include support for undocumented instructions like AMX in the Linux Kernel. They will undoubtedly support other undocumented accelerators (like the media engine) but as with the GPU that will be through drivers, not direct instruction support.

2) I would've thought that the very well documented and very ARM NEON instructions would be the most natural way to emulate AVX, even AVX-512.

Jimmyjames said:
The key part for me being the “Apple chips will almost certainly drop support for AMX in favour of SVE..”.

This would be surprising to me. I can’t imagine they didn’t know that SVE would come eventually. Would they really leave AMX for SVE? Thoughts.

3) SVE is the replacement for NEON not AMX. In fact in ARM v9 it is SVE2. The corollary in ARMv9 to AMX is SME.

By "I can’t imagine they didn’t know that SVE would come eventually." I assume the "they" is Apple? If so, then yes, and probably for years before it was announced. I believe AMX has been reverse engineered but I do not know how close SME and AMX are. If they are similar, that wouldn't surprise me given the close relationship between ARM and Apple. But even if they aren't, Hector is right that AMX instructions could relatively easily be replaced by SME if Apple wanted to since the AMX instructions are undocumented and accessible only through Apple's Accelerate framework. Thus, the only thing that would need to be recoded is Accelerate. That is in fact the whole reason why, if an ISA vendor creates its own custom extensions, ARM demands they remain undocumented.

That said we don't know when Apple will adopt ARM v9. They may choose not to do so (at least not soon) as the main upgrades relevant for them would be the Matrix and Vector extensions and they may decide that NEON and AMX serve them well enough. The other major upgrade in ARM v9 is confidential computing and while Apple is quite concerned with security on its chips, my lay person's impression is that confidential computing is mostly beneficial to server chips. This would explain why Apple has not been in a rush to adopt ARM v9 like they were with ARM v8. Still, maybe it'll come with the next generation of chips.

mr_roboto said:
I dunno, I think Hector missed the mark on this prediction. SVE and AMX don't do quite the same things.

I agree that Hector has seemingly confused the two, possibly due to the context of the rest of the conversation. However, AMX may disappear if Apple adopts ARM v9, in which case it'll be replaced by SME, not SVE2.

Jimmyjames · Feb 16, 2024

mr_roboto said:
I dunno, I think Hector missed the mark on this prediction. SVE and AMX don't do quite the same things.

Yes. I’m certainly no expert but that was my understanding.

Jimmyjames · Feb 16, 2024

dada_dave said:
I have to admit I'm a little confused ...

1) It is entirely reasonable that Marcan and the Asahi devs would not include support for undocumented instructions like AMX in the Linux Kernel. They will undoubtedly support other undocumented accelerators (like the media engine) but as with the GPU that will be through drivers, not direct instruction support.

Yes. I certainly didn’t mean to imply that they should support AMX or indeed any particular proprietary ASi ‘bit’.

dada_dave said:
2) I would've thought that the very well documented and very ARM NEON instructions would be the most natural way to emulate AVX, even AVX-512.

Good point.

dada_dave said:
3) SVE is the replacement for NEON not AMX. In fact in ARM v9 it is SVE2. The corollary in ARMv9 to AMX is SME.

By "I can’t imagine they didn’t know that SVE would come eventually." I assume the "they" is Apple? If so, then yes, and probably for years before it was announced. I believe AMX has been reverse engineered but I do not know how close SME and AMX are. If they are similar, that wouldn't surprise me given the close relationship between ARM and Apple. But even if they aren't, Hector is right that AMX instructions could relatively easily be replaced by SME if Apple wanted to since the AMX instructions are undocumented and accessible only through Apple's Accelerate framework. Thus, the only thing that would need to be recoded is Accelerate.

Yes, I meant Apple would know SVE would come eventually.

dada_dave · Feb 16, 2024

Jimmyjames said:
Yes. I certainly didn’t mean to imply that they should support AMX or indeed any particular proprietary ASi ‘bit’.

Aye that was mostly to the Redditors, like why would they jump to AMX when NEON is right there? Because of AVX 512? I mean it's true that Apple's NEON extensions are 128bit, but properly encoded they are quite effective and if I remember right there are 4 of them in a performance core.

Jimmyjames said:
Good point.

Yes, I meant Apple would know SVE would come eventually.

Yup. The advantage of SVE and SVE2 is mostly in how flexible they are. They may also be faster/more efficient for all I know, but their flexibility is why ARM adopted them. If one CPU core had a 128bit vector processor and another had a 256bit vector processor then theoretically code would not have to be rewritten to be run on both. In practice you may still need tweaks to optimize your performance - and if you are using vector extensions explicitly in your code as opposed to letting the compiler handle it you probably want that level of optimization. It might also benefit compiler writers who want to auto-vectorize code when possible.
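A minimal sketch of that "write once, run on any vector width" idea, in plain Python rather than real SVE intrinsics. The function name `vla_add` and its parameters are made up for illustration; the per-lane predicate only loosely mimics how SVE masks off lanes past the end of an array.

```python
def vla_add(a, b, vector_bits, element_bits=32):
    """Add two equal-length lists the way a vector-length-agnostic loop would.

    vector_bits is the (hypothetical) hardware vector width; the loop body
    never hard-codes it, it only asks how many lanes fit.
    """
    assert len(a) == len(b)
    lanes = vector_bits // element_bits  # lanes per hardware vector
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: True for lanes that still fall inside the array.
        active = [i + lane < len(a) for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # step by whatever the hardware width happens to be
    return out


a = list(range(10))
b = [10 * x for x in a]

# Same source code, two different "hardware" widths, same answer.
result_128 = vla_add(a, b, vector_bits=128)
result_256 = vla_add(a, b, vector_bits=256)
```

The 128-bit run needs three passes over the ten elements and the 256-bit run needs two, which is roughly where the "you may still need tweaks" caveat comes from: the code is portable, the performance is not automatically identical.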

But that’s all largely irrelevant, to the point that I’m not sure why I wrote it all, as again it’s SME that would replace AMX if Apple were to adopt ARM v9 (unless SME is a superset of AMX, and even if it isn’t, AMX is again only accessible through a framework).

So Hector may still be right on that point, just with a different extension.

Whether Apple adopts ARM v9 will largely depend on how much they value SVE2 vs NEON and SME vs AMX. (And for the record ARM says that any hardware that supports SVE2 will support NEON so in neither case is backwards compatibility an issue … not that the lack of such has always bothered Apple in the past!)

I also wonder if there is a Linux equivalent to the Accelerate framework where it might make sense to write a driver for the AMX? I dunno, that still might be too close to writing undocumented extensions into the kernel. I’m not sure quite where the distinction lies between supporting the media engine, which I think he wants to do, and the matrix engine. Possibly how close to the kernel you have to write the code to support the extensions?

leman · Feb 17, 2024

Jimmyjames said:
This would be surprising to me. I can’t imagine they didn’t know that SVE would come eventually. Would they really leave AMX for SVE? Thoughts.

I certainly hope so. That, or they stabilize AMX in some form or fashion and publish the instruction set, so that developers can leverage it in their own code. AMX units are great, but the closed nature of the ISA severely limits their utility.

dada_dave said:
3) SVE is the replacement for NEON not AMX. In fact in ARM v9 it is SVE2. The corollary in ARMv9 to AMX is SME.

SME is an extension of SVE, so I suppose this is what Hector was referring to. With SME, SVE has two operating modes: "regular" (which is a latency-optimized extension to NEON, just as you say), and a "streaming" mode, which models a throughput-optimized coprocessor with a wider vector length and different feature set.

For a long time I've had a suspicion that SVE streaming mode/SME is modeled after Apple AMX. The basic streaming mode profile matches AMX functionality more or less precisely. And the recent updates to SME add some functionality exclusively present in AMX (like lookup table instructions). They are still not an exact match, but the overlap is quite substantial. I hope this is evidence that Apple is in fact working on SVE/SME support (even if much delayed), and that they will release it at some point. At any rate, as I've said, they should at least stabilize some form of coprocessor interface so that we can take advantage of it in our code. Apple is large enough to roll their own instructions if they want to.

leman · Feb 17, 2024

Jimmyjames said:
someone wondered if games that use AVX instructions will be able to run. Someone suggested the devs should somehow map AVX instructions to AMX.

BTW, this doesn't make any sense. There is no intersection between AVX and AMX, and trying to emulate AVX with AMX would carry an insane performance penalty. 256-bit AVX instructions can be easily implemented with two NEON passes, no need to make it too complicated.
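A rough sketch of the "two NEON passes" idea, modelled in Python with lists standing in for registers. The helper names `neon_add_f32x4` and `avx_add_f32x8` are hypothetical, and this only covers a simple lane-wise add; instructions that move data across the 128-bit boundary would presumably need more work than a plain split.

```python
NEON_LANES = 4  # four 32-bit floats fit in one 128-bit NEON register
AVX_LANES = 8   # eight 32-bit floats fit in one 256-bit AVX register


def neon_add_f32x4(x, y):
    """Stand-in for a single 128-bit NEON add over four float lanes."""
    assert len(x) == len(y) == NEON_LANES
    return [p + q for p, q in zip(x, y)]


def avx_add_f32x8(x, y):
    """Emulate one 256-bit AVX add as two NEON passes: low half, then high half."""
    assert len(x) == len(y) == AVX_LANES
    low = neon_add_f32x4(x[:NEON_LANES], y[:NEON_LANES])
    high = neon_add_f32x4(x[NEON_LANES:], y[NEON_LANES:])
    return low + high  # concatenate the two 128-bit halves


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.5] * AVX_LANES
result = avx_add_f32x8(x, y)
```

The two halves are independent of each other, so a core with several NEON pipes could in principle issue both passes side by side.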

mr_roboto · Feb 17, 2024

OK, I've come around - if SME was what Hector was talking about, it seems like a reasonable prediction that Apple could adopt SME to replace AMX. I had heard vague things about Arm's plans to formalize a matrix math extension but not that it was going to be named similarly to and treated as an extension of SVE.

leman · Feb 17, 2024

mr_roboto said:
OK, I've come around - if SME was what Hector was talking about, it seems like a reasonable prediction that Apple could adopt SME to replace AMX. I had heard vague things about Arm's plans to formalize a matrix math extension but not that it was going to be named similarly to and treated as an extension of SVE.

SME was released a few years ago, we are now at revision 2.1. I don't think there is any actual hardware that implements these instructions though. BTW, Dougall Johnson (who has also reverse-engineered the Apple GPU ISA) has created an amazing resource for these instructions: https://dougallj.github.io/asil/

Cmaier · Feb 17, 2024

I really hate this alphabet soup. Always have.

dada_dave · Feb 17, 2024

leman said:
I certainly hope so. That, or they stabilize AMX in some form or fashion and publish the instruction set, so that developers can leverage it in their own code. AMX units are great, but the closed nature of the ISA severely limits their utility.

I don’t think they can unless they adopt SME. I believe that under agreement with ARM, vendor created extensions have to be hidden so as not to interfere with ARM’s development or fracture the ISA for software/compiler development.

leman said:
SME is an extension of SVE, so I suppose this is what Hector was referring to.

Interesting, did not know that!

leman said:
With SME, SVE has two operating modes: "regular" (which is a latency-optimized extension to NEON, just as you say), and a "streaming" mode, which models a throughput-optimized coprocessor with a wider vector length and different feature set.

For a long time I've had a suspicion that SVE streaming mode/SME is modeled after Apple AMX. The basic streaming mode profile matches AMX functionality more or less precisely. And the recent updates to SME add some functionality exclusively present in AMX (like lookup table instructions). They are still not an exact match, but the overlap is quite substantial. I hope this is evidence that Apple is in fact working on SVE/SME support (even if much delayed), and that they will release it at some point.

Very good, does sound like it then. Maybe next generation …

leman said:
At any rate, as I've said, they should at least stabilize some form of coprocessor interface so that we can take advantage of it in our code. Apple is large enough to roll their own instructions if they want to.

Indeed although again I believe under agreement with ARM they can’t unless it’s ARM’s. Makes sense: ARM doesn’t want the ISA itself to split the community.

leman said:
BTW, this doesn't make any sense. There is no intersection between AVX and AMX, and trying to emulate AVX with AMX would carry an insane performance penalty. 256-bit AVX instructions can be easily implemented with two NEON passes, no need to make it too complicated.

Yeah that was weird.

Nycturne · Feb 17, 2024

Cmaier said:
I really hate this alphabet soup. Always have.

Shortly out of college, I ran into someone who made the comment “our company has way too many TLAs” and it stuck with me. At the time, engineers loved their initialisms for whatever project they were working on to avoid calling it by the full name which was a mouthful, while marketing was producing some pretty terrible overwrought product names that you couldn’t help but mock precisely because they were a mouthful.

Today, at the same company, marketing is producing less overwrought names. Yet, I’m still having to ask about various initialisms I run across that I never heard of before with every project. It seems like one behavior engineers like to indulge in. Gotta have their buzzwords that make people ask, “What is that?” I guess…

Yoused · Feb 17, 2024

These S- architectures can, theoretically, take up a great deal of internal space. I wonder if Apple is planning on migrating the "dynamic cache" design over to the CPU cores.

Cmaier · Feb 17, 2024

Yoused said:
These S- architectures can, theoretically, take up a great deal of internal space. I wonder if Apple is planning on migrating the "dynamic cache" design over to the CPU cores.

Aren’t all cpu caches dynamic caches? Or are you referring to dynamically allocating a pool of compute resources between cores?

Yoused · Feb 17, 2024

Cmaier said:
Aren’t all cpu caches dynamic caches?

There was talk of an improved GPU architecture in M3, that was called "dynamic cache", or something like that, wherein, AIUI, part of the L1 was fenced off for register file use. And, if you are handling enormous SVE vectors (the max size for the SVE spec is half a Neon register file), it might make sense to use the cache: when you store a register, it just gets switched from internally reserved to a dirty cache line awaiting L2/3 sync.
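The "half a Neon register file" remark can be checked with quick arithmetic. The figures below are assumptions taken from the Arm architecture as generally described, not from this thread: 32 NEON registers of 128 bits each, a 2048-bit maximum SVE vector length, and 32 SVE vector registers.

```python
# Assumed architectural figures (not stated in the thread itself).
NEON_REGISTERS = 32
NEON_REGISTER_BITS = 128
SVE_MAX_VECTOR_BITS = 2048
SVE_REGISTERS = 32

# Whole NEON register file, in bits.
neon_file_bits = NEON_REGISTERS * NEON_REGISTER_BITS

# One maximum-width SVE vector as a fraction of the NEON register file.
ratio = SVE_MAX_VECTOR_BITS / neon_file_bits

# Full SVE vector register file at maximum width, in bytes.
sve_file_bytes_at_max = SVE_REGISTERS * SVE_MAX_VECTOR_BITS // 8

print(neon_file_bits, ratio, sve_file_bytes_at_max)
```

Under those assumptions a single maximum-width SVE register is 256 bytes, and a full set of them would be 8 KiB, which gives a sense of why borrowing cache for register storage might look attractive.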
