Where do we think Apple Silicon goes from here?

Joelist

Power User
Posts
177
Reaction score
168
Hi!

Apple Silicon is upon us and is ridiculously performant and efficient. We know the microarchitectural reasons for this (the microarchitecture, much more than the ISA, is THE reason), so the question is: where do you think they go from here?

1) More decoders / ALUs? At 8-wide decode, the AS pipeline is already extremely wide. The real question is how wide they actually want to go. Above a certain point the returns diminish, because there simply isn't enough independent work in a single thread to keep all the ALUs and decoders busy.
2) More cores? I think the reasoning for the decoders / ALUs applies here too. Something like 128 P cores sounds way cool, but on a desktop / laptop does it really make sense outside of bragging rights?
3) More specialist processing blocks? This is an area I think we will see expansion in. The M1 Pro and Max effectively now have a built-in Afterburner card with their specialist custom encoders and decoders. I expect Apple is already looking at all the jobs a desktop/laptop can be asked to do, to see which bottlenecks can be offloaded to specially designed blocks. (A hedged sketch of what that offload looks like from the software side follows this list.)
4) Will Apple start overclocking the memory? Given the crazy fast performance of everything, I suspect they are already running the RAM aggressively, but who knows?
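
To make item 3 concrete, here's a minimal sketch (my own, not Apple's code) of what offloading to a specialist block looks like from the software side on macOS: VideoToolbox is asked for a hardware ProRes encoder, and on an M1 Pro/Max that request should land on the media engine. The 4K frame size is just an illustrative assumption, and whether a given session actually runs on the dedicated block is the OS's call, not this code's.

```swift
import CoreMedia
import VideoToolbox

// Hedged sketch: request a hardware ProRes encode session. On M1 Pro/Max this
// should be serviced by the dedicated media engine ("built-in Afterburner"),
// but the routing decision belongs to the OS, not to this code.
var session: VTCompressionSession?

let encoderSpec: [CFString: Any] = [
    // Refuse to fall back to a software encoder, so failure is visible.
    kVTVideoEncoderSpecification_RequireHardwareAcceleratedVideoEncoder: true
]

let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 3840,                                  // illustrative 4K frame size
    height: 2160,
    codecType: kCMVideoCodecType_AppleProRes422,  // one of the offloaded codecs
    encoderSpecification: encoderSpec as CFDictionary,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil,                          // frames would be pushed later
    refcon: nil,
    compressionSessionOut: &session
)

if status == noErr, let session = session {
    print("Hardware encoder session created: \(session)")
} else {
    print("No hardware encoder available, status \(status)")
}
```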

What do you think?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
I suspect that long term they may add ALUs and compensate for the difficulty of filling them from a single thread by adding hyperthreading. Maybe in a scheme with three types of cores.
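
To put the "filling wide ALUs" problem in concrete terms, here's a purely illustrative Swift sketch (mine, not anything Apple-specific): a single accumulator is one long dependency chain that extra ALUs can't help with, while splitting the same loop into independent chains gives an out-of-order core parallel work to spread across its units. SMT attacks the same limit by borrowing independent work from a second thread instead.

```swift
// One accumulator = one dependency chain: each add waits on the previous one,
// so a very wide core's extra ALUs mostly sit idle in this loop.
func sumOneChain(_ xs: [Double]) -> Double {
    var acc = 0.0
    for x in xs { acc += x }
    return acc
}

// Four accumulators = four independent chains the out-of-order engine can
// issue in parallel. (Floating-point reassociation, so rounding may differ.)
func sumFourChains(_ xs: [Double]) -> Double {
    var a = 0.0, b = 0.0, c = 0.0, d = 0.0
    var i = 0
    while i + 4 <= xs.count {
        a += xs[i]; b += xs[i + 1]; c += xs[i + 2]; d += xs[i + 3]
        i += 4
    }
    while i < xs.count { a += xs[i]; i += 1 }   // remainder elements
    return (a + b) + (c + d)
}
```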
 

Yoused

up
Posts
5,600
Reaction score
8,891
Location
knee deep in the road apples of the 4 horsemen
More specialist processing blocks? This is an area I think we will see expansion in.
I agree with this. They will add features that make macOS/iOS run faster but might not be useful for other implementations, along with more generic helper logic stuff. In fact, I would not be surprised to see an increase in FPGA real estate for creating acceleration logic on the fly. And, of course, improvements to the GPU cores to make them more competitive with eGPU designs and expansion and improvement of the ML section.

Right now, it looks like the basic core logic may be plateauing for M series and perhaps for x86 as well. How much harder will they need to work if you can offload the heavy stuff and then gate those units off when they are not needed?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
I agree with this. They will add features that make macOS/iOS run faster but might not be useful for other implementations, along with more generic helper logic stuff. In fact, I would not be surprised to see an increase in FPGA real estate for creating acceleration logic on the fly. And, of course, improvements to the GPU cores to make them more competitive with eGPU designs and expansion and improvement of the ML section.

Right now, it looks like the basic core logic may be plateauing for M series and perhaps for x86 as well. How much harder will they need to work if you can offload the heavy stuff and then gate those units off when they are not needed?
I think they will push real hard on AI (both ends), and, of course, graphics has a ways to go still.
 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,677
Reaction score
8,958
Main Camera
iPhone
In fact, I would not be surprised to see an increase in FPGA real estate for creating acceleration logic on the fly.
My knowledge of M1 is not that deep, and thus I need some help in understanding this. Does that mean a portion of M1 functionality is currently implemented in general-purpose FPGA blocks that get programmed after fabrication? Thanks...
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
My knowledge of M1 is not that deep, and thus I need some help in understanding this. Does that mean a portion of M1 functionality is currently implemented in general-purpose FPGA blocks that get programmed after fabrication? Thanks...

I haven't read about an FPGA block on M1, and there doesn't appear to be one in the die photographs. I could be missing something, of course.
 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,677
Reaction score
8,958
Main Camera
iPhone
Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FPGA fabric added to the signal processing ASICs we developed, along with a way of programmatically inserting it between the various signal processing blocks that were part of the ASIC. This was so they could insert secret-sauce functions that their competitors (or others, via our datasheet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGAs. And if we did, I suspect Altera and Xilinx would have looked closely at whether that stepped on their IP.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FPGA fabric added to the signal processing ASICs we developed, along with a way of programmatically inserting it between the various signal processing blocks that were part of the ASIC. This was so they could insert secret-sauce functions that their competitors (or others, via our datasheet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGAs. And if we did, I suspect Altera and Xilinx would have looked closely at whether that stepped on their IP.

Only if they found out :)
 

Joelist

Power User
Posts
177
Reaction score
168
I think they will push real hard on AI (both ends), and, of course, graphics has a ways to go still.
M1 Max appears to perform close to or the same as the RTX 3080, at rather lower power consumption. In PPW terms M1 blows the 3080 out of the water.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
M1 Max appears to perform close to or the same as the RTX 3080, at rather lower power consumption. In PPW terms M1 blows the 3080 out of the water.

Yep, but I think they want to increase the per-GPU-core performance, and add hardware ray tracing.
 

mr_roboto

Site Champ
Posts
282
Reaction score
453
Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FPGA fabric added to the signal processing ASICs we developed, along with a way of programmatically inserting it between the various signal processing blocks that were part of the ASIC. This was so they could insert secret-sauce functions that their competitors (or others, via our datasheet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGAs. And if we did, I suspect Altera and Xilinx would have looked closely at whether that stepped on their IP.
Perhaps it wasn't available when you were working on those ASICs, but Achronix offers FPGA IP cores for integration into SoCs.

It's a company with an interesting history - their first products were discrete FPGAs built in Intel 22nm (iirc), but when Intel bought Altera it probably spelled doom for the Achronix-Intel relationship. After that they turned to this IP core idea, and AFAIK they've had some design wins with it. More recently they've made a return to selling discrete FPGAs, this time fabricating them in TSMC 7nm.

It's difficult for an FPGA startup to survive against the Xilinx-Altera/Intel duopoly, but somehow they've managed for quite some time, so either their investors have deep pockets or they've had some success living in the niches not well covered by the duopoly. At work we've given some thought to their Speedster7t FPGAs as they have what I feel is a better internal architecture for ML inference acceleration than the weird, kinda tacked-on approach Xilinx took, but I don't actually have either one in hand to evaluate just yet.
 

chengengaun

Slightly Confused
Site Donor
Posts
78
Reaction score
225
3) More specialist processing blocks? This is an area I think we will see expansion in. The M1 Pro and Max effectively now have a built-in Afterburner card with their specialist custom encoders and decoders. I expect Apple is already looking at all the jobs a desktop/laptop can be asked to do, to see which bottlenecks can be offloaded to specially designed blocks.
My side question is, what is the implication of more extensive use of such specialist processing blocks? I guess they cannot be easily updated (like software) and so might have some impact on longevity in terms of software support. I guess it also makes raw benchmark numbers less relevant unless the benchmarks are more application-specific.
 

Yoused

up
Posts
5,600
Reaction score
8,891
Location
knee deep in the road apples of the 4 horsemen
My side question is, what is the implication of more extensive use of such specialist processing blocks?
Apple tends to be fairly thorough in researching real-world use. They will implement dedicated blocks judiciously, where they will provide the best gains.

Personally, I would like to see them spin off an almost-as-good-as-M-but-better-than-everyone-else CPU vendor, to increase interest in ARM-based systems.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
My side question is, what is the implication of more extensive use of such specialist processing blocks? I guess they cannot be easily updated (like software) and so might have some impact on longevity in terms of software support. I guess it also makes raw benchmark numbers less relevant unless the benchmarks are more application-specific.

Such blocks perform useful functions, but not entire algorithms; they provide atomic operations that software can build on. Usually, at least. So software improvements are still possible.
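
As an illustration of that software contract (my sketch, and no claim about which hardware Apple routes this particular call to): the caller just asks Accelerate for a dot product, and Apple is free to implement that primitive on plain NEON, on a dedicated block, or on whatever comes next, without the calling code changing.

```swift
import Accelerate

// The software-visible "atomic function" is a dot product; which silicon
// services it is Apple's choice and can improve underneath unchanged code.
let a = [Double](repeating: 1.5, count: 1_024)
let b = [Double](repeating: 2.0, count: 1_024)

var dot = 0.0
vDSP_dotprD(a, 1, b, 1, &dot, vDSP_Length(a.count))
print(dot)   // 3072.0
```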
 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,677
Reaction score
8,958
Main Camera
iPhone
Perhaps it wasn't available when you were working on those ASICs, but Achronix offers FPGA IP cores for integration into SoCs.

It's a company with an interesting history - their first products were discrete FPGAs built in Intel 22nm (iirc), but when Intel bought Altera it probably spelled doom for the Achronix-Intel relationship. After that they turned to this IP core idea, and AFAIK they've had some design wins with it. More recently they've made a return to selling discrete FPGAs, this time fabricating them in TSMC 7nm.

It's difficult for an FPGA startup to survive against the Xilinx-Altera/Intel duopoly, but somehow they've managed for quite some time, so either their investors have deep pockets or they've had some success living in the niches not well covered by the duopoly. At work we've given some thought to their Speedster7t FPGAs as they have what I feel is a better internal architecture for ML inference acceleration than the weird, kinda tacked-on approach Xilinx took, but I don't actually have either one in hand to evaluate just yet.

That's interesting - thanks for the heads-up!

Here's a bit of a ramble, for historical context...

The timeframe of our full-custom ASIC business was the early 1990s to mid 2000s, which dovetailed nicely with cellular telecom infrastructure (basestation) providers becoming aware of the benefits of pure digital radio architectures over conventional superhet analog radios and transmitters, especially when multi-antenna beamforming is employed and phase information is important. Before that, digital radio architectures were used mostly for defense-related systems (a field I previously worked in). One system I remember seeing from a competitor took a 6' rack of equipment, which, imo, was a huge kluge (I can expand on that, if interested). That was pretty much reduced to one of our chips and a high-speed A/D converter sampling a wideband IF.

When cellular infrastructure companies became aware of digital radio (sometimes called Software Defined Radio), FPGA manufacturers took notice. But at that point in time, implementing a digital radio in an FPGA fell far short in performance - in sample rate, in generating complex (sin/cos) digital oscillators, digital mixers, digital filters, etc. - even in their fastest FPGAs, which were very costly and sucked a lot of power. A lot of that had to do with implementation, and with not being expert in digital signal processing for communications systems. I think many thought simply reading Oppenheim and Schafer's digital signal processing book was all that was necessary, and weren't aware of the architectural tricks and various optimizations.

One FPGA company wanted to "collaborate" with us. IIRC, it was (ostensibly) to use our tech and expertise to create a processing core for use in their FPGAs, which would help them market to cellular infrastructure providers. In reality, I think it was to learn our architecture tricks and communications/signal processing knowledge - and then go off on their own. We parted ways after a couple of meetings. Eventually, as FPGAs became faster and used less power (basestation operating costs were always a huge concern), there came a point where FPGAs became feasible for that market - around 2010 or so (I left in 2004/5). I assume that's what's used today.
 

mr_roboto

Site Champ
Posts
282
Reaction score
453
One FPGA company wanted to "collaborate" with us. IIRC, it was (ostensibly) to use our tech and expertise to create a processing core for use in their FPGAs, which would help them market to cellular infrastructure providers. In reality, I think it was to learn our architecture tricks and communications/signal processing knowledge - and then go off on their own. We parted ways after a couple of meetings. Eventually, as FPGAs became faster and used less power (basestation operating costs were always a huge concern), there came a point where FPGAs became feasible for that market - around 2010 or so (I left in 2004/5). I assume that's what's used today.
I've never been in the software defined radio world, but it's certainly a major focus for Xilinx. You see it pop up in their marketing materials all the time.

Xilinx relies a great deal on DSP48, a 27x18 multiplier with a 48-bit accumulator and some other tricks. It's a macrocell connected to the signal routing matrix just like the lookup tables and flops used for general purpose logic. They're good for 500 MHz if you use all the pipeline stages (you don't have to, you can choose to bypass the flops for reduced latency at a slower clock speed if you like). Some of the bigger FPGAs have several thousand of them, so theoretical multiply-accumulate throughput is in the range of teraops/sec.
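
A quick back-of-the-envelope check on that teraops figure; the 4,000-slice count below is just an assumed round number for a large part, not a datasheet value.

```swift
// Rough DSP48 throughput estimate: slices x clock x (multiply + accumulate).
let dspSlices   = 4_000.0   // assumed count, standing in for "several thousand"
let clockHz     = 500e6     // fully pipelined DSP48 clock from the post
let opsPerCycle = 2.0       // one multiply and one accumulate per slice per cycle

let teraops = dspSlices * clockHz * opsPerCycle / 1e12
print("\(teraops) teraops/sec")   // 4.0 - consistent with "range of teraops/sec"
```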

It looks like 2004 was about when Xilinx began shipping Virtex-4, their first FPGA family with DSP48, so that would've been the beginning of the end for dedicated ASICs. Prior to DSP48 it would've been hopeless to do much DSP in general purpose FPGA fabric - even a single multiplier would chew up a lot of fabric resources.
 

tomO2013

Power User
Posts
98
Reaction score
175
Great question :)

I feel we can make reasonable deductions about Apple's future approach based on their past strategy.

We have seen Apple favour widening the pipeline over outright ratcheting up clock speeds, so I'd agree with other commentators that they will likely add ALUs and widen it further.
The neural engine and GPU are very likely to get both a greater core count and hardware-accelerated 'level 5' Ray Tracing support.
My money is on the inclusion of PowerVR's Photon IP (https://www.imaginationtech.com/whitepapers/the-powervr-photon-architecture/).

Apple signed an agreement with Imagination Technologies (https://www.imaginationtech.com/news/press-release/imagination-and-apple-sign-new-agreement/) in 2020. Possibly we'll see the fruits of this agreement in an M3-derived Mac?

Actually… scratch that last comment. @Cmaier, from your AMD days: after an agreement such as this is signed, what is the typical lead time (with modern tooling) to realize the value of the licensing deal through an implementation of that IP in shipping silicon?

One has to imagine it's one thing to get an IP deal in place, but another thing entirely for Apple's (experienced) GPU team to implement and integrate that IP into Apple's silicon. Thinking out loud as a layperson, there must be time consumed taping out, testing, rinse, repeat, etc… what is your estimate of when we could see the fruits of this licensing agreement on the GPU side, if we are not already seeing some today?


(P.S. On a total side note, it's really, really nice to have this place to discuss technology, Apple Silicon, and ARM instead of the dumpster fire taking place over at another place.) :D
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,297
Reaction score
8,456
Great question :)

I feel we can make reasonable deductions about Apple's future approach based on their past strategy.

We have seen Apple favour widening the pipeline over outright ratcheting up clock speeds, so I'd agree with other commentators that they will likely add ALUs and widen it further.
The neural engine and GPU are very likely to get both a greater core count and hardware-accelerated 'level 5' Ray Tracing support.
My money is on the inclusion of PowerVR's Photon IP (https://www.imaginationtech.com/whitepapers/the-powervr-photon-architecture/).

Apple signed an agreement with Imagination Technologies (https://www.imaginationtech.com/news/press-release/imagination-and-apple-sign-new-agreement/) in 2020. Possibly we'll see the fruits of this agreement in an M3-derived Mac?

Actually… scratch that last comment. @Cmaier, from your AMD days: after an agreement such as this is signed, what is the typical lead time (with modern tooling) to realize the value of the licensing deal through an implementation of that IP in shipping silicon?

One has to imagine it's one thing to get an IP deal in place, but another thing entirely for Apple's (experienced) GPU team to implement and integrate that IP into Apple's silicon. Thinking out loud as a layperson, there must be time consumed taping out, testing, rinse, repeat, etc… what is your estimate of when we could see the fruits of this licensing agreement on the GPU side, if we are not already seeing some today?


(P.S. On a total side note, it's really, really nice to have this place to discuss technology, Apple Silicon, and ARM instead of the dumpster fire taking place over at another place.) :D

We never licensed anything, so I don’t know :). And Apple may have just signed the agreement in order to avoid a patent infringement suit. We have no idea.

That said, if someone handed me a "design" for something like a graphics core - perhaps even just a spec for it - it would probably be around 18 months from start to finish. Less if they provided more details (like a netlist).
 

Andropov

Site Champ
Posts
615
Reaction score
773
Location
Spain
1) More decoders / ALUs? At 8-wide decode, the AS pipeline is already extremely wide. The real question is how wide they actually want to go. Above a certain point the returns diminish, because there simply isn't enough independent work in a single thread to keep all the ALUs and decoders busy.
I wonder if the number of architectural registers (which I believe is one of the few things Apple can't change, since it's fixed by the ISA) would become a limitation when trying to go wider than 8 lanes? I guess they could add SMT to use the extra ALUs anyway, but then it wouldn't do much to improve single-core performance.

2) More cores? I think the reasoning for the decoders / ALUs applies here too. Something like 128 P cores sounds way cool, but on a desktop / laptop does it really make sense outside of bragging rights?
There are a non-negligible number of tasks that are almost infinitely parallelizable, so yes. Anything stochastic in nature, for example, can likely benefit from being run on N cores in parallel and gathering all those extra statistics. Whether such tasks are common enough among the target users for Apple to push the core count beyond a certain point is a different matter, though. Probably not on the consumer chips.
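
As a hedged sketch of the kind of workload I mean (a toy Monte Carlo estimate of pi, written for illustration): every sample is independent, so the work scales across however many cores the machine has with essentially no coordination.

```swift
import Foundation

// Monte Carlo estimate of pi: samples are independent, so the work splits
// across all available cores with one cheap merge per core at the end.
func estimatePi(samplesPerCore: Int, cores: Int) -> Double {
    let lock = NSLock()
    var totalHits = 0
    DispatchQueue.concurrentPerform(iterations: cores) { _ in
        var localHits = 0
        for _ in 0..<samplesPerCore {
            let x = Double.random(in: -1...1)
            let y = Double.random(in: -1...1)
            if x * x + y * y <= 1 { localHits += 1 }
        }
        lock.lock()                 // merge once per core, not per sample
        totalHits += localHits
        lock.unlock()
    }
    return 4.0 * Double(totalHits) / Double(samplesPerCore * cores)
}

let cores = ProcessInfo.processInfo.activeProcessorCount
print(estimatePi(samplesPerCore: 1_000_000, cores: cores))
```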

Yep, but I think they want to increase the per-GPU-core performance, and add hardware ray tracing.
Hardware ray tracing is definitely on the horizon. I'm surprised the A15 doesn't have it. Fingers crossed for the A16.
 