M4 Rumors (requests for).

Jimmyjames

Site Champ
Posts
675
Reaction score
763
This time last year, we had already heard things about the M3. Much quieter this year afaik. Recently saw the rumor of a much improved Neural Engine. I wonder if this means it’ll be a small improvement, or perhaps the M4 will arrive next year.

Anyone heard anything?
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,150
Ha! With your title I was hoping YOU had something 🙃. Apart from the improved AI capabilities you already mentioned, no I haven’t heard anything. Presumably there will be a new iPhone SOC as there is every year, but still unclear exactly what the Mac SOC cadence will be - might also depend on new fab availability.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Ha! With your title I was hoping YOU had something 🙃. Apart from the improved AI capabilities you already mentioned, no I haven’t heard anything. Presumably there will be a new iPhone SOC as there is every year, but still unclear exactly what the Mac SOC cadence will be - might also depend on new fab availability.
I wish I had something! Suffering withdrawal from a lack of silicon information. I’ll take anything, even wild speculation.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Since there doesn’t seem to be much in the way of rumours, I’m going to list my wishes for A18/M4. They are:

1) CPU. A larger IPC increase than A16->A17. A17/M3 was a nice increase in performance, but iirc that was mostly a result of N5P to N3B. It would be great if this gen concentrates on IPC increases, if that is possible.

2) CPU. An ability to separate desktops from laptops. Does the Studio really need E cores? It would be preferable to have an all P-core desktop chip. Perhaps also the ability to scale frequency higher.

3) In terms of GPU improvements, the M3 has seemingly laid a foundation for future improvements. Hopefully they can now step on the gas. I am intrigued by the possibility of allowing the ALUs to dual-issue FP32, dual FP16 or dual INT, rather than the any-two-of-the-three that the M3 currently employs. Even more enticing would be any three simultaneously!

4) New media engines. The current ones have served the M series well, but they are three years old now and Nvidia has largely caught up in speed while surpassing them in quality.

Thoughts?
 

exoticspice1

Site Champ
Posts
298
Reaction score
101
1) CPU. A larger IPC increase than A16->A17. A17/M3 was a nice increase in performance, but iirc that was mostly a result of N5P to N3B. It would be great if this gen concentrates on IPC increases. If that is possible
Over at Anandtech, Gerard Williams said he worked on Apple Silicon from A7 to A14 and M1.
There have been slowdowns since A15. A16 was barely an improvement; performance increased due to node and higher clock. It's the same with A17.

https://forums.anandtech.com/threads/apple-silicon-soc-thread.2587205/post-41161790

3) In terms of gpu improvements. the M3 has seemingly laid a foundation for future improvements. Hopefully they can now step on the gas. I am intrigued by the possibility of allowing the ALUs to dual issue FP32, dual FP16 or dual INT, rather than the any-two-of-the-three, that the M3 currently employs. Even more enticing would be any three simultaneously!
I am more hopeful here. Apple's GPU team can make improvements here considering the recent hires.
4) New media engines. The current ones have served the M series well, but they are three years old now and Nvidia has caught up in speed largely while surpassing them in quality.
Apple needs to add AV1 encode to the media engine in their Mac chips. If Apple cares about streaming, it's good to add it.
 

Yoused

up
Posts
5,624
Reaction score
8,943
Location
knee deep in the road apples of the 4 horsemen
Does the Studio really need E cores? It would be preferable to have an all P-core desktop chip.
Yes, really, it does need them. If even only for housekeeping, E cores take a load off of the P cores. If you have a major job running that is going to take a while on the P cores, the E cores can handle whatever you are doing in the meantime without pushing so much heat onto the chip. And really, they have been getting much better with each gen. Apple probably has a new trick up their sleeve for M4 that no one is expecting.

I suspect M4 will be on N3P (skipping over N3E), so it will be a lot like M1->M2 type advance. M5 is the one to watch out for. That will probably be on N2, and it will be able to control an entire starship.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,150
Since there doesn’t seem to be much in the way of rumours, I’m going to list my wishes for A18/M4. They are:

1) CPU. A larger IPC increase than A16->A17. A17/M3 was a nice increase in performance, but iirc that was mostly a result of N5P to N3B. It would be great if this gen concentrates on IPC increases. If that is possible

I remember that you posted the following analysis that the IPC increases might be larger than was immediately apparent:


I have to admit though I have reservations about it which I expressed at the time - some of the methods, results, and conclusions were odd to me - chiefly if memory serves that the power/clock speeds reported in the tests were very low. I know that the main thrust of his argument was that the peak frequencies are hardly ever reached thus the cores were actually operating with higher IPC than you might think, but these seemed off still.

Over at Anandtech, Gerard Williams said he worked on Apple Silicon from A7 to A14 and M1.
There have been slow downs from A15. A16 was barely an improvement, performance increased due to node and higher clock. Its the same with A17.

https://forums.anandtech.com/threads/apple-silicon-soc-thread.2587205/post-41161790

I am extremely skeptical about the notion that Apple hit a wall because one man left the design team. It’s awfully reductive and facile. For one thing, E-cores have continued to improve generation-on-generation, and maintaining low power with those performance improvements is no easy engineering task. So clearly someone designing CPU cores is still showing up for work at Apple. Secondly, the M3 is the first actual new design and yeah, so far, apart from the aforementioned analysis, its P-core IPC improvements don’t appear great. It’s important to remember that either the M3 was supposed to be the M2, or the M2 was just going to be a stopgap until the N3 node was substantially delayed and the M2 became a full generation. Finally, the Nuvia core which GW3 worked on itself appears to be a reworked M1 with very similar characteristics. It’s not like his team came out of the gate and blew Apple away. If the loss of GW3 was really so impactful, then honestly we won’t see it for another few years, if Apple continues to struggle to advance their P-cores and Nuvia/Qualcomm overtake them. So far other explanations seem far more likely.

Partly what I think is going on is that Apple is stretched more thinly than they were in the past. True, they're catching up as they were able to release 3 SOCs at once this time around, but Apple doesn't necessarily have the resources to redesign and improve every piece of its SOC lineup every generation (especially these days, with an iPhone A-series generation updated every year). Add in node delays with designs tied to those nodes, and some things are going to get updated less often.

With regard to the CPU in particular, the E-cores have improved, but the P-cores, seemingly, have not in IPC. The issue with the P-cores in particular may be that the old tricks Apple used to get its spectacular rise in performance every generation (which was seemingly every year) simply stopped working all that well or delivered diminishing returns. To be reductive myself, they simply kept going wider, and past 8-wide decode that may not provide as much benefit. Supposedly the new A17/M3 P-cores are 9-wide and I believe ARM has an X core at 10-wide? But we know that's not the end-all be-all of even a wide design: if it were, and my memory about ARM being even wider is correct, then they would have the best core, and they don't. My main point, though, is that supposedly in code the average amount of parallel instructions per branch is about 8. Now depending on if that's median or mean and the characteristics of that distribution in (benchmarking) code, I would think that continuing to go wide might still provide benefits, but one can also see how one might hit a brick wall going wider and wider if there is simply less ILP to squeeze out in most code.

Finally it could also just be a blip. Apart from the analysis above that claims we're all just measuring things wrong, this could be simply an underwhelming generation in an otherwise promising direction that will get optimized and improved such that years from now we'll look back at forum discussions like this and shake our heads with a knowing smile. At the risk of repeating myself too many times: since the M1 we've only had two main generations (plus a third iPhone generation yes, but see note above for that), one of which was an optimized M1 likely because of the node delays. Given the pace of development of processors, this isn't something to worry about ... yet. Certainly not something to draw iron-clad conclusions from. Those things are best done in hindsight, not at the time, if at all.

3) In terms of gpu improvements. the M3 has seemingly laid a foundation for future improvements. Hopefully they can now step on the gas. I am intrigued by the possibility of allowing the ALUs to dual issue FP32, dual FP16 or dual INT, rather than the any-two-of-the-three, that the M3 currently employs. Even more enticing would be any three simultaneously!

4) New media engines. The current ones have served the M series well, but they are three years old now and Nvidia has caught up in speed largely while surpassing them in quality.

Thoughts?
I am more hopeful here. Apple's GPU team can make improvements here considering the recent hires.

Apple needs to AV1 Encode to the media engine in their Mac chips. If Apple cares about streaming its good to add it.

Most of my wished-for GPU improvements are covered in @Jimmyjames 's earlier thread. Since reviewing @leman 's post on the new Apple GPU L1 caches, I also think another area of improvement would be to increase their performance as well. However, that is no doubt an incredible challenge, since the usual tradeoff is that a larger cache is also a slower one: it just takes longer to find and output the right piece of information. The characteristics of the new L1 are really quite interesting, and very different from the usual GPU shared memory characteristics, including Apple's own previous ones.

2) CPU. An ability to separate desktops from laptops. Does the Studio really need E cores? It would be preferable to have an all P-core desktop chip. Perhaps also the ability to scale frequency higher.
Yes, really, it does need them. If even only for housekeeping, E cores take a load off of the P cores. If you have a major job running that is going to take a while on the P cores, the E cores can handle whatever you are doing in the mean time without pushing so much heat onto the chip. And really, they have been getting much better with each gen. Apple probably has a new trick up their sleeve for M4 that no one is expecting.

Agreed about E-cores. While a desktop system may not need a bunch, having them is a good thing. That said I agree with @Jimmyjames that overall a desktop oriented SOC wouldn’t be a bad thing. But I have to caveat my caveat here because, as I mentioned earlier, engineering is a finite resource and so focusing on mobile makes sense both for Apple and the overall market.

I suspect M4 will be on N3P (skipping over N3E), so it will be a lot like M1->M2 type advance. M5 is the one to watch out for. That will probably be on N2, and it will be able to control an entire starship.

Also N2 has GAA. M5 should be a nice uplift.

Agreed about M5/N2. Hopefully it will be on schedule. Backside power delivery on M6 will also be quite a nice boost.
 

leman

Site Champ
Posts
641
Reaction score
1,196
I remember that you posted the following analysis that the IPC increases might be larger than was immediately apparent:


I have to admit though I have reservations about it which I expressed at the time - some of the methods, results, and conclusions were odd to me - chiefly if memory serves that the power/clock speeds reported in the tests were very low. I know that the main thrust of his argument was that the peak frequencies are hardly ever reached thus the cores were actually operating with higher IPC than you might think, but these seemed off still.

Yeah, I agree. The main idea here was that IPC was calculated at peak frequency but it seems like the peak frequency is actually rarely achieved in practice. The main weakness of the argument is that even if we assume lower operating frequency, the IPC won't increase much. We are still talking about less than 4% IPC change at best.


To be reductive myself, they simply kept going wider and past 8-wide decode that may not provide as much benefit. Supposedly the new A17/M3 P-cores are 9-wide and I believe ARM has a X core at 10-wide?

What's interesting is that the 10-wide ARM design still has lower IPC than even A14/M1. It seems it is really hard to extract more performance at this level.

My main point is though that supposedly in code the average amount of parallel instructions per branch is about 8. Now depending on if that's median or mean and the characteristics of that distribution in (benchmarking) code I would think that continuing to go wide might still provide benefits, but one can also see how one might hit a brick wall going wider and wider if there is simply less ILP to squeeze out in most code.

Average ILP should be higher than 8 instructions, and branches are not a barrier for ILP with speculative execution. I can imagine that there are other difficulties to making a wide design in addition to ILP alone...
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,150
Average ILP should be higher than 8 instructions, and branches are not a barrier for ILP with speculative execution. I can imagine that there are other difficulties to making a wide design in addition to ILP alone...

Eight is the number I keep running across in quotes about code and ILP. I went to go find an example and naturally now I can’t, which is annoying because I just read it somewhere again. I think the reason they bring up branches in conjunction with this 8 ILP is because of the threat of rollback from misprediction. Even though prediction rates are well above 90% now, it still has a major effect on performance. In my non-expert opinion, I would have thought that even with such average code characteristics one could go wider without suffering from diminishing returns yet, but as I said I’ve seen this brought up multiple times as not necessarily a barrier but definitely a limiter. If I find such a quote I’ll post it here so we can discuss it. Perhaps I’m misunderstanding something. Which would not be surprising.

Hard agree though that there are undoubtedly more issues to designing wide chips than just this, especially given the ARM core results. So we’ll see what Apple and others are able to put together in future cores.
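To put rough numbers on the misprediction point, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the "one branch every 5 instructions" rule of thumb, the accuracy values, the window sizes, and the simplification that mispredictions are independent events. It is not a model of any real core.

```python
# Toy model (all parameters hypothetical): how often does a misprediction
# interrupt a stream of speculatively executed instructions?

def instructions_between_mispredicts(accuracy, branch_every=5):
    """Average instruction count between mispredicted branches.

    accuracy     -- fraction of branches predicted correctly (assumed)
    branch_every -- assumed average number of instructions per branch
    """
    return branch_every / (1.0 - accuracy)


def clean_window_probability(accuracy, window, branch_every=5):
    """Chance that a window of `window` instructions has no mispredict,
    treating every branch as an independent coin flip (a simplification)."""
    branches_in_window = window / branch_every
    return accuracy ** branches_in_window


for acc in (0.90, 0.95, 0.99):
    gap = instructions_between_mispredicts(acc)
    p100 = clean_window_probability(acc, window=100)
    print(f"accuracy {acc:.0%}: ~{gap:.0f} instructions between mispredicts, "
          f"{p100:.0%} chance a 100-instruction window is clean")
```

Under these assumptions, even 95% accuracy means a rollback roughly every 100 instructions, which is one way to see why misprediction still limits how much a very wide core can usefully keep in flight.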
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,330
Reaction score
8,523
Partly what I think is going on is that Apple is stretched more thinly than they were in the past.
I doubt this is the issue. The amount of work it takes to design the chip is a function of the number of transistors, not the performance.

I’m pretty sure what’s going on is that they analyzed the engineering and cost tradeoffs and successfully designed the chip they intended to design. Increasing IPC at this point is tough - not a lot of low-hanging fruit now that they’ve gone so wide. If they go twice as wide as they are now, they may gain only 10 percent IPC but burn 30 percent more power. They could hyper thread and add 10 percent at a cost of 20 percent more power. Etc. Instead they are likely to nibble at the margins, improving bandwidth (which is already quite good), adjusting what is in each execution pipeline to add issue flexibility, etc. Maybe they add a third level of super core at some point and throw one or two on the die for threads that truly need the highest possible performance, power be damned. They get more bang for the buck by beefing up other circuits, like GPU and neural engines, where there is still a lot of low hanging fruit.
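The tradeoff above can be written out as a quick efficiency calculation. This is only a sketch using the illustrative figures from the post (10% more performance for 30% or 20% more power); the option names and the function are made up for the example, and real designs would not scale this neatly.

```python
# Hypothetical design options, using the rough figures quoted above.
# Each entry: (relative performance gain, relative power increase).
options = {
    "twice as wide": (0.10, 0.30),
    "hyperthreading": (0.10, 0.20),
}


def perf_per_watt_change(perf_gain, power_increase):
    """Relative change in performance per watt versus the baseline design."""
    return (1 + perf_gain) / (1 + power_increase) - 1


results = {name: perf_per_watt_change(*vals) for name, vals in options.items()}
for name, change in results.items():
    print(f"{name}: {change:+.1%} performance per watt")
```

With those inputs, both options end up less efficient than the baseline (about 15% and 8% worse performance per watt respectively), which is the sense in which there is little low-hanging fruit left on the CPU side.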
 

jbailey

Power User
Posts
170
Reaction score
187
Over at Anandtech, Gerard Williams said he worked on Apple Silicon from A7 to A14 and M1.
There have been slow downs from A15. A16 was barely an improvement, performance increased due to node and higher clock. Its the same with A17.
I don't get this. I have an M2 MacBook Air which gets 2618 on Geekbench 6. My new A17 iPhone 15 Pro gets 2954, a nearly 13% improvement at a slightly higher clock (3485 MHz vs 3767 MHz, +8%). I know that Geekbench is not the end-all of benchmarks but it seems at least generally representative. I'm happy with 13% over a single generation.
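For anyone who wants to separate the clock contribution from the rest, here is the arithmetic on the numbers above. Score per MHz is only a crude stand-in for IPC (it assumes both chips actually ran at the listed clock for the whole test, which may not hold), and the variable names are just for this sketch.

```python
# Geekbench 6 single-core scores and clocks as quoted in the post.
m2_score, m2_mhz = 2618, 3485      # M2 MacBook Air
a17_score, a17_mhz = 2954, 3767    # A17 (iPhone 15 Pro)

score_gain = a17_score / m2_score - 1
clock_gain = a17_mhz / m2_mhz - 1

# Score per MHz as a rough per-clock proxy (an assumption, not measured IPC).
per_clock_gain = (a17_score / a17_mhz) / (m2_score / m2_mhz) - 1

print(f"score gain:     {score_gain:.1%}")
print(f"clock gain:     {clock_gain:.1%}")
print(f"per-clock gain: {per_clock_gain:.1%}")
```

That works out to roughly 12.8% overall, 8.1% from clock, and about 4.4% per clock by this proxy.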
 

leman

Site Champ
Posts
641
Reaction score
1,196
There have been slow downs from A15. A16 was barely an improvement, performance increased due to node and higher clock. Its the same with A17.

Not really though. It has been demonstrated over and over that every iteration of Apple CPUs improves the performance by ~250-300 GB6 points since at least A10 (or maybe even earlier). These numbers are very unlikely to be a coincidence.

The thing is that people expect the same percentage increase every iteration, which would mean that the improvements have to be exponential. This is clearly nonsensical. If one insists that Apple has to increase the performance by 20% every year, from years 9 to 10 they'd need to achieve roughly 4-5x as big an increase in performance as from year 1 to 2. I think folks just get too stuck on the fact that Apple did deliver 20% for some years in a row; that was where the additive and multiplicative parts of the curve were very similar. Some basic arithmetic on GB results over the years can easily verify this.
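A small sketch of the additive versus multiplicative point. The starting score of 1000 and the fixed step of 275 points are hypothetical values picked for illustration (275 is just the midpoint of the ~250-300 range mentioned above); only the 20% rate comes from the argument itself.

```python
def compound_increments(start, rate, steps):
    """Absolute points gained at each step under constant percentage growth."""
    scores = [start * (1 + rate) ** i for i in range(steps + 1)]
    return [b - a for a, b in zip(scores, scores[1:])]


def additive_percent_gains(start, step, steps):
    """Percentage gain at each step when a fixed number of points is added."""
    scores = [start + step * i for i in range(steps + 1)]
    return [(b - a) / a for a, b in zip(scores, scores[1:])]


# Nine steps cover year 1->2 through year 9->10.
inc = compound_increments(start=1000.0, rate=0.20, steps=9)
ratio = inc[-1] / inc[0]
print(f"20%/yr: the year 9->10 jump is {ratio:.1f}x the year 1->2 jump")

pct = additive_percent_gains(start=1000.0, step=275.0, steps=9)
print(f"+275/yr: first step is {pct[0]:.1%}, ninth step is {pct[-1]:.1%}")
```

With strict year 1 to year 10 indexing the multiplier comes out at about 4.3x (it is 1.2 to the 8th power), and a constant 275-point step shrinks from 27.5% to under 9% over the same span even though the absolute gain never changes.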
 

Yoused

up
Posts
5,624
Reaction score
8,943
Location
knee deep in the road apples of the 4 horsemen
Here is part of my GB5 chart

GHz   Model          Cores   Δsc from prev yr   Δmc from prev yr
1.5   A8X            3       36.0%              99.4%
2.2   A9X            2       71.4%              13.9%
2.1   A9X            2       -0.8%              -1.6%
2.3   A10X Fusion    6       29.2%              92.5%
2.5   A12X Bionic    8       33.9%              103.5%
2.5   A12Z Bionic    8       0.3%               0.2%
3.0   A14 Bionic     8       41.9%              -10.7%
3.2   M1             8       7.8%               73.3%
3.5   M2             8       11.2%              22.4%

Not counting Pro/Max/Ultra, these are the top-of-the-line performance SoCs (iPad Air/Pro). Each line shows the increase in single-core and multicore GB5 scores, on a yearly rhythm. The average is 25.7% Δsc and 43.7% Δmc, but you can see a non-small variation. Sometimes Apple ground ahead slowly, sometimes they exploded, even with Mr. SuperTalent.
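The averages can be reproduced from the chart rows with a few lines. The data below is copied from the chart as posted; the tuple layout and variable names are just for this sketch.

```python
from statistics import mean, pstdev

# (model, Δ single-core %, Δ multicore %) as listed in the chart.
rows = [
    ("A8X", 36.0, 99.4),
    ("A9X", 71.4, 13.9),
    ("A9X", -0.8, -1.6),
    ("A10X Fusion", 29.2, 92.5),
    ("A12X Bionic", 33.9, 103.5),
    ("A12Z Bionic", 0.3, 0.2),
    ("A14 Bionic", 41.9, -10.7),
    ("M1", 7.8, 73.3),
    ("M2", 11.2, 22.4),
]

sc = [r[1] for r in rows]
mc = [r[2] for r in rows]

mean_sc, mean_mc = mean(sc), mean(mc)
print(f"mean Δsc: {mean_sc:.1f}%  (spread: {pstdev(sc):.1f} points)")
print(f"mean Δmc: {mean_mc:.1f}%  (spread: {pstdev(mc):.1f} points)")
print(f"Δsc range: {min(sc):.1f}% to {max(sc):.1f}%")
```

The means match the 25.7% and 43.7% quoted, and the range and spread lines make the year-to-year variation explicit.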
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Not really though. It has been demonstrated over and over that every iteration of Apple CPUs improves the performance by ~250-300 GB6 points since at least A10 (or maybe even earlier). These numbers are very unlikely to be a coincidence.

The thing is that people expect the same percentage increase every iteration, which would means that the improvements have to be exponential. This is clearly nonsensical. If one insists that Apple has to increase the performance by 20% every year, from years 9 to 10 they'd need to achieve 5x as big increase in performance as from year 1 to 2. I think folks just get too stuck on the fact that Apple did deliver 20% for some years in the row — that was there the additive and multiplicative parts of the curve were very similar. Some basic arithmetics on GB results over the years can easily verify this.
Oh boy! The arguments I had on Twitter with MaxTech (my fault for engaging) and on Reddit, where I received a long diatribe saying how stupid I was and how it was obvious to everyone that percentage increases were the basis for the industry etc etc.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,150
Oh boy! The arguments I had on Twitter with MaxTech (my fault for engaging) and on Reddit, where I received a long diatribe saying how stupid I was and how it was obvious to everyone that percentage increase were the basis for the industry etc etc.
Also most actually new microarchitectures are every 2 years or so in the industry, not every year like Apple was doing at the beginning. And, as everyone has already said, Apple had a lot of low-hanging fruit in those years. As did their fabrication partners too, I might add. The progress Apple made in the 2010s was insane, but that was as much a product of where they were in their CPU development cycle as who was there (not to take away anything from their team, who did exceptional work). Expecting that level of progress to continue unabated forever was/is a little naive. I don’t care what level of geniuses they have or don’t have on the team.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Also most actually new microarchitectures are every 2 years or so in the industry, not every year like Apple was doing at the beginning. And, as everyone has already said, Apple had a lot low hanging fruit in those years. As did their fabrication partners too I might add. The progress Apple made in the 2010s was insane but that was as much a product of where they were in their CPU development cycle as who was there (not to take away anything from their team who did exceptional work). Expecting that level of progress to continue unabated forever was/is a little naive. I don’t care what level of geniuses they have or don’t have on the team.
Agreed. I’m unsure how much is these people actually expecting exponential increases in performance, and how much is posturing to justify the rage which brings the views and clicks.
 

dada_dave

Elite Member
Posts
2,164
Reaction score
2,150
Agreed. I’m unsure how much is these people actually expecting exponential increases in performance, and how much is posturing to justify the rage which brings the views and clicks.
Both? People remember the exponential increases but forget what came before. AMD with Zen had a huge exponential growth rate in single core, but again partly because they were starting out so far behind Intel. Apple was starting from the first ever 64 bit ARM v8 processor, produced over a decade after x86-64 development, with much tighter constraints on power and size. Intel recently spent a decade giving us Skylake refreshes while their processes got unfucked, and at least those processes are now getting up to speed, to give credit where it’s due. Basically that level of growth is neither natural nor never-ending. It happens, it will likely happen again at some point, but sometimes people take it for granted.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Both? People remember the exponential increases but forget what came before. AMD with Zen had a huge exponential growth rate in single core but again partly because they were starting out so far behind Intel. Apple was starting from the first ever 64 bit ARM processor produced over a decade after x86-64 development with much tighter constraints on power and size. Intel recently spent a decade giving us Skylake refreshes while their processes got unfucked. Basically that level of growth isn’t neither natural nor never-ending. It happens, it will likely happen again at some point, but sometimes people take it for granted.
I’m sure you’re correct.

Slightly OT but I’ve been seeing rumours that Zen 5 has a huge increase in IPC. The estimates say between 20%-40%. Seems hard to believe.
 