M1 vs M1 Max - macOS Archive Utility

aeronatis

When I was testing an M1 Mac against an M1 Pro and an Intel MacBook Pro 16" (i9-9880H), I realised that both the M1 and the M1 Pro finished compressing my 30 GB of mixed content in exactly the same time. I haven't used any other compression tool yet. Does this mean the macOS Archive Utility does not utilise more than 4 performance cores for now?
 

Cmaier

It seems to me that compression times are gated by I/O (reading/writing “disk”). The .zip algorithm is very simple and certainly takes less time than the file access.
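
One rough way to check whether a compression run is gated by I/O is to compare CPU time with wall-clock time. The sketch below is only an illustration using Python's zlib, not what Archive Utility does internally; the function names and the 0.8 threshold are assumptions made up for this example.

```python
import time
import zlib


def classify(cpu_seconds, wall_seconds, threshold=0.8):
    """Label a run by the share of wall-clock time that was spent on the CPU."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return "cpu-bound" if cpu_seconds / wall_seconds >= threshold else "io-bound"


def time_compression(path, chunk_size=1 << 20):
    """Deflate a file chunk by chunk and return (cpu_seconds, wall_seconds)."""
    compressor = zlib.compressobj()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    with open(path, "rb") as src:
        while chunk := src.read(chunk_size):
            compressor.compress(chunk)  # output discarded, only timing matters
    compressor.flush()
    return time.process_time() - cpu_start, time.perf_counter() - wall_start
```

If most of the wall-clock time is not CPU time, the process was mostly waiting on file access. Note that a second run over the same files may be served from the file cache and look more CPU-bound than the first.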

Also, welcome!
 

aeronatis

Thank you so much! It's good to be here.

I thought about that as well; however, that leads to the conclusion that the faster SSD provides no improvement over the one in the M1 Mac, at least in terms of small random reads/writes. Am I wrong?

I am quite obsessive when it comes to determining the root cause of the test results. Sorry if I'm being unreasonable 😅
 

Cmaier

That could be right. The increased bandwidth helps when reading or writing large files, but given the sort of thing you are testing it probably doesn’t help much, because the access times may be dominated by the “seek” time.
 

Nycturne

It’s also the case that file compression is a very serial task *and* I/O bound. Sort of the worst of both worlds. It’s not something that parallelizes well, especially when using older algorithms and file formats where you want to avoid file fragmentation.

EDIT: If you look in Activity Monitor, you should see that the ArchiveService caps out at ~100% CPU while compressing, confirming it is single threaded.
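
As an illustration of one reason a single compressed stream is serial: each chunk can refer back to data that came before it, so cutting the input into independently compressed pieces (which is what parallel workers would do) loses those back-references. This is a minimal sketch with Python's zlib and made-up test data, not a description of how ArchiveService works.

```python
import random
import zlib


def stream_size(chunks):
    """Compress the chunks as one stream; later chunks can reference earlier data."""
    compressor = zlib.compressobj()
    return sum(len(compressor.compress(c)) for c in chunks) + len(compressor.flush())


def independent_size(chunks):
    """Compress every chunk on its own, as separate parallel workers would."""
    return sum(len(zlib.compress(c)) for c in chunks)


rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(8 * 1024))
chunks = [block, block]  # the second chunk repeats the first

print(stream_size(chunks), independent_size(chunks))
```

The streamed result is smaller because the second chunk is encoded almost entirely as references to the first. Formats designed for parallel compression accept that loss or work on much larger blocks.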
 

casperes1996

I have not tested this and have no evidence that it is so, but another possibility is that Apple sets the QoS for decompression tasks to background and intentionally runs them only on an efficiency core.
 

Cmaier

Welcome!
 

Nycturne

One thing to point out is that it doesn't really matter which cores are used if the cores are the same. Single threaded means that an M1, M1 Pro and M1 Max will all have roughly the same compression performance as one another.

Thankfully, I have tested this though. It uses the performance cores. Apple’s guidelines for QoS are pretty clear, with maybe the exception of utility vs background. Someone asking for a file to be (de)compressed is clearly meant for the “user-initiated” QoS, which will stick to the performance cores. EDIT: It might help to understand that Apple's QoS is more about "why" than "what". The same task can be considered user-initiated or background, depending on the context. Fetching e-mail for example. It should be utility/background QoS when running that "every 15 minutes" fetch, but user-initiated when the user clicks the "fetch e-mail" button.
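
The "why, not what" idea can be sketched in a few lines. This is a toy model in Python rather than Apple's API; the enum and the function name are hypothetical and only mirror the e-mail example above.

```python
from enum import Enum


class QoS(Enum):
    USER_INITIATED = "user-initiated"
    UTILITY = "utility"


def qos_for_mail_fetch(user_clicked_button):
    """Same work, different QoS: the reason for the fetch decides the class."""
    return QoS.USER_INITIATED if user_clicked_button else QoS.UTILITY
```

The fetch code itself is identical in both cases; only the label handed to the scheduler changes.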
 

casperes1996

Very true.
With regard to threading, though, I'm still not entirely sure of the behaviour of the macOS scheduler on M1 (Pro/Max). What we set the QoS to in code is just a hint to the scheduler, and ultimate control of which cores run what is left up to the scheduler. It would, for example, be entirely possible to create 4 low-priority background threads and for the scheduler to start working on two of them on the efficiency cores of an M1 Pro/Max without pushing the two remaining jobs to the performance cores, instead opting to just time-slice the threads on the two efficiency cores (and to use all 4 efficiency cores on an M1).
Again, it's not something I've tested extensively, but a heterogeneous architecture means I'm throwing out my assumptions about how the scheduler works from Intel Macs. All I knew about XNU scheduling was fairly old anyway.

Orthogonal to everything else in this discussion, but it is also interesting to consider that Swift's new concurrency model creates as many kernel threads as there are CPU cores (counting SMT threads where relevant), and if the program itself gives opportunity for more concurrent operations than that, they are managed by the userspace runtime instead of a traditional threading model.
 

Nycturne

The part about time-slicing background threads on the efficiency cores, rather than pushing them to the performance cores, is exactly what is happening in my experience so far: efficiency cores running at 100% under load, but nothing getting moved off them to the idle performance cores, starving the background threads/processes instead.

The truth is, I don’t think Apple’s asymmetric scheduler is as clever as some people have been guessing (or just outright stating as fact) over on the other site. Apple themselves say that improperly setting QoS can hurt efficiency and responsiveness. But that is also the beauty of the QoS system Apple picked. The QoS levels are based on importance in terms of responsiveness for the user, which maps them more cleanly onto how Apple’s scheduler wants to assign threads.

If you want clever, look at the “Thread Director”. :p

When I worked in the mobile phone OS space, one of the biggest battery life killers was all these small tasks that wake up the CPU periodically. Background services, doing network heartbeats so push notifications continued to work, etc, etc. Being able to offload all that onto efficiency cores is a huge win as the performance cores get more power hungry and faster. And on iOS, being able to move background apps to those cores means the performance cores are free to deal with the foreground app without interference from background refresh, geolocation services, etc.

casperes1996 said:
Orthogonal to everything else in this discussion but also interesting to consider that Swift's new concurrency model creates as many kernel threads as there are CPU cores (with SMT if relevant) and if the program itself gives opportunity for more concurrent operations than that, it's managed by the userspace runtime instead of a traditional threading model

The difference is that GCD concurrent queues brute force their way around stalls by creating more threads. Swift concurrency doesn’t have to do that, but the catch is that the work is cooperatively multithreaded, meaning your time slices can get a bit wonky once you run out of threads in the pool to use. Be careful of busy loops, especially. But if you can avoid pitfalls like that, you do get the benefit of not having to have the kernel context switch as often, yes.
 

casperes1996

I've mostly coded for Apple platforms, though I've also done quite a bit of general Unix/POSIX C and low-level systems programming. But I know nothing really of the Windows world; does Windows have APIs that allow you to specify a QoS for threads? Or even just a regular raw numeric niceness or something? When I looked into Intel's Thread Director for Alder Lake, I got the impression that it was somewhat closely tied to a partnership with Microsoft, though they were going to push for Linux utilising the Thread Director in some form too. I've not really been able to gather much information about what the claimed advantage really is, though. What does the Thread Director really give the OS scheduler to work with that it doesn't already know for determining how to schedule tasks?

Nycturne said:
The difference is that GCD concurrent queues brute force their way around stalls by creating more threads. Swift concurrency doesn’t have to do that, but the catch is that the work is cooperatively multithreaded, meaning your time slices can get a bit wonky once you run out of threads in the pool to use. Be careful of busy loops, especially. But if you can avoid pitfalls like that, you do get the benefit of not having to have the kernel context switch as often, yes.

Exactly. Though the co-operative nature is also somewhat opaque. If you mark your function async, that's about all you need to do, and the compiler will insert yield points in your function. It's not like calling a yield yourself to give up your time slice: it can happen anywhere in your function, and even if it is deterministic you shouldn't rely on where it might happen, since the compiler could change it with another optimisation level or just from code changes around the project.
 

Nycturne

Compiler won't insert suspension points except during an await, which is standard behavior of languages that have async/await (source: docs.swift.org):

When calling an asynchronous method, execution suspends until that method returns. You write await in front of the call to mark the possible suspension point. This is like writing try when calling a throwing function, to mark the possible change to the program’s flow if there’s an error. Inside an asynchronous method, the flow of execution is suspended only when you call another asynchronous method—suspension is never implicit or preemptive—which means every possible suspension point is marked with await.
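
The same rule can be seen in other async/await languages. The sketch below uses Python's asyncio as an analogy only: it runs on a single thread, whereas Swift uses a pool of threads, but in both cases a task gives up control only at an explicit await. The worker names are made up for the example.

```python
import asyncio


async def worker(name, log, yield_between_steps):
    for step in range(3):
        log.append(f"{name}{step}")
        if yield_between_steps:
            await asyncio.sleep(0)  # explicit suspension point


async def run(yield_between_steps):
    log = []
    await asyncio.gather(
        worker("a", log, yield_between_steps),
        worker("b", log, yield_between_steps),
    )
    return log


print(asyncio.run(run(False)))  # ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
print(asyncio.run(run(True)))   # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

Without an await inside the loop, the first worker runs to completion before the second one starts, which is the same reason a busy loop can hog a thread in a cooperative pool.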

casperes1996 said:
I've mostly coded for Apple platforms though also quite a bit of general Unix/Posix C and low level system programming. But I know nothing really of Windows' world; Does Windows have APIs that allow you to specify a QoS for threads? Or even just a regular raw numeric niceness or something? When I looked into Intel's Thread Director thingy for Alder Lake I got the impression that it was somewhat closely tied to a partnership with Microsoft though they were going to push for Linux utilising the Thread Director in some form too. I've not really been able to gather that much information about what the claimed advantage really is though? Like what is it the Thread Director really gives the OS scheduler to work with that it doesn't already know for determining how to schedule tasks?
My Windows is a bit rusty at this point, but my understanding is that it is effectively similar to POSIX, although it's not called the same. macOS uses the same mechanism, to be honest. The difference is the semantic meanings overlaid on top of the integer values that iOS/macOS uses which is more explicit.

Thread Director includes a microcontroller on the CPU itself that helps assign threads, taking some control away from the OS, and allowing heuristics to be applied in the decision making. I've not done a ton of reading, but what I have read makes me think Intel wanted to avoid making too many required changes to the OS. They wanted "drop in and go" compatibility, and the ability to apply some basic machine learning models to the thing. I think it just winds up being more clever than it really needs to be, but it also has different goals than Apple's efficiency cores.
 

casperes1996

Huh; I stand corrected. I could've sworn they said during WWDC that we should write async functions under the assumption that the function could yield at any time. Though I guess that may still be considered good practice since someone could come along and insert an await call later
Nycturne said:
My Windows is a bit rusty at this point, but my understanding is that it is effectively similar to POSIX, although it's not called the same. macOS uses the same mechanism, to be honest. The difference is the semantic meanings overlaid on top of the integer values that iOS/macOS uses which is more explicit.

Thread Director includes a microcontroller on the CPU itself that helps assign threads, taking some control away from the OS, and allowing heuristics to be applied in the decision making. I've not done a ton of reading, but what I have read makes me think Intel wanted to avoid making too many required changes to the OS. They wanted "drop in and go" compatibility, and the ability to apply some basic machine learning models to the thing. I think it just winds up being more clever than it really needs to be, but it also has different goals than Apple's efficiency cores.
True. But the way thread priority is managed on macOS ties into the semantic naming of the QoS levels, so if it also impacts things like what type of core you run on, it's no longer as straightforward as "how early in the queue do you sit and how big a time slice do you get".

From what I had heard, the Thread Director makes no decisions. It just feeds information to the OS scheduler for it to make decisions with. But I don't know. Haven't found good, detailed sources on it. If I get the time I'll check if Intel has a proper manual out for OS developers on it. Or if there are well documented PRs to Linux or something that describe how it's used
 

mr_roboto

This Anandtech article has a link to the Intel documentation:


Search for EHFI (Enhanced Hardware Feedback Interface). It looks like it's mostly about the hardware notifying the OS of changes to the performance and efficiency characteristics of cores. It provides a data table and two notification mechanisms (one based on polling, one on interrupts) to let the scheduler know when the table has changed. It's up to the OS to decide what to do with the info, or whether to use it at all.
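
To make the division of labour concrete, here is a toy model of the polling variant: the hardware writes a table and sets a flag, and the OS decides what to do with it. All names, fields and numbers are hypothetical and are not taken from the Intel documentation.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackTable:
    """Toy stand-in for a hardware-written per-core capability table."""
    rows: dict = field(default_factory=dict)  # core id -> (performance, efficiency)
    changed: bool = False                     # status flag the OS can poll

    def hardware_update(self, core, performance, efficiency):
        self.rows[core] = (performance, efficiency)
        self.changed = True


def poll(table, cached):
    """OS side: refresh the cached snapshot only when the flag is set."""
    if table.changed:
        cached = dict(table.rows)
        table.changed = False
    return cached


def pick_core(snapshot, prefer="performance"):
    """One possible policy; the OS is free to ignore the table entirely."""
    index = 0 if prefer == "performance" else 1
    return max(snapshot, key=lambda core: snapshot[core][index])
```

The table only describes the cores. Any placement decision, such as pick_core here, stays on the OS side.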
 

casperes1996

Right. That matches the impression I had gotten for how it works; Mostly from Anandtech too, haha. Thanks for the link and help in finding the relevant bits. I'm too sleepy to properly digest the info in the Intel manual right now, but will have a look later at specifically how the interface works.

One thing I've been wondering about in all this is the potential risk of adding more complexity to the scheduler. After all, it runs rather frequently, so more complex logic that slows down its ability to schedule tasks could be problematic. I'm sure all modern operating systems have rather good scheduler implementations, and they wouldn't go and muck about with them in a dumb way where the scheduler suddenly has seven billion branches with inner loops and O(!n^4) complexity or something, but still. Efficiency cores may help with reducing power draw, but if the x86/Wintel approach winds up using both a new dedicated hardware block and more complicated scheduler logic that eats more cycles to run, isn't some of the efficiency gain also going to be lost there?

Anandtech (or maybe it was Gamers Nexus?) also recently discussed how Windows' scheduler prioritises the foreground application process in a way that may not always be desirable. It could potentially mean that your actually important render job in the background starts running on efficiency cores while your mostly idle word processor in the foreground reserves the performance cores. Probably not to the degree where it would reserve all the performance cores, but perhaps at least a single one, where the task could easily be handled satisfactorily by an efficiency core.

All interesting, but would also be nice to know more about how XNU's scheduler manages the core topology. Though I don't believe there's any good documentation available other than going digging through the open source code Apple puts out which can be quite hard to dig through without accompanying documentation.

Now that I'm already ranting a bit here, Alder Lake also made me think about AVX-512 and how it's technically available on the P cores but not the E cores. Their solution for now is to disable it completely, unless you disable the E cores and enable it in the BIOS (apparently not officially endorsed by Intel). I see two potential solutions to that problem. Solution 1) Migrate AVX-512 code to the P cores upon an illegal-instruction trap. The simple version: always migrate, and if the trap occurs again on a P core, actually kill the process. The more advanced version: check the instruction at RIP during the trap to see whether it was AVX-512 related before migrating the thread. Solution 2) Implement AVX-512 in software in the trap handler, similar to how x87 emulation was part of the OS in the old days when you didn't have x87 hardware. The scheduler should still prioritise migration, since AVX code probably wants performance and emulation would be very slow in comparison, but it would allow all the cores to support the same instructions in some sense at least. If all a thread needed (in some weird situation) was to perform a single AVX-512 instruction and then move on to regular work, it could be done without migration.
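
The simple "always migrate" variant of the first solution can be sketched as a toy simulation. Instruction names, core labels and return strings below are all hypothetical; a real implementation would live in the kernel's trap handler.

```python
SUPPORTED = {
    "P": {"scalar", "avx2", "avx512"},
    "E": {"scalar", "avx2"},
}


def run_thread(instructions, start_core="E"):
    """Migrate to a P core on the first trap; kill if it traps on a P core too."""
    core = start_core
    for instruction in instructions:
        if instruction not in SUPPORTED[core]:  # illegal-instruction trap
            if core == "P":
                return "killed"                 # genuinely invalid instruction
            core = "P"                          # migrate, then retry the instruction
            if instruction not in SUPPORTED[core]:
                return "killed"
    return f"finished on {core}"


print(run_thread(["scalar", "avx512", "scalar"]))  # finished on P
```

In this sketch a migrated thread stays on the P core afterwards. The software-emulation idea would replace the migration step with an emulator call and leave the thread where it is.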
 

Nycturne

I stand corrected, good to know. Still seems like it's trying to be overly clever, to be honest.

casperes1996 said:
One thing I've been wondering about all this is the potential risk of adding more complexity to the scheduler. After all it runs rather frequently so more complex logic in the scheduler and slowing down it's ability to schedule tasks could be problematic. I'm sure all modern operating systems have rather good scheduler implementations and they wouldn't go mock about with it in a dumb way where it suddenly has seven billion branches with inner loops and O(!n^4) complexity or something, but still. Efficiency cores may help with reducing power draw, but if the x86/Wintel approach winds up using both a new dedicated hardware block and more complicated scheduler logic that eats more cycles to run, aren't some of the efficiency gain also going to be lost there?

Based on what I'm reading, you might have answered your own question here. The microcontroller's job can be done separately from the scheduler, so it's possible the scheduler just looks at whatever snapshot the microcontroller has at that point in time. Because new threads are put on the P cores by default, unless they have to spill over to the E cores or the readings suggest they are better suited to the E cores, this is generally okay.

For me, the bigger concern with the Intel approach here is that it's still hardware trying to understand how the cores are being used, and then make recommendations on what to do with threads based on what it sees flowing through the pipeline. i.e. it's attempting to infer what the best place to put a thread is based on existing usage, and not so much based on the priority of the work itself (although I guess the OS could override if it so chooses).

casperes1996 said:
All interesting, but would also be nice to know more about how XNU's scheduler manages the core topology. Though I don't believe there's any good documentation available other than going digging through the open source code Apple puts out which can be quite hard to dig through without accompanying documentation.
Best documentation is code, honestly. XNU's scheduler would take me a bit more time to understand, but the basics are pretty straight-forward. Core clusters are assigned to processor sets, and assigned to the P and E category. Threads are given recommendations based on a few factors, including scheduler flags that can bind a thread to a particular core type, the thread priority, current scheduler policy, and the thread group the thread belongs to, depending on the scheduler policy. There's some neat bits there that suggest kernel task threads are primarily assigned to the E cores, and that while utility and bg are by default limited to the E cores, the kernel can adjust the policy from that default depending on conditions, and have them follow the thread group instead. This last bit makes sense since threads within a thread group are likely to be accessing shared memory, so there's useful cache affinities that can potentially be exploited. It looks like this policy can be expanded with more modes in the future, but Apple hasn't as of macOS 11.5.

But then there's the "spill, steal, rebalance" part of the scheduler. If the P cores are overloaded, threads can spill over to the E cores. In addition to a sort of "push" mechanism with spilled threads, E cores can "pull" by stealing threads waiting to run on a P core to keep latency of these higher priority threads low. Rebalancing is the act of pulling threads meant for the P cores back onto those cores as they become idle and can start taking back the spilled threads.

Note that there's no process here for elevating a thread from an E core to a P core if it is recommended to run on an E core. However, E cores are free to take on work meant for the P cores if none are available (something we already knew), even further pushing out lower priority work that can only run on the E cores. This clearly is the mechanism that would let me create a GCD concurrent queue with "user initiated" priority, load it down with work, and saturate both the P and E cores until that work was completed.
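
The placement rules described in this post can be condensed into a toy model. It is a sketch of the behaviour as described here, not of XNU's code: the function and its tuple format are made up, and rebalancing is only implicit because placement is recomputed from scratch on every call.

```python
def place(threads, p_cores, e_cores):
    """threads: list of (name, recommendation), recommendation is 'P' or 'E'.

    Returns (on_p, on_e, waiting) for one scheduling round.
    """
    p_work = [name for name, rec in threads if rec == "P"]
    e_work = [name for name, rec in threads if rec == "E"]
    on_p = p_work[:p_cores]
    spilled = p_work[p_cores:p_cores + e_cores]        # P work overflowing to E cores
    on_e = spilled + e_work[:e_cores - len(spilled)]   # E work gets what is left
    running = set(on_p) | set(on_e)
    waiting = [name for name, _ in threads if name not in running]
    return on_p, on_e, waiting


background = [(f"bg{i}", "E") for i in range(4)]
print(place(background, p_cores=8, e_cores=2))
# ([], ['bg0', 'bg1'], ['bg2', 'bg3'])
```

With four background threads on a hypothetical 8P+2E machine, two run and two wait even though every P core is idle, which matches the behaviour reported earlier in the thread.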
 

casperes1996

I normally agree and am typically an ambassador for "Comments get outdated, code speaks the truth" and "Clean code that tells me what it's doing with no documentation is better than bad code with good documentation". But I've tried reading XNU/Darwin code before, and frankly sometimes it would be nice to have some helicopter-view, high-level documentation.

As for everything else, have you read those things in various articles? Or does it mostly stem from experimentation, official documentation or reading through XNU/Darwin?

So last time I properly looked at the scheduling system, which is quite some years ago, I concluded it was essentially a multi-level feedback queue. Now a lot of dynamic priority scheduling systems, including to my knowledge XNU at least at the time, will increase the priority of threads that frequently yield, under the logic that it's most likely a UI program waiting for input so frequently letting it run, check if it needs to do anything and then sleep again makes for a good responsive system. On the flip side a thread that eats its entire time slice without any I/O bound waiting or voluntarily yielding has its dynamic priority reduced, usually with some bounds relative to user set nice values and such.
This may happen at a process level rather than a thread level, and I'm also unsure how Mach and the BSD layer work together here, since a niceness value and the idea of a process sit in the BSD layer while Mach is responsible for scheduling. What I'm really getting at is this: if the type of core you run on is determined by priority, then this sort of dynamic priority rebalancing could put CPU-intensive tasks on the E cores, since they eat up all their CPU cycles and get lowered priority, while I/O-bound processes have their priority increased and then land on the P cores. That isn't in line with the behaviour I've seen and would be nonsensical, so either the priority rebalancing system is different now or there's something more going on than just priority balancing.
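
The feedback rule described above (lower the priority of a thread that burns its whole time slice, raise it for one that yields early) can be sketched as follows. The priority scale, step size and bounds are hypothetical, and a real multi-level feedback queue also keeps separate run queues per level.

```python
def adjust_priority(priority, used_full_slice, floor=0, ceiling=10):
    """Demote CPU hogs by one level, promote threads that yield early."""
    if used_full_slice:
        return max(floor, priority - 1)
    return min(ceiling, priority + 1)


def simulate(behaviours, rounds, start=5):
    """behaviours maps a thread name to True if it always uses its whole slice."""
    priorities = {name: start for name in behaviours}
    for _ in range(rounds):
        for name, cpu_hog in behaviours.items():
            priorities[name] = adjust_priority(priorities[name], cpu_hog)
    return priorities


print(simulate({"render": True, "editor": False}, rounds=3))
# {'render': 2, 'editor': 8}
```

If core type were chosen from this dynamic value, the render job would drift towards the E cores, which is the nonsensical outcome the post points out.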
 

Nycturne

To speak to your comment about priority rebalancing, here's something to consider: the scheduler priority isn't used when recommending P or E cores for a thread, only the base priority. So this sort of rebalancing wouldn't have the effect of changing the recommendation.
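
A minimal sketch of that distinction, assuming a made-up priority scale and cutoff: the recommendation reads only the base priority, so however far the scheduler priority decays, the core type does not change.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    base_priority: int   # derived from QoS; hypothetical scale
    sched_priority: int  # moves up and down with recent CPU usage


E_CORE_CUTOFF = 20  # hypothetical: base priorities at or below this go to E cores


def recommend(thread):
    """Core-type recommendation uses the base priority only."""
    return "E" if thread.base_priority <= E_CORE_CUTOFF else "P"


worker = Thread(base_priority=31, sched_priority=31)
worker.sched_priority = 5  # decayed after burning CPU
print(recommend(worker))   # P
```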

I've mostly been reading through the code since I'm curious. There are certainly pieces that I haven't gleaned yet, but so far the key mechanisms I see in the code match up with the sort of analysis that eclecticlight.co has been doing on the scheduler's behavior. About the only thing I've read in the code but haven't seen in action yet is the logic for keeping threads within the same group on the same CPU cluster, possibly because it's not a common case to have threads in a group with different base priorities. However, it also applies to keeping those threads on the same P cores rather than having them spread across the two different CPU clusters on the M1 Pro/Max.
 