Superior External SSD TB/USB4 performance on Apple Silicon vs Intel

dada_dave

Elite Member
Posts
2,170
Reaction score
2,160
Integrating the Thunderbolt controller into the SOC yields dividends:

 

Citysnaps

Elite Member
Staff Member
Site Donor
Posts
3,696
Reaction score
8,996
Main Camera
iPhone
Integrating the Thunderbolt controller into the SOC yields dividends:


How does that jibe with supporting dual-lane USB 3.2 Gen 2x2, which supports 20 Gb/s rates, twice the normal 10 Gb/s rate?

Samsung's T9 SSDs support that. But from what I've read Apple apparently does not. At least yet.
 

dada_dave

How does that jibe with supporting dual-lane USB 3.2 Gen 2x2, which supports 20 Gb/s rates, twice the normal 10 Gb/s rate?

Samsung's T9 SSDs support that. But from what I've read Apple apparently does not. At least yet.
Hector’s post is based on the following article:


I’ve updated my title to be more accurate. If the External SSD connects via Thunderbolt/USB4 it gets the full speed.
 

dada_dave


Update
 

Nycturne

Elite Member
Posts
1,140
Reaction score
1,490
Hmm, I still am not sure I completely follow, but I think I get the broad strokes.

TBT is 40Gbps (5GBps), but it’s never fully stated if that’s “each way” or not. After doing some napkin math, I’m going to assume it is, as otherwise you can’t fit 25.6Gbps (3.2GBps). But it’s generally assumed that it should peak out at 32Gbps for PCIe data, with the rest free for DisplayPort as that comes into the controller over a separate link.

So if you put the TBT controller at the end of a PCIe 3.0 x4 link, then any traffic aimed at the controller itself eats into what you can provide for the devices on the bus. There’s also potentially the overhead of “wrapping” the TBT data in PCIe data, where you pay for addressing the controller’s PCIe address on top of addressing a TBT device address. I.e., you can never really take full advantage of the 32Gbps promised by Intel in this configuration.

Seems like it’s not clear how dedicated controllers vs Intel’s 2-port controllers play out here, but could also have an impact.
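
The napkin math in this post can be written out as a rough sketch. It only converts the figures quoted above between Gbps and GB/s (decimal units, 8 bits per byte) and checks the "each way" reasoning. The function name is illustrative and not taken from any tool.

```python
def gbps_to_gbytes(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


# Figures quoted in the post above.
TBT_LINK_GBPS = 40.0   # nominal Thunderbolt link rate
PCIE_CAP_GBPS = 32.0   # commonly cited PCIe share of the link
MEASURED_GBPS = 25.6   # reported SSD throughput (3.2 GB/s)

# If 40 Gbps were the total across both directions, one direction would get half.
per_direction_if_shared = TBT_LINK_GBPS / 2

# The measured figure only fits if 40 Gbps is available in each direction.
fits_if_shared = MEASURED_GBPS <= per_direction_if_shared
fits_if_each_way = MEASURED_GBPS <= TBT_LINK_GBPS

print(f"link: {gbps_to_gbytes(TBT_LINK_GBPS)} GB/s, "
      f"PCIe cap: {gbps_to_gbytes(PCIE_CAP_GBPS)} GB/s")
print(f"fits if 40 Gbps is shared by both directions: {fits_if_shared}")
print(f"fits if 40 Gbps is per direction: {fits_if_each_way}")
print(f"share of the 32 Gbps PCIe budget used: {MEASURED_GBPS / PCIE_CAP_GBPS:.0%}")
```

Under these numbers the measured 25.6 Gbps exceeds the 20 Gbps a shared link would leave per direction, which is why the "each way" reading is the one that fits.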
 

mr_roboto

Site Champ
Posts
290
Reaction score
469
Hmm, I still am not sure I completely follow, but I think I get the broad strokes.

TBT is 40Gbps (5GBps), but it’s never fully stated if that’s “each way” or not. After doing some napkin math, I’m going to assume it is, as otherwise you can’t fit 25.6Gbps (3.2GBps). But it’s generally assumed that it should peak out at 32Gbps for PCIe data, with the rest free for DisplayPort as that comes into the controller over a separate link.
TBT is full duplex yes - all the modern wired multi-gigabit comms standards I can think of are. (Wireless is a different story. WiFi is half duplex. Full duplex radio is hard since, while the radio is transmitting, its un-attenuated (by distance) transmit power is akin to a 130dB jet engine piped directly into the "ear" of the receive side of the radio.)

Hector's point is that lots of people assume that because Intel says certain things about TBT capabilities, they must be universal to all TBT implementations and baked into the spec at a deep level, when actually they are not.

TBT is a channel for moving packets around. It's designed around primarily being used to encapsulate packets from other packetized standards, most notably PCIe and DisplayPort. Each end of the link needs a bridge to the appropriate other kind of bus, and the bridge handles wrapping the "alien" packets in TBT framing.

The stuff you see about TBT handling 32 Gbps of PCIe comes from Intel's most common implementations, where the host side PCIe-TBT bridge includes a physical Gen3 x4 link to a host PCIe port. But there is no requirement that it actually be done that way, Intel only requires a minimum of 32 Gbps PCIe bandwidth. More is possible.

In Apple's case, there is no physical PCIe link at the host end. They have a PCIe root complex in the SoC, but PCIe is a layered networking spec and Apple can discard the layers it doesn't need. They don't have to include any of the physical layers at all, they can simply clock packets through an internal parallel SoC link into a FIFO for the TBT controller to read out at its own pace.

So, if Apple didn't internally bottleneck it anywhere, their TBT ports can theoretically do 40 Gbps PCIe, assuming the other end of the link's up to it. And as Hector pointed out, there's now some peripheral-end TBT chips whose PCIe bridge provides Gen4 x4 connectivity, meaning it's quite possible (at least in theory) to hit 40G. (You'd end up bottlenecked by TBT rather than the device-end Gen4 x4 link.)

You'll also see lots of people talking about TBT as if there's a fixed DisplayPort bandwidth allocation. I don't think any such thing needs to exist, instead they should just be giving DP packets absolute priority in the TBT bridge's transmission queue. That should automatically "allocate" exactly the amount of bandwidth required by the current resolution, color depth, and refresh rate of the DP device at the other end.
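
A minimal sketch of the strict-priority idea described above, assuming a 40 Gbps per-direction link and ignoring all framing overhead. The function name and the display demand used in the example are hypothetical, and nothing here is known to reflect how any real TBT controller schedules traffic.

```python
LINK_GBPS = 40.0  # nominal per-direction link rate quoted in the thread


def share_link(dp_demand_gbps: float, pcie_demand_gbps: float,
               link_gbps: float = LINK_GBPS) -> tuple[float, float]:
    """Strict priority: DisplayPort is served first, PCIe gets whatever is left."""
    dp = min(dp_demand_gbps, link_gbps)
    pcie = min(pcie_demand_gbps, link_gbps - dp)
    return dp, pcie


# No display attached: PCIe can use the whole link.
print(share_link(0.0, 40.0))

# Hypothetical display needing 12.5 Gbps: PCIe is limited to the remainder.
print(share_link(12.5, 40.0))
```

With this kind of policy there is no fixed reservation. The bandwidth "allocated" to DisplayPort simply tracks what the attached display demands, and PCIe throughput drops only by that amount.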
 

Nycturne

I’m mostly going to skim my response as there’s quite a bit here I was already well aware of.

TBT is full duplex yes - all the modern wired multi-gigabit comms standards I can think of are. (Wireless is a different story. WiFi is half duplex. Full duplex radio is hard since, while the radio is transmitting, its un-attenuated (by distance) transmit power is akin to a 130dB jet engine piped directly into the "ear" of the receive side of the radio.)

Ah, but there’s the other aspect here, is the measured/calculated speed for one or both directions in a full-duplex link? Phrased another way, is the speed the unidirectional or bidirectional speed? That’s more what I was getting at, and the public documentation is surprisingly vague on that point. And I’m not above suspecting that a company being vague might be using the bidirectional speed to make something look better rather than using the unidirectional speed that is more common. Especially when it is Intel. So I’m pleasantly surprised when that’s not the case.

The pinout of TB3 makes it pretty clear (to me) that it’s full-duplex.

Hector's point is that lots of people assume that because Intel says certain things about TBT capabilities, they must be universal to all TBT implementations and baked into the spec at a deep level, when actually they are not.

Which is what I gathered, yes.

The catch here is that the controllers in Apple Silicon are the first non-Intel implementation that I’m aware of. Even when discussing dock/hubs and the like, it’s been an Intel chip. I actually wish Hector commented on a specific bridge chip, as I haven’t seen any with PCIe 4 links yet, and Intel’s ARK still lists their most recent chips as using PCIe 3 links to the host/device. With a bridge chip identified, someone could do some more interesting experiments here.

So, if Apple didn't internally bottleneck it anywhere, their TBT ports can theoretically do 40 Gbps PCIe, assuming the other end of the link's up to it. And as Hector pointed out, there's now some peripheral-end TBT chips whose PCIe bridge provides Gen4 x4 connectivity, meaning it's quite possible (at least in theory) to hit 40G. (You'd end up bottlenecked by TBT rather than the device-end Gen4 x4 link.)

Except in the context of this conversation, it’s very unlikely that we are seeing more PCIe data going over the TBT bus than we’d expect in a Gen3 x4 arrangement. Instead, what we are likely seeing is better utilization due to lower overhead. A zero-overhead Gen3 x4 is 4GB/s, so when eclecticlight reports 3.2GB/s, we’re unlikely to be looking at a Gen4 device in this example. So I’m not sure we can say much about Apple’s implementation without a client device that isn’t itself a bottleneck. But I’d expect a fully unbottlenecked device to be pushing close to 4GB/s.

You'll also see lots of people talking about TBT as if there's a fixed DisplayPort bandwidth allocation. I don't think any such thing needs to exist, instead they should just be giving DP packets absolute priority in the TBT bridge's transmission queue. That should automatically "allocate" exactly the amount of bandwidth required by the current resolution, color depth, and refresh rate of the DP device at the other end.

I haven’t seen this, not even on the other place? It’s been generally assumed that this is the case in the discussions I’ve seen.
 

mr_roboto

Ah, but there’s the other aspect here, is the measured/calculated speed for one or both directions in a full-duplex link? Phrased another way, is the speed the unidirectional or bidirectional speed?
Unidirectional. Thunderbolt/USB4 uses four high speed serial pairs, each running at 20 Gbps unidirectional, so you get 2x20 out and 2x20 in. AFAIK, 20 is the line rate before accounting for line coding (probably a low overhead scheme like 64b66b or 128b130b) and packet overheads (header/footer, inter-packet gap if any is required).

That’s more what I was getting at, and the public documentation is surprisingly vague on that point. And I’m not above suspecting that a company being vague might be using the bidirectional speed to make something look better rather than using the unidirectional speed that is more common. Especially when it is Intel. So I’m pleasantly surprised when that’s not the case.
I understand where you're coming from on Intel, but they also created PCIe and the way they market Thunderbolt speeds is fairly consistent with PCIe. The one point of departure is that unlike PCIe, end user marketing materials don't always make it obvious it's a 2-lane link. I think that's an understandable simplification as TBT is always exactly x2; they can just say it's 40 Gbps and leave it at that.

The catch here is that the controllers in Apple Silicon are the first non-Intel implementation that I’m aware of. Even when discussing dock/hubs and the like, it’s been an Intel chip. I actually wish Hector commented on a specific bridge chip, as I haven’t seen any with PCIe 4 links yet, and Intel’s ARK still lists their most recent chips as using PCIe 3 links to the host/device. With a bridge chip identified, someone could do some more interesting experiments here.
ASMedia 2464PD supports Gen4 x4 on the device side:


Except in the context of this conversation, it’s very unlikely that we are seeing more PCIe data going over the TBT bus than we’d expect in a Gen3 x4 arrangement. Instead, what we are likely seeing is better utilization due to lower overhead. A zero-overhead Gen3 x4 is 4GB/s, so when eclecticlight reports 3.2GB/s, we’re unlikely to be looking at a Gen4 device in this example.
3.2 is about what you expect out of Gen3 x4 with line coding and packet overhead, so yeah, that sounds right.
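
As a rough illustration of where a figure in that range can come from: PCIe Gen3 runs at 8 GT/s per lane with 128b/130b line coding. The per-packet numbers below (128-byte payloads and about 24 bytes of framing, header, and CRC per packet) are assumptions chosen for the sketch, not values taken from the thread, and the model ignores flow-control and acknowledgement traffic.

```python
GEN3_GT_PER_LANE = 8.0        # PCIe Gen3 transfer rate per lane, in GT/s
LANES = 4
LINE_CODING = 128 / 130       # 128b/130b encoding efficiency


def line_coded_gbytes() -> float:
    """GB/s available on a Gen3 x4 link after line coding, before packet overhead."""
    return GEN3_GT_PER_LANE * LANES * LINE_CODING / 8


def payload_gbytes(payload_bytes: int = 128, overhead_bytes: int = 24) -> float:
    """GB/s of payload, assuming a fixed per-packet overhead (assumed values)."""
    efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return line_coded_gbytes() * efficiency


print(f"after line coding: {line_coded_gbytes():.2f} GB/s")
print(f"payload, 128-byte packets: {payload_gbytes():.2f} GB/s")
print(f"payload, 256-byte packets: {payload_gbytes(256):.2f} GB/s")
```

With the assumed 128-byte payloads the result lands in the low 3 GB/s range, while larger payloads push it closer to the line-coded ceiling just under 4 GB/s.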

So I’m not sure we can say much about Apple’s implementation without a client device that isn’t itself a bottleneck. But I’d expect a fully unbottlenecked device to be pushing close to 4GB/s
Interestingly enough, on looking up some benchmarks, that ASM2464PD chip is claimed to hit about 3.8 GB/s read on some Windows PCs, but supposedly not on Apple Silicon Macs, where it's around 3.1. The former number suggests that some of Intel's recent implementations got rid of the Gen3 x4 physical layer (makes sense since AFAIK they have SoC integrated TBT now), the latter I don't know what it means. Need more data points...

I haven’t seen this, not even on the other place? It’s been generally assumed that this is the case in the discussions I’ve seen.
I had seen it in discussions outside of both here and MR and got mixed up about what was said where.
 

Nycturne

Unidirectional. Thunderbolt/USB4 uses four high speed serial pairs, each running at 20 Gbps unidirectional, so you get 2x20 out and 2x20 in. AFAIK, 20 is the line rate before accounting for line coding (probably a low overhead scheme like 64b66b or 128b130b) and packet overheads (header/footer, inter-packet gap if any is required).

Yes, I figured that… Please don’t assume I haven’t figured this part out after doing the math. This is getting into “let me explain what you’ve already figured out” territory, so please stop.

ASMedia 2464PD supports Gen4 x4 on the device side:


Now we’re talking. This looks like it bills itself as a USB4 chipset, which is probably why it dodged my google queries.

3.2 is about what you expect out of Gen3 x4 with line coding and packet overhead, so yeah, that sounds right.

But it turns out the OWC 1M2, the enclosure in question, uses the 2464PD chip, so it should be capable of more than has been reported. Hmmmm….

Also interesting is that on a TB3 port, the 1M2 drops to just under 1GB/s as confirmed in reviews, suggesting it’s not fully TB3 compatible but instead drops to a USB3 mode. That’s… a little disappointing. But it makes sense if ASMedia is doing it to avoid licensing fees to Intel.

Interestingly enough, on looking up some benchmarks, that ASM2464PD chip is claimed to hit about 3.8 GB/s read on some Windows PCs, but supposedly not on Apple Silicon Macs, where it's around 3.1. The former number suggests that some of Intel's recent implementations got rid of the Gen3 x4 physical layer (makes sense since AFAIK they have SoC integrated TBT now), the latter I don't know what it means. Need more data points...

Agreed on the need for more data points. Because this suggests that Apple’s controllers still might be a bottleneck in terms of reservations/policy. At least on current hardware. It’d help knowing what PC hardware is delivering the higher speeds, exactly, to get a better handle on the differences.

EDIT: Speaking of more data points, the ZikeDrive has been tested and billed as one of these faster drives, but I found someone trying it with an M1 Ultra system and getting 3.1GB/s: https://www.fredmiranda.com/forum/topic/1826580

Another data point suggesting Apple’s controllers might have some reservation policies set.
 

Nycturne

Sorry, I genuinely didn't understand where you were coming from there. It's a problem I sometimes have.

No worries, I mostly just wanted to be firm but polite. But I think we have a pretty high engineer ratio in these threads, so we can afford to be a bit generous with each other. I can be somewhat terse sometimes as I am trying to avoid being pedantic.

My Computer Engineering may be quite rusty as I've clawed my way up the software stack during my career, but I'm not completely out of the game yet. :)
 

cbum

Power User
Posts
188
Reaction score
85
But I think we have a pretty high engineer ratio in these threads
Sure, but I would also consider that you, and other engineers, are not the only ones reading posts here. Within reason, I believe crafting responses with a bit of context to make them more widely understandable is a good thing.
 