I just learned that a recent version of ARMv9 includes the equivalents of x86 rep movs and rep stos, but without the D flag (forward-only).
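To make "forward-only" concrete, here is a minimal C sketch of what such operations do semantically, assuming a byte-at-a-time copy that always starts at the lowest address (the x86 behavior with the D flag clear). The names copy_forward and fill_forward are made up for illustration; they are not part of any instruction set or library, and real hardware is free to move data in much larger chunks.

```c
#include <stddef.h>

/* Hypothetical model of a forward-only block copy (like rep movsb with
 * the direction flag clear): bytes are copied lowest address first. */
void *copy_forward(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--) {
        *d++ = *s++;   /* both pointers only ever move upward */
    }
    return dst;
}

/* Hypothetical model of a forward-only block fill (like rep stosb with
 * the direction flag clear): the same byte is stored n times. */
void *fill_forward(void *dst, unsigned char value, size_t n)
{
    unsigned char *d = dst;

    while (n--) {
        *d++ = value;
    }
    return dst;
}
```

The practical consequence of having no backward mode shows up with overlapping buffers: in this byte-wise model, copying 3 bytes from buf to buf + 1 in "abcdefg" smears the first byte forward and leaves "aaaaefg", where a memmove-style copy would produce "aabcefg".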
I have mixed feelings about this. It does fit cleanly into the instruction set encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a modification to the retire pipeline that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than an equivalent software loop.
Still, it kind of feels like it breaks a rule.