Friday, 5 May 2017

performance - How can the rep stosb instruction execute faster than the equivalent loop?



How can the instruction rep stosb execute faster than this code?



    Clear:
        mov byte [edi], AL    ; Write the value in AL to memory
        inc edi               ; Bump EDI to next byte in the buffer
        dec ecx               ; Decrement ECX by one position
        jnz Clear             ; And loop again until ECX is 0


Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?
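
For reference, the rep stosb version I'm comparing against looks roughly like this (dest_ptr, byte_count and fill_value are just placeholder names; this assumes the direction flag is clear so the fill goes upward):

    mov edi, dest_ptr    ; EDI = destination address (placeholder)
    mov ecx, byte_count  ; ECX = number of bytes to fill (placeholder)
    mov al, fill_value   ; AL  = the byte to store (placeholder)
    cld                  ; clear DF so EDI counts upward
    rep stosb            ; store AL to [EDI], ECX times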


Answer



In modern CPUs, the microcoded implementation of rep stosb and rep movsb actually uses stores that are wider than 1 byte, so it can go much faster than one byte per clock.



(Note this only applies to stos and movs, not repe cmpsb or repne scasb. Those are still slow, unfortunately: at best about 2 cycles per byte compared on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.




See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb (or a less-bad scalar bithack) for a strlen on a buffer that happens to get large, and a simple SIMD alternative.)






rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is only tuned for medium-to-small memsets, and it's better to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
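
As a rough illustration, a dword-granularity fill might look like this (a sketch only; it assumes the fill byte is in AL, the byte count in ECX is a multiple of 4, and EDI points to the destination):

    movzx eax, al           ; zero-extend the fill byte into EAX
    imul  eax, 0x01010101   ; replicate it into all four byte lanes
    shr   ecx, 2            ; convert the byte count to a dword count
    cld                     ; store upward
    rep stosd               ; store EAX to [EDI], ECX times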



When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.



See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.




See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)






A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD Bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition were a cmp/jcc on EDI, an AMD core could macro-fuse the compare-and-branch; see the sketch below.)
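
For example, the loop could be restructured to compare against an end pointer so the loop branch is a cmp/jcc pair (a sketch; EDX here is assumed to hold the one-past-the-end address of the buffer):

    Clear2:
        mov byte [edi], al
        inc edi
        cmp edi, edx          ; cmp/jne can macro-fuse on Bulldozer-family
        jne Clear2            ; (dec/jnz cannot)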






One major feature of so-called Fast String operations (rep movs and rep stos on Intel P6- and SnB-family CPUs) is that they avoid the read-for-ownership cache-coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)
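
For contrast, here's what an explicit NT-store fill looks like; unlike rep stos, it needs an sfence afterwards because the stores are weakly ordered (a sketch, assuming EDI is 16-byte aligned, the byte count in ECX is a multiple of 16, and XMM0 already holds the replicated fill pattern):

    NTFill:
        movntdq [edi], xmm0   ; non-temporal store: bypasses the cache, no RFO
        add     edi, 16
        sub     ecx, 16
        jnz     NTFill
        sfence                ; make the weakly-ordered NT stores globally
                              ; visible before any later stores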




IDK how good AMD's implementation is.






(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per clock.)



See the optimization resources (esp. Agner Fog's guides) linked from the tag wiki.







Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out of order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast-string op, so any reordering is only visible between stores made by the string op, not between it and other stores, i.e. you still don't need sfence after rep movs.
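
If you want to gate a rep movsb / rep stosb path on that feature at runtime, the ERMSB flag is reported in CPUID.(EAX=7,ECX=0):EBX bit 9. A minimal sketch (it assumes CPUID leaf 7 is supported; a full check would verify the maximum supported leaf first):

    mov  eax, 7
    xor  ecx, ecx
    cpuid
    bt   ebx, 9          ; CF = ERMSB feature flag
    jnc  .no_ermsb       ; fall back to another memcpy/memset strategy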



So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.



For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.



Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.
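
For comparison, a basic unaligned SSE copy loop of the kind those manuals weigh rep movsb against might look like this (a sketch; assumes ESI/EDI are the source/destination pointers and the byte count in ECX is a multiple of 16):

    CopyLoop:
        movdqu  xmm0, [esi]   ; unaligned 16-byte load
        movdqu  [edi], xmm0   ; unaligned 16-byte store
        add     esi, 16
        add     edi, 16
        sub     ecx, 16
        jnz     CopyLoop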







For more info on what's done in practice, see the code of an optimized memset/memcpy implementation (e.g. Agner Fog's library).

