Sunday, 27 November 2016

performance - What setup does REP do?



Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement":





The performance characteristics of using REP string can be attributed to two components:
startup overhead and data transfer throughput.



[...]



For REP string of larger granularity data transfer, as ECX value
increases, the startup overhead of REP String exhibit step-wise increase:




  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles,


  • Fast string (ECX >= 76:
    excluding REP MOVSB): the processor implementation provides hardware
    optimization by moving as many pieces of data in 16 bytes as possible.
    The latency of REP string will vary if one of the 16-byte data
    transfers spans across a cache line boundary:




    • Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles,

    • Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.



  • Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has
    a startup cost of about 15 cycles plus one cycle for each iteration of
    the data movement in word/dword/qword.





(emphasis mine)



There is no further mention of such a startup cost. What is it? What does it do, and why does it always take more time?



Answer



The rep movs microcode has several strategies to choose from. If the src and dest don't overlap closely, the microcoded loop can transfer in larger 64-bit chunks. (This is the so-called "fast strings" feature introduced with P6 and occasionally re-tuned for later CPUs that support wider loads/stores.) But if dest is only one byte from src, rep movs has to produce the exact same result you'd get from that many separate movs instructions.



So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.
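For intuition, that kind of dispatch resembles what a software memmove does before picking a bulk-copy strategy. Here is a hypothetical C sketch, illustrative only: the thresholds are invented for this example and are not Intel's real ones, and the actual microcode logic is unpublished.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the kind of dispatch rep movs microcode must do:
 * pick a copy-chunk width from the overlap distance, alignment, and count.
 * Thresholds are made up for illustration. */
static size_t pick_chunk_size(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    size_t dist = d > s ? d - s : s - d;

    if (n <= 12)              /* "short string": wide copies not worth it */
        return 1;
    if (dist < 16)            /* close overlap: must match byte-at-a-time movs */
        return 1;
    if (((d | s) & 15) == 0)  /* both 16-byte aligned: no cache-line splits */
        return 16;
    return 8;                 /* otherwise fall back to 8-byte chunks */
}
```

Every one of those checks is a conditional branch, which is exactly why the microcode-branch-prediction issue below matters for startup cost.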



According to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, conditional branches in microcode aren't subject to branch-prediction. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same rep movs with the same alignment and size.



He supervised the initial rep string implementation in P6, so he should know. :)





REP MOVS uses a cache protocol feature that is not available to
regular code. Basically like SSE streaming stores, but in a manner
that is compatible with normal memory ordering rules, etc. // The
"large overhead for choosing and setting up the right method" is
mainly due to the lack of microcode branch prediction. I have long
wished that I had implemented REP MOVS using a hardware state machine
rather than microcode, which could have completely eliminated the
overhead.



By the way, I have long said that one of the things that hardware can do
better/faster than software is complex multiway branches.



Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996,
which I supervised. The P6 fast strings took REP MOVSB and larger, and
implemented them with 64 bit microcode loads and stores and a no-RFO
cache protocol. They did not violate memory ordering, unlike ERMSB in
iVB.



The big weakness of doing fast strings in microcode was (a) microcode
branch mispredictions, and (b) the microcode fell out of tune with
every generation, getting slower and slower until somebody got around
to fixing it. Just like a library memcpy falls out of tune. I
suppose that it is possible that one of the missed opportunities was
to use 128-bit loads and stores when they became available, and so on.



In retrospect, I should have written a self-tuning infrastructure, to
get reasonably good microcode on every generation. But that would not
have helped use new, wider, loads and stores, when they became
available. // The Linux kernel seems to have such an autotuning
infrastructure, that is run on boot. // Overall, however, I advocate
hardware state machines that can smoothly transition between modes,
without incurring branch mispredictions. // It is debatable whether
good microcode branch prediction would obviate this.




Based on this, my best guess at a specific answer is: the fast-path through the microcode (as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.



Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. Fortunately, that's all we need to make good choices. Intel's manual, and http://agner.org/optimize/, have good info on how to use rep movs.







Fun fact: without ERMSB (i.e. before IvB): rep movsb is optimized for small-ish copies. It takes longer to start up than rep movsd or rep movsq for large (more than a couple hundred bytes, I think) copies, and even after that may not achieve the same throughput.



The optimal sequence for large aligned copies without ERMSB and without SSE/AVX (e.g. in kernel code) may be rep movsq and then clean-up with something like an unaligned mov that copies the last 8 bytes of the buffer, possibly overlapping with the last aligned chunk of what rep movsq did. (basically use glibc's small-copy memcpy strategy). But if the size might be smaller than 8 bytes, you need to branch unless it's safe to copy more bytes than needed. Or rep movsb is an option for cleanup if small code-size matters more than performance. (rep will copy 0 bytes if RCX = 0).
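A pure-C sketch of that overlapping-tail trick, with memcpy standing in for the rep movsq bulk copy (compilers lower the fixed-size 8-byte memcpy calls to single unaligned mov loads/stores; in the kernel scenario above the bulk copy would actually be rep movsq):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n >= 8 bytes between non-overlapping buffers: bulk-copy the whole
 * 8-byte chunks (stand-in for rep movsq), then copy the final 8 bytes of
 * the buffer with one unaligned load/store pair that may overlap the end
 * of the bulk copy -- glibc's small-copy strategy. */
static void copy_with_overlapping_tail(void *dst, const void *src, size_t n)
{
    size_t bulk = n & ~(size_t)7;                 /* round n down to 8 */
    memcpy(dst, src, bulk);                       /* rep movsq would go here */

    uint64_t tail;                                /* last 8 bytes of src */
    memcpy(&tail, (const char *)src + n - 8, 8);  /* one unaligned load */
    memcpy((char *)dst + n - 8, &tail, 8);        /* one unaligned store */
}
```

Note the n >= 8 precondition: it is exactly the branch mentioned above that you still need if the size might be smaller than 8 bytes.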



A SIMD vector loop is often at least slightly faster than rep movsb even on CPUs with Enhanced Rep Move/Stos B. Especially if alignment isn't guaranteed. (Enhanced REP MOVSB for memcpy, and see also Intel's optimization manual. Links in the x86 tag wiki)

