Wednesday, 3 August 2016

performance - Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?




Background:



While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.



To my surprise, removing the un-necessary instruction caused my program to slow down.



I found that adding arbitrary, useless MOV instructions increased performance even further.



The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.




I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.



The data:



A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).



The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):



avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms



The programs were run 25 times in a loop, with the run order changing randomly each time.



Excerpt:



{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin

// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d

ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d

// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//

// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}

// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b

mov r13d, r9d // Set aside a copy of b
and r9d, r8d

mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d

and r12d, r13d { c and b }
xor r12d, r9d


// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}

// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d

MOV T2, r12d

end
end;


Try it yourself:



The code is online at GitHub if you want to try it out yourself.




My questions:




  • Why would uselessly copying a register's contents to RAM ever increase performance?

  • Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?

  • Is this behavior something that could be exploited predictably by a compiler?


Answer



The most likely cause of the speed improvement is that:





  • inserting a MOV shifts the subsequent instructions to different memory addresses

  • one of those moved instructions was an important conditional branch

  • that branch was being incorrectly predicted due to aliasing in the branch prediction table

  • moving the branch eliminated the alias and allowed the branch to be predicted correctly



Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.



This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end-up with aliasing which causes many mispredicted branches (which stalls the instruction pipeline and slowing your program).




If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643



Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...