Sunday, 3 July 2016

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction, and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++



The following is not a real-world test case: this primality-testing algorithm is not sensible. I suspect no real-world algorithm would execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11:



#include <cmath>

using nt = unsigned long long;

bool is_prime_float(nt num)
{
    for (nt n = 2; n <= sqrt(num); ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}


Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).



    jmp       .L20
    .p2align 4,,10
.L37:
    pxor      %xmm0, %xmm0       // convert n (in %rcx) to double
    cvtsi2sdq %rcx, %xmm0
    ucomisd   %xmm0, %xmm6       // compare against sqrt(num)
    jb        .L36               // Exit the loop
.L20:
    xorl      %edx, %edx
    movq      %rbx, %rax         // num is kept in %rbx
    divq      %rcx               // remainder of num/n left in %rdx
    testq     %rdx, %rdx
    je        .L30               // Failed divisibility test
    addq      $1, %rcx
    jns       .L37
    // Further code to deal with case when ucomisd can't be used


I time this using std::chrono::steady_clock. I kept getting weird performance changes from just adding or deleting other code, and eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16-byte boundary, but uses at most 10 bytes of padding to do so, I guess to balance alignment against code size.
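For reference, a minimal sketch of the kind of harness I mean (the exact harness isn't shown here, and the value of num below is a placeholder, not the actual prime of size about 2**50 that I tested):

#include <chrono>
#include <cmath>
#include <iostream>

using nt = unsigned long long;

bool is_prime_float(nt num)   // the function under test, as above
{
    for (nt n = 2; n <= sqrt(num); ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}

int main()
{
    // Placeholder value: substitute an actual prime of size about 2**50.
    const nt num = 1125899906842679ULL;
    const auto start = std::chrono::steady_clock::now();
    const bool result = is_prime_float(num);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    std::cout << num << (result ? " is prime" : " is not prime")
              << ", " << elapsed.count() << " secs\n";
}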



I wrote a Python script to replace .p2align 4,,10 with a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, with time in seconds on the y-axis and the number of bytes of padding on the x-axis:



[Scatter plot: run time (seconds) vs. bytes of padding]
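The script itself isn't shown here; to make what it does concrete, a hypothetical C++ equivalent (I used Python, and the file-name/command-line conventions below are invented for illustration) would be:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Replace the ".p2align 4,,10" directive with num_nops one-byte "nop"
// instructions, so the amount of padding can be controlled exactly.
int main(int argc, char** argv)
{
    if (argc != 4) {
        std::cerr << "usage: pad <in.s> <out.s> <num_nops>\n";
        return EXIT_FAILURE;
    }
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    const int num_nops = std::atoi(argv[3]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(".p2align 4,,10") != std::string::npos) {
            for (int i = 0; i < num_nops; ++i)
                out << "\tnop\n";    // each nop assembles to one byte
        } else {
            out << line << '\n';
        }
    }
}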




From objdump, with no padding the pxor instruction occurs at offset 0x402f5f. Running on a laptop with an i5-3210M (an Ivy Bridge CPU, so SnB-family) with turbo boost disabled, I found that




  • For 0 bytes of padding: slow performance (0.42 secs).
  • For 1-4 bytes of padding (pxor at offset 0x402f60 to 0x402f63): slightly better performance (0.41 secs, visible on the plot).
  • For 5-20 bytes of padding (offset 0x402f64 to 0x402f73): fast performance (0.37 secs).
  • For 21-32 bytes of padding (offset 0x402f74 to 0x402f7f): slow performance (0.42 secs).
  • The pattern then repeats with a 32-byte period.




So a 16-byte alignment doesn't give the best performance: it puts us in the slightly-better (or, judging from the scatter plot, just lower-variance) region. An alignment of 32 plus 4 to 19 bytes, i.e. pxor at an address congruent to 4 through 19 mod 32, gives the best performance.




Why am I seeing this performance difference? Why does it seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)?




I don't see any branch-prediction problems. Could this be a uop-cache quirk?



By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer, making the loop purely integer-based, I remove the problem: alignment now makes no difference at all.
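A sketch of that integer-only variant (my assumed form of the refactoring, which isn't shown above; is_prime_int is a name I've made up):

#include <cmath>

using nt = unsigned long long;

bool is_prime_int(nt num)
{
    // Cache sqrt(num) once as an integer: the hot loop then contains no
    // floating-point conversion or comparison at all. (To be safe near
    // perfect squares one might instead loop while n*n <= num, but for a
    // prime num the truncation here is harmless.)
    const nt limit = static_cast<nt>(sqrt(static_cast<double>(num)));
    for (nt n = 2; n <= limit; ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}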
