Sunday, 3 July 2016

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction, and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++



The following is not a real-world test case: this primality-testing algorithm is not sensible. I suspect no real-world algorithm would execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11:



#include <cmath>

using nt = unsigned long long;

bool is_prime_float(nt num)
{
    for (nt n = 2; n <= sqrt(num); ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}


Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).



    jmp       .L20
    .p2align 4,,10
.L37:
    pxor      %xmm0, %xmm0       // convert n (in %rcx) to double
    cvtsi2sdq %rcx, %xmm0
    ucomisd   %xmm0, %xmm6       // compare against sqrt(num)
    jb        .L36               // Exit the loop
.L20:
    xorl      %edx, %edx
    movq      %rbx, %rax         // num is kept in %rbx
    divq      %rcx               // remainder of num/n left in %rdx
    testq     %rdx, %rdx
    je        .L30               // Failed divisibility test
    addq      $1, %rcx
    jns       .L37
    // Further code to deal with case when ucomisd can't be used


I time this using std::chrono::steady_clock. I kept getting weird performance changes from just adding or deleting other code, and eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16-byte boundary, but uses at most 10 bytes of padding to do so, I guess to balance alignment against code size.
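For reference, a minimal sketch of the kind of harness I mean (the exact harness isn't shown here, and the value of num below is a placeholder, not the actual prime of size about 2**50 that I tested):

#include <chrono>
#include <cmath>
#include <iostream>

using nt = unsigned long long;

bool is_prime_float(nt num)   // the function under test, as above
{
    for (nt n = 2; n <= sqrt(num); ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}

int main()
{
    // Placeholder value: substitute an actual prime of size about 2**50.
    const nt num = 1125899906842679ULL;
    const auto start = std::chrono::steady_clock::now();
    const bool result = is_prime_float(num);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    std::cout << num << (result ? " is prime" : " is not prime")
              << ", " << elapsed.count() << " secs\n";
}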



I wrote a Python script to replace .p2align 4,,10 with a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, with time in seconds on the y-axis and the number of bytes of padding on the x-axis:



[Scatter plot: run time (seconds) vs. bytes of padding]
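The script itself isn't shown here; to make what it does concrete, a hypothetical C++ equivalent (I used Python, and the file-name/command-line conventions below are invented for illustration) would be:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Replace the ".p2align 4,,10" directive with num_nops one-byte "nop"
// instructions, so the amount of padding can be controlled exactly.
int main(int argc, char** argv)
{
    if (argc != 4) {
        std::cerr << "usage: pad <in.s> <out.s> <num_nops>\n";
        return EXIT_FAILURE;
    }
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    const int num_nops = std::atoi(argv[3]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(".p2align 4,,10") != std::string::npos) {
            for (int i = 0; i < num_nops; ++i)
                out << "\tnop\n";    // each nop assembles to one byte
        } else {
            out << line << '\n';
        }
    }
}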




From objdump, with no padding the pxor instruction occurs at offset 0x402f5f. Running on a laptop with an i5-3210M (an Ivy Bridge CPU, so SnB-family) with turbo boost disabled, I found that




  • For 0 bytes of padding: slow performance (0.42 secs).
  • For 1-4 bytes of padding (pxor at offset 0x402f60 to 0x402f63): slightly better performance (0.41 secs, visible on the plot).
  • For 5-20 bytes of padding (offset 0x402f64 to 0x402f73): fast performance (0.37 secs).
  • For 21-32 bytes of padding (offset 0x402f74 to 0x402f7f): slow performance (0.42 secs).
  • The pattern then repeats with a 32-byte period.




So a 16-byte alignment doesn't give the best performance: it puts us in the slightly-better (or, judging from the scatter plot, just lower-variance) region. An alignment of 32 plus 4 to 19 bytes, i.e. pxor at an address congruent to 4 through 19 mod 32, gives the best performance.




Why am I seeing this performance difference? Why does it seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)?




I don't see any branch-prediction problems. Could this be a uop-cache quirk?



By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer, making the loop purely integer-based, I remove the problem: alignment now makes no difference at all.
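A sketch of that integer-only variant (my assumed form of the refactoring, which isn't shown above; is_prime_int is a name I've made up):

#include <cmath>

using nt = unsigned long long;

bool is_prime_int(nt num)
{
    // Cache sqrt(num) once as an integer: the hot loop then contains no
    // floating-point conversion or comparison at all. (To be safe near
    // perfect squares one might instead loop while n*n <= num, but for a
    // prime num the truncation here is harmless.)
    const nt limit = static_cast<nt>(sqrt(static_cast<double>(num)));
    for (nt n = 2; n <= limit; ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}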
