Saturday, 20 May 2017

performance - Slowdown when accessing data at page boundaries?




(My question is about computer architecture and performance. I did not find a more relevant forum, so I am posting it here as a general question.)



I have a C program that accesses memory words located X bytes apart in the virtual address space. For instance: for (int i = 0; ; i += X) { array[i] = 4; }
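To make the experiment concrete, here is a minimal sketch of such a strided-write benchmark. It is not the original program: the function name `time_strided_writes`, the byte buffer, the `passes` parameter, and the use of `clock()` (CPU time) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original post): writes one byte every
 * `stride` bytes across a buffer of `len` bytes, repeating the sweep
 * `passes` times. Returns the CPU time spent in milliseconds, or -1.0 on
 * invalid arguments or allocation failure. */
double time_strided_writes(size_t len, size_t stride, unsigned passes)
{
    if (len == 0 || stride == 0)
        return -1.0;

    char *raw = malloc(len);
    if (raw == NULL)
        return -1.0;
    memset(raw, 0, len);            /* touch every page before timing starts */

    volatile char *array = raw;     /* volatile keeps the stores from being optimized away */
    clock_t start = clock();
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < len; i += stride)
            array[i] = 4;
    clock_t stop = clock();

    free(raw);
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling, say, `time_strided_writes(64u << 20, 4096, 10)` and then repeating with strides 4095 and 4097 would give the kind of comparison described here. Because the number of writes depends on the stride, dividing the result by the write count gives a fairer per-access figure.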



I measure the execution time for varying values of X. Interestingly, when X is a power of 2 and is around the page size, e.g., X = 1024, 2048, 4096, 8192, ..., I get a huge performance slowdown. But for all other values of X, like 1023 and 1025, there is no slowdown. The performance results are attached in the figure below.



The X-axis is the value of X and the Y-axis is the execution time in milliseconds.



I tested my program on several personal machines, all running x86_64 Linux on Intel CPUs.




What could be the cause of this slowdown? We have considered the DRAM row buffer, the L3 cache, etc., but none of these seem to explain it...



Update (July 11)



We ran a small test by adding NOP instructions to the original code, and the slowdown is still there. This more or less rules out 4k aliasing. Conflict cache misses are the more likely cause here.


Answer



There are 2 things here: conflict misses from set-associative cache aliasing, and 4k aliasing.






Either or both of these effects could be a factor in the question "Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?"
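The 4k-aliasing effect mentioned in the question's update can be modelled roughly as a comparison of the low 12 address bits. The sketch below is a simplified model, not a description of the actual hardware logic: it assumes the check looks only at bits 11:0, and the helper name `may_4k_alias` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when a load and an earlier store go to
 * different addresses that share the same low 12 bits, i.e. they sit at the
 * same offset within different 4 KiB-aligned regions. In this simplified
 * model such a load may be falsely treated as depending on the store. */
bool may_4k_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & 0xFFFu) == 0;
}
```

Under this model, addresses 4096 bytes apart alias, while addresses 4095 or 4097 bytes apart do not.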






Another possible factor is that HW prefetching stops at physical page boundaries (see "Why does the speed of memcpy() drop dramatically every 4KB?"). But changing a stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in IvyBridge and later is only TLB prefetching, not data from the next page.






I kind of assumed x86 for most of this answer, but the cache aliasing / conflict-miss stuff applies generally. Set-associative caches with simple indexing are universally used for L1d caches. (Or on older CPUs, direct-mapped, where each "set" has only 1 member.) The 4k aliasing stuff might be mostly Intel-specific.
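To illustrate the conflict-miss point, here is a small sketch of simple set indexing. The geometry is an assumption (a 32 KiB, 8-way L1d with 64-byte lines, hence 64 sets, which is common on Intel CPUs but not guaranteed for any given machine), and the function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Simple indexing: the set is chosen by the address bits just above the
 * line offset (bits 6..11 with this geometry). */
static unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Hypothetical helper: counts how many distinct sets are used by `count`
 * accesses spaced `stride` bytes apart, starting at address 0. */
unsigned distinct_sets_touched(size_t stride, size_t count)
{
    unsigned char seen[NUM_SETS] = {0};
    unsigned distinct = 0;

    for (size_t n = 0; n < count; n++) {
        unsigned set = cache_set_index((uintptr_t)(n * stride));
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With this assumed geometry, 4096 accesses at a stride of 4096 all land in a single set and a stride of 1024 uses only 4 sets, whereas a stride of 1023 spreads over all 64. With only 8 ways per set, the power-of-2 strides keep evicting each other's lines.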




Prefetching across virtual page boundaries is likely also a general problem.

