Friday, 28 October 2016

assembly - Cost of polymorphism



I am looking at the below virtual method call in x86-64:



mov     rcx, qword ptr [x]   
mov rax, qword ptr [rcx]
call qword ptr [rax+8]


and also Agner Fog's latency tables:




http://www.agner.org/optimize/instruction_tables.pdf



As I am using an Ivy Bridge CPU I am looking at page 175.




  1. Am I right in that the first two MOV instructions both only take 2 (they are both move memory to register) CPU cycles? I thought a call to a virtual method was slower than this?


  2. In the instruction latency table page 178 it says the latency of this call is 2 CPU cycles (I think?). What does CALL 'near' mean, as opposed to CALL 'r' (register) and CALL 'm' (memory)?


  3. So the above ASM does take 6 CPU cycles according to the Fog booklet, I haven't misinterpretted anything?





EDIT: I changed the virtual function call to be the second in the vtable.


Answer




Am I right in that the first two MOV instructions both only take 2 (they are both move memory to register) CPU cycles? I thought a call to a virtual method was slower than this?
In the instruction latency table page 178 it says the latency of this call is 2 CPU cycles (I think?).




No, 2 CPU cycles in only the minimal latency.




Let's check the Agner's tables http://www.agner.org/optimize/instruction_tables.pdf




Integer instructions.



Instruction Operands uops fused domain uops unfused domain (p015 p0 p1 p5 p23 p4) Latency Reciprocal throughput Comments



Inst   Oper        fus p23 p4  Latency Rec.
MOV r32/64,m32/64 1 1 2 0.5




To find time, when instruction will produce their results, you should use "Latency" column. And the latency is 2 cycles for each mov, and listed only minimal value (check text in "Explanation of column headings" - "Latency - This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, ...may increase the clock counts considerably.")



If you have lot of different polymorphic calls, the memory needed for them may be not cached. We know cache and memory latencies from different reviews, and all were measured via long chain of dependent MOVs like mov eax, [eax]; mov eax, [eax]; mov eax, [eax]; .... Values for Ivy are: hit in L1 = 4 cycles, hit in L2 = 11 cycles, hit in L3 = 30-40 cycles, miss in cache and access memory = 32cycles + 60 ns (at 3 GHz with 3 cycles per ns > 200 cycles). There are even no easy case to get 2 cycle latency (what is closer to ALU than L1? Only 72-entry load buffer for reordered loads?), and there will be no chance to have 2 cycle latency on the second mov (its operand is result of first mov, so there is nothing to execute out-of-order before first mov retirement).



In the tables http://instlatx64.atw.hu/ linked from Agner's Links there is report for Ivy InstLatX64 for Intel Core i7-3770K, 3700 MHz made with aida_bench64.dll




27 AMD64 :MOV r64, [m64] L: 1.14ns= 4.0c T: 0.14ns= 0.50c





And this table shows real latency (L) for hit in L1 cache, 4 cycles.



Same data (4c for L1, ~12c for L2, 26-31c for L3) in 64-ia-32-architectures-optimization-manual.pdf page 46 section "2.2.5.1 Load and Store Operation Overview", Table "2-10 Lookup Order and Load Latency"




So the above ASM does take 6 CPU cycles according to the Fog booklet, I haven't misinterpretted anything?





In the best case, when first load was executed early with Out-of-order = 2 cycles on critical path; second load hit in L1 = 4 cycles on critical path; 2 cycles for call execution; BTB (branch target prediction/indirect branch target) was successful, which is more probable when from single address of call you always jumps to the same target (or to small number of targets with periodic patterns) -- you will have 8 cycles to confirm that branch was predicted right, which may be partially hidden by OoO execution of target function.



If any load misses in L1/L2, you should add corresponding cache latency. If L3 misses, add 200 cycles.



If BTB misses, you will have at least 15 cycle penalty (check Agner's microarchitecture.pdf, page 27 "3.7 Branch prediction in Intel Sandy Bridge and Ivy Brindge; Misprediction penalty") - for cached uops; more for target in L1i. You may read about older BTB in the same microarchitecture.pdf page 25 "3.5 Branch prediction in PM and Core2; Pattern recognition for indirect jumps and calls" and "BTB organization .. for indirect jumps and indirect calls".



The very useful document is from Intel: "Intel® 64 and IA-32 Architectures
Optimization Reference Manual" 64-ia-32-architectures-optimization-manual.pdf. It has both tuning suggestions and information about performance counters, which will help you to get real latencies and miss rates for your code (check B.6.3.2 section "Virtual Tables and Indirect Calls").


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...