Sunday, 21 August 2016

c - Can x86's MOV really be "free"? Why can't I reproduce this at all?



I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.




For the life of me, I can't verify this in a single test case. Every test case I try debunks it.



For example, here's the code I'm compiling with Visual C++:



#include 
#include
#include

int main(void)
{

unsigned int k, l, j;
clock_t tstart = clock();
for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
{
++k;
k = j; // <-- comment out this line to remove the MOV instruction
l += j;
}
fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
fflush(stderr);

return (int)(k + j + l);
}


This produces the following assembly code for the loop (feel free to produce this however you want; you obviously don't need Visual C++):



LOOP:
add edi,esi
mov ebx,esi
inc esi

cmp esi,FFFFFFFFh
jc LOOP


Now I run this program several times, and I observe a pretty consistent 2% difference when the MOV instruction is removed:



Without MOV      With MOV
1303 ms 1358 ms
1324 ms 1363 ms
1310 ms 1345 ms

1304 ms 1343 ms
1309 ms 1334 ms
1312 ms 1336 ms
1320 ms 1311 ms
1302 ms 1350 ms
1319 ms 1339 ms
1324 ms 1338 ms


So what gives? Why isn't the MOV "free"? Is this loop too complicated for x86?
Is there a single example out there that can demonstrate MOV being free like people claim?
If so, what is it? And if not, why does everyone keep claiming MOV is free?



Answer



The throughput of the loop in the question doesn't depend on the latency of MOV, or (on Haswell) the benefit of not using an execution unit.



The loop is still only 4 uops for the front-end to issue into the out-of-order core. (mov still has to be tracked by the out-of-order core even if it doesn't need an execution unit, but cmp/jc macro-fuses into a single uop).



Intel CPUs since Core2 have had an issue width of 4 uops per clock, so the mov doesn't stop it from executing at (close to) one iter per clock on Haswell. It would also run at one per clock on Ivybridge (with mov-elimination), but not on Sandybridge (no mov-elimination). On SnB, it would be about one iter per 1.333c cycles, bottlenecked on ALU throughput because the mov would always need one. (SnB/IvB have only three ALU ports, while Haswell has four).



Note that special handling in the rename stage has been a thing for x87 FXCHG (swap st0 with st1) for much longer than MOV. Agner Fog lists FXCHG as 0 latency on PPro/PII/PIII (first-gen P6 core).







The loop in the question has two interlocking dependency chains (the add edi,esi depends on EDI and on the loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown vs. theoretical prediction because of seemingly-unrelated instructions isn't unusual, and small variations in order of instructions can make this kind of difference. To run at exactly 1c per iter, every cycle needs to run an INC and an ADD. Since all the INCs and ADDs are dependent on the previous iteration, out-of-order execution can't catch up by running two in a single cycle. Even worse, the ADD depends on the INC in the previous cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain also stalls the ADD dep chain.



Also, predicted-taken branches can only run on port6, so any cycle where port6 doesn't executed a cmp/jc is a cycle of lost throughput. This happens every time an INC or ADD steals a cycle on port6 instead of running on ports 0, 1, or 5. IDK if this is the culprit, or if losing cycles in the INC/ADD dep chains themselves is the problem, or maybe some of both.



Adding the extra MOV doesn't add any execution-port pressure, assuming it's eliminated 100%, but it does stop the front-end from running ahead of the execution core. (Only 3 of the 4 uops in the loop need an execution unit, and your Haswell CPU can run INC and ADD on any of its 4 ALU ports: 0, 1, 5, and 6. So the bottlenecks are:




  • the front-end max throughput of 4 uops per clock. (The loop without MOV is only 3 uops, so the front-end can run ahead).

  • taken-branch throughput of one per clock.


  • the dependency chain involving esi (INC latency of 1 per clock)

  • the dependency chain involving edi (ADD latency of 1 per clock, and also dependent on the INC from the previous iteration)



Without the MOV, the front-end can issue the loop's three uops at 4 per clock until the out-of-order core is full. (AFAICT, it "unrolls" tiny loops in the loop-buffer (Loop Stream Detector: LSD), so a loop with ABC uops can issue in an ABCA BCAB CABC ... pattern. The perf counter for lsd.cycles_4_uops confirms that it mostly issues in groups of 4 when it issues any uops.)



Intel CPUs assign uops to ports as they issue into the out-of-order core. The decision is based on counters that track how many uops for each port are already in the scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting to execute, this works well and should usually avoid scheduling INC or ADD to port6. And I guess also avoids scheduling the INC and ADD such that time is lost from either of those dep chains. But if the RS is empty or nearly-empty, the counters won't stop an ADD or INC from stealing a cycle on port6.



I thought I was onto something here, but any sub-optimal scheduling should let the front-end catch up and keep the back-end full. I don't think we should expect the front-end to cause enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny loop should run from the loop buffer at a very consistent 4 per clock throughput. Maybe there's something else going on.









I used lea to construct a loop that only has one mov per clock, creating a perfect demonstration where MOV-elimination succeeds 100%, or 0% of the time with mov same,same to demonstrate the latency bottleneck that produces.



Since the macro-fused dec/jnz is part of the dependency chain involving the loop counter, imperfect scheduling can't delay it. This is different from the case where cmp/jc "forks off" from the critical-path dependency chain every iteration.



_start:
mov ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters

align 16 ; really align 32 makes more sense in case the uop-cache comes into play, but alignment is actually irrelevant for loops that fit in the loop buffer.
.loop:
mov eax, ecx
lea ecx, [rax-1] ; we vary these two instructions

dec ecx ; dec/jnz macro-fuses into one uop in the decoders, on Intel
jnz .loop

.end:
xor edi,edi ; edi=0

mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)


On Intel SnB-family, LEA with one or two components in the addressing mode runs with 1c latency (See http://agner.org/optimize/, and other links in the tag wiki).



I built and ran this as a static binary on Linux, so user-space perf-counters for the whole process are measuring just the loop with negligible startup / shutdown overhead. (perf stat is really easy compared to putting perf-counter queries into the program itself)



$ yasm -felf64 -Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination mov-elimination.o &&
objdump -Mintel -drwC mov-elimination &&

taskset -c 1 ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread -r2 ./mov-elimination

Disassembly of section .text:

00000000004000b0 <_start>:
4000b0: b9 00 94 35 77 mov ecx,0x77359400
4000b5: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000c0 <_start.loop>:
4000c0: 89 c8 mov eax,ecx

4000c2: 8d 48 ff lea ecx,[rax-0x1]
4000c5: ff c9 dec ecx
4000c7: 75 f7 jne 4000c0 <_start.loop>

00000000004000c9 <_start.end>:
4000c9: 31 ff xor edi,edi
4000cb: b8 e7 00 00 00 mov eax,0xe7
4000d0: 0f 05 syscall

perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination


Performance counter stats for './mov-elimination' (2 runs):

513.242841 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.05% )
0 context-switches:u # 0.000 K/sec
1 page-faults:u # 0.002 K/sec
2,000,111,934 cycles:u # 3.897 GHz ( +- 0.00% )
4,000,000,161 instructions:u # 2.00 insn per cycle ( +- 0.00% )
1,000,000,157 branches:u # 1948.396 M/sec ( +- 0.00% )
3,000,058,589 uops_issued_any:u # 5845.300 M/sec ( +- 0.00% )

2,000,037,900 uops_executed_thread:u # 3896.865 M/sec ( +- 0.00% )

0.513402352 seconds time elapsed ( +- 0.05% )


As expected, the loop runs 1G times (branches ~= 1 billion). The "extra" 111k cycles beyond 2G is overhead that's present in the other tests, too, including the one with no mov. It's not from occasional failure of mov-elimination, but it does scale with the iteration count so it's not just startup overhead. It's probably from timer interrupts, since IIRC Linux perf doesn't mess around with perf-counters while handling interrupts, and just lets them keep counting. (perf virtualizes the hardware performance counters so you can get per-process counts even when a thread migrates across CPUs.) Also, timer interrupts on the sibling logical core that shares the same physical core will perturb things a bit.



The bottleneck is the loop-carried dependency chain involving the loop counter. 2G cycles for 1G iters is 2 clocks per iteration, or 1 clock per decrement. The confirms that the length of the dep chain is 2 cycles. This is only possible if mov has zero latency. (I know it doesn't prove that there isn't some other bottleneck. It really only proves that the latency is at most 2 cycles, if you don't believe my assertion that latency is the only bottleneck. There's a resource_stalls.any perf counter, but it doesn't have many options for breaking down which microarchitectural resource was exhausted.)



The loop has 3 fused-domain uops: mov, lea, and macro-fused dec/jnz. The 3G uops_issued.any count confirms that: It counts in the fused domain, which is all of the pipeline from decoders to retirement, except for the scheduler (RS) and execution units. (macro-fused instruction-pairs stay as single uop everywhere. It's only for micro-fusion of stores or ALU+load that 1 fused-domain uop in the ROB tracks the progress of two unfused-domain uops.)




2G uops_executed.thread (unfused-domain) tells us that all the mov uops were eliminated (i.e. handled by the issue/rename stage, and placed in the ROB in an already-executed state). They still take up issue/retire bandwidth, and space in the uop cache, and code-size. They take up space in the ROB, limiting out-of-order window size. A mov instruction is never free. There are many possible microarchitectural bottlenecks besides latency and execution ports, the most important usually being the 4-wide issue rate of the front-end.



On Intel CPUs, being zero latency is often a bigger deal than not needing an execution unit, especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can handle vector uops, so non-eliminated vector moves would be a bottleneck more easily, especially in code without many loads or stores taking front-end bandwidth (4 fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution units isn't perfect (more like oldest-ready first), so uops that aren't on the critical path can steal cycles from the critical path.)



If we put a nop or an xor edx,edx into the loop, those would also issue but not execute on Intel SnB-family CPUs.



Zero-latency mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64. (movzx eax, bl is eliminated, movzx eax, bx isn't).







Without mov-elimination



All current CPUs that support mov-elimination don't support it for mov same,same, so pick different registers for zero-extending integers from 32 to 64-bit, or vmovdqa xmm,xmm to zero-extend to YMM in a rare case where that's necessary. (Unless you need the result in the register it's already in. Bouncing to a different reg and back is normally worse.) And on Intel, the same applies for movzx eax,al for example. (AMD Ryzen doesn't mov-eliminated movzx.) Agner Fog's instruction tables show mov as always being eliminated on Ryzen, but I guess he means that it can't fail between two different regs the way it can on Intel.



We can use this limitation to create a micro-benchmark that defeats it on purpose.



mov ecx, ecx      # CPUs can't eliminate  mov same,same
lea ecx, [rcx-1]


dec ecx
jnz .loop

3,000,320,972 cycles:u # 3.898 GHz ( +- 0.00% )
4,000,000,238 instructions:u # 1.33 insn per cycle ( +- 0.00% )
1,000,000,234 branches:u # 1299.225 M/sec ( +- 0.00% )
3,000,084,446 uops_issued_any:u # 3897.783 M/sec ( +- 0.00% )
3,000,058,661 uops_executed_thread:u # 3897.750 M/sec ( +- 0.00% )



This takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3 cycles.



The fused-domain uop count didn't change, still 3G.



What did change is that now the unfused-domain uop count is the same as fused-domain. All the uops needed an execution unit; none of the mov instructions were eliminated, so they all added 1c latency to the loop-carried dep chain.



(When there are micro-fused uops, like add eax, [rsi], the uops_executed count can be higher than uops_issued. But we don't have that.)







Without the mov at all:



lea ecx, [rcx-1]

dec ecx
jnz .loop


2,000,131,323 cycles:u # 3.896 GHz ( +- 0.00% )
3,000,000,161 instructions:u # 1.50 insn per cycle

1,000,000,157 branches:u # 1947.876 M/sec
2,000,055,428 uops_issued_any:u # 3895.859 M/sec ( +- 0.00% )
2,000,039,061 uops_executed_thread:u # 3895.828 M/sec ( +- 0.00% )


Now we're back down to 2 cycle latency for the loop-carried dep chain.



Nothing is eliminated.







I tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to within 40k out of 1G counts) for all the perf events. That's about the same margin of error as re-running on the same system.



Note that if I ran perf as root1, and counted cycles instead of cycles:u (user-space only), it measure the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the bios-settings for max turbo right after reboot, but then drops to 3.9GHz if I leave it idle for a couple minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel 4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing balance_performance to each of /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference from /etc/rc.local fixes it, but writing balance_power makes it drop back to 3.9GHz again later.)



1: update: as a better alternative to running sudo perf, I set sysctl kernel.perf_event_paranoid = 0 in /etc/syctl.d/99-local.conf






You should get the same results on AMD Ryzen, since it can eliminate integer mov. AMD Bulldozer-family can only eliminate xmm register copies. (According to Agner Fog, ymm register copies are an eliminated low-half and an ALU op for the high half.)




For example, AMD Bulldozer and Intel Ivybridge can sustain a throughput of 1 per clock for



 movaps  xmm0, xmm1
movaps xmm2, xmm3
movaps xmm4, xmm5
dec
jnz .loop



But Intel Sandybridge can't eliminate moves, so it would bottleneck on 4 ALU uops for 3 execution ports. If it was pxor xmm0,xmm0 instead of movaps, SnB could also sustain one iteration per clock. (But Bulldozer-family couldn't, because xor-zeroing still needs an execution unit on AMD, even though is independent of the old value of the register. And Bulldozer-family only has 0.5c throughput for PXOR.)






Limitations of mov-elimination



Two dependent MOV instructions in a row exposes a difference between Haswell and Skylake.



.loop:
mov eax, ecx

mov ecx, eax

sub ecx, 2
jnz .loop


Haswell: minor run-to-run variability (1.746 to 1.749 c / iter), but this is typical:



 1,749,102,925      cycles:u                  #    2.690 GHz                    
4,000,000,212 instructions:u # 2.29 insn per cycle

1,000,000,208 branches:u # 1538.062 M/sec
3,000,079,561 uops_issued_any:u # 4614.308 M/sec
1,746,698,502 uops_executed_core:u # 2686.531 M/sec
745,676,067 lsd_cycles_4_uops:u # 1146.896 M/sec


Not all the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution port. Every MOV that executes instead of being eliminated adds 1c of latency to the loop-carried dep chain, so it's not a coincidence that uops_executed and cycles are very similar. All the uops are part of a single dependency chain, so there's no parallelism possible. cycles is always about 5M higher than uops_executed regardless of run-to-run variation, so I guess there are just 5M cycles being used up somewhere else.



Skylake: more stable than HSW results, and more mov-elimination: only 0.6666 MOVs of every 2 needed an execution unit.




 1,666,716,605      cycles:u                  #    3.897 GHz
4,000,000,136 instructions:u # 2.40 insn per cycle
1,000,000,132 branches:u # 2338.050 M/sec
3,000,059,008 uops_issued_any:u # 7014.288 M/sec
1,666,548,206 uops_executed_thread:u # 3896.473 M/sec
666,683,358 lsd_cycles_4_uops:u # 1558.739 M/sec


On Haswell, lsd.cycles_4_uops accounted for all of the uops. (0.745 * 4 ~= 3). So in almost every cycle where any uops are issued, a full group of 4 is issued (from the loop-buffer. I probably should have looked at a different counter that doesn't care where they came from, like uops_issued.stall_cycles to count cycles where no uops issued).




But on SKL, 0.66666 * 4 = 2.66664 is less than 3, so in some cycles the front-end issued fewer than 4 uops. (Usually it stalls until there's room in the out-of-order core to issue a full group of 4, instead of issuing non-full groups).



It's weird, IDK what the exact microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4 uops is more than a full iteration. So an issue group can contain up to 3 dependent MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more mov-elimination?



update: actually this is normal for 3-uop loops on Skylake. uops_issued.stall_cycles shows that HSW and SKL issue a simple 3 uop loop with no mov-elimination the same way they issue this one. So better mov-elimination is a side-effect of splitting up issue groups for some other reason. (It's not a bottleneck because taken branches can't execute faster than 1 per clock regardless of how fast they issue). I still don't know why SKL is different, but I don't think it's anything to worry about.






In a less extreme case, SKL and HSW are the same, with both failing to eliminate 0.3333 of every 2 MOV instructions:




.loop:
mov eax, ecx
dec eax
mov ecx, eax

sub ecx, 1
jnz .loop





 2,333,434,710      cycles:u                  #    3.897 GHz                    
5,000,000,185 instructions:u # 2.14 insn per cycle
1,000,000,181 branches:u # 1669.905 M/sec
4,000,061,152 uops_issued_any:u # 6679.720 M/sec
2,333,374,781 uops_executed_thread:u # 3896.513 M/sec
1,000,000,942 lsd_cycles_4_uops:u # 1669.906 M/sec


All of the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating both in some cycles, IDK why it can't always do that.







Intel's optimization manual says that overwriting the result of mov-elimination as early as possible frees up the microarchitectural resources so it can succeed more often, at least for movzx. See Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions.



So maybe it's tracked internally with a limited-size table of ref-counts? Something has to stop the physical register file entry from being freed when it's no longer needed as the value of the original architectural register, if it's still needed as the value of the mov destination. Freeing PRF entries as soon as possible is key, because PRF size can limit the out-of-order window to smaller than ROB size.



I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.



See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more research + guesswork about how mov-elimination works, and whether it could work for xchg eax, ecx. (In practice xchg reg,reg is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's interesting to guess whether Intel could have implemented it more efficiently.)







BTW, as a workaround for an erratum on Haswell, Linux doesn't provide uops_executed.thread when hyperthreading is enabled, only uops_executed.core. The other core was definitely idle the whole time, not even timer interrupts, because I took it offline with echo 0 > /sys/devices/system/cpu/cpu3/online. Unfortunately this can't be done before perf decides that HT is enabled, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't get perf to use all 8 hardware PMU counters at once on that system, only 4. :/


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...