Wednesday, 13 April 2016

c - Is Intel's timestamp reading asm code example using two more registers than are necessary?



I'm looking into measuring benchmark performance using the time-stamp register (TSR) found in x86 CPUs. It's a useful register, since it measures in a monotonic unit of time which is immune to the clock speed changing. Very cool.



Here is an Intel document showing asm snippets for reliably benchmarking using the TSR, including using cpuid for pipeline synchronisation. See page 16:



http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html



To read the start time, it says (I annotated a bit):



__asm volatile (
"cpuid\n\t" // writes e[abcd]x
"rdtsc\n\t" // writes edx, eax
"mov %%edx, %0\n\t"
"mov %%eax, %1\n\t"
//
:"=r" (cycles_high), "=r" (cycles_low) // outputs
: // inputs
:"%rax", "%rbx", "%rcx", "%rdx"); // clobber


I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the TSR value right out of edx and eax? Like this:



__asm volatile(
"cpuid\n\t"
"rdtsc\n\t"
//
: "=d" (cycles_high), "=a" (cycles_low) // outputs
: // inputs
: "%rbx", "%rcx"); // clobber



By doing this, you save two registers, reducing the likelihood of the C
compiler needing to spill.



Am I right? Or are those MOVs somehow strategic?



(I agree that you do need scratch registers to read the stop time, as
in that scenario the order of the instructions is reversed: you have
rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
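
For reference, a minimal sketch of that stop-time sequence, assuming x86-64, GNU C inline asm, and a CPU that supports rdtscp. The function name tsc_stop and the xor that selects cpuid leaf 0 are illustrative additions, not taken from the Intel paper:

```c
#include <stdint.h>

/* Illustrative helper (hypothetical name): read the TSC at the end of a
   timed region, then serialize with cpuid. */
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "rdtscp\n\t"             /* edx:eax = TSC, ecx = IA32_TSC_AUX */
        "mov %%edx, %0\n\t"      /* save the result before cpuid overwrites it */
        "mov %%eax, %1\n\t"
        "xor %%eax, %%eax\n\t"   /* assumption: ask cpuid for leaf 0 */
        "cpuid"                  /* serialize; overwrites eax, ebx, ecx, edx */
        : "=r"(hi), "=r"(lo)     /* outputs land in compiler-chosen registers */
        :                        /* no inputs */
        : "rax", "rbx", "rcx", "rdx");
    return ((uint64_t)hi << 32) | lo;
}
```

Here the movs do real work: the outputs must live somewhere cpuid does not touch, so "=a" and "=d" cannot be used.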




Thanks


Answer



You're correct, the example is clunky. Usually if mov is the first or last instruction in an inline-asm statement, you're doing it wrong, and should have used a constraint to tell the compiler where you want the input, or where the output is.
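
As a rough illustration of letting constraints do that work, here is a sketch of the start-time read with no mov instructions at all, assuming x86-64 and GCC or Clang. The name tsc_start and the explicit cpuid leaf 0 input are illustrative choices, not part of the Intel example:

```c
#include <stdint.h>

/* Illustrative helper (hypothetical name): serialize with cpuid, then
   read the TSC directly out of eax/edx via output constraints. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "cpuid\n\t"            /* serialize; overwrites eax, ebx, ecx, edx */
        "rdtsc"                /* edx:eax = time-stamp counter */
        : "=a"(lo), "=d"(hi)   /* results are already in eax and edx */
        : "a"(0)               /* assumption: cpuid leaf 0 as the input */
        : "rbx", "rcx");       /* cpuid also writes ebx and ecx */
    return ((uint64_t)hi << 32) | lo;
}
```

The "=a" and "=d" constraints tell the compiler where the results arrive, so it can copy them elsewhere (or not) as it sees fit; only rbx and rcx need to be listed as clobbers.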



See my GNU C inline asm guides / links collection, and other links in the tag wiki. (The tag wiki is full of good stuff for asm in general, too.)






Or for rdtsc specifically, see "Get CPU cycle count?" for the __rdtsc() intrinsic, and good inline asm in @Mysticial's answer.
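
A minimal sketch of the intrinsic route, assuming GCC or Clang on x86 (MSVC declares __rdtsc() in <intrin.h> instead). The helper names tsc_now and tsc_elapsed are hypothetical; note that __rdtsc() by itself does not serialize:

```c
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang header that declares __rdtsc() */

/* Hypothetical wrapper: current TSC value, with no serialization. */
static inline uint64_t tsc_now(void)
{
    return __rdtsc();
}

/* Hypothetical helper: TSC ticks elapsed since an earlier tsc_now(). */
static inline uint64_t tsc_elapsed(uint64_t start)
{
    return tsc_now() - start;
}
```

Typical use would be to call tsc_now() before the region of interest and tsc_elapsed() with that value afterwards; the compiler handles register allocation, so there are no constraints or clobbers to get wrong.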








"it measures in a monotonic unit of time which is immune to the clock speed changing."




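Yes, on CPUs made within the last 10 years or so, which have a constant (invariant) TSC that ticks at a fixed rate regardless of the current core clock frequency.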



For profiling, it's often more useful to have times in core clock cycles, not wall-clock time, so your microbenchmark results don't depend on power-saving / turbo. Performance counters can do this and much more.




Still, if real time is what you want, rdtsc is the lowest-overhead way to get it.






And re: discussion in comments: yes, cpuid is there to serialize, making sure that rdtsc and the instructions following it can't begin executing until after cpuid has completed. You could put another cpuid after rdtsc, but that would increase measurement overhead and, I think, give near-zero gain in accuracy / precision.



LFENCE is a cheaper alternative that's useful with RDTSC. Its entry in the instruction reference manual documents that it doesn't let later instructions start executing until it and all previous instructions have retired (from the ROB/RS in the out-of-order part of the core). See "Are loads and stores the only instructions that gets reordered?", and for a specific example of using it, see "clflush to invalidate cache line via C function". Unlike true serializing instructions like cpuid, it doesn't flush the store buffer.



(On recent AMD CPUs without Spectre mitigation enabled, lfence is not even partially serializing, and runs at 4 per clock according to Agner Fog's testing. See "Is LFENCE serializing on AMD processors?")




Margaret Bloom dug up this useful link, which also confirms that LFENCE serializes RDTSC according to Intel's SDM, and has some other stuff about how to do serialization around RDTSC.
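
A minimal sketch of the LFENCE approach described above, assuming GCC or Clang on x86-64 with the _mm_lfence() and __rdtsc() intrinsics from <x86intrin.h>. The name tsc_fenced is illustrative, and the AMD caveat above still applies:

```c
#include <stdint.h>
#include <x86intrin.h>   /* declares __rdtsc() and, via SSE2 headers, _mm_lfence() */

/* Illustrative helper (hypothetical name): TSC read fenced on both sides. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();              /* earlier instructions finish before rdtsc runs */
    uint64_t t = __rdtsc();
    _mm_lfence();              /* later instructions don't start before rdtsc */
    return t;
}
```

Compared with cpuid, this clobbers no general-purpose registers beyond the rdtsc result, so it needs no scratch registers at either the start or the stop of the timed region.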

