I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell microarchitecture where needed.)
To do this, I measure performance counter events with perf. There are some results I don't understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING
This holds for only a few measurements, but it is still odd. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
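A minimal sketch of how the affected measurements could be picked out, assuming each measurement is available as a pair of counter values. The function name `find_violations` and the numbers are hypothetical placeholders, not my actual perf output.

```python
def find_violations(samples):
    """Return the indices of measurements in which
    CYCLE_ACTIVITY.CYCLES_LDM_PENDING exceeds CPU_CLK_UNHALTED.THREAD_P.

    `samples` is a list of (unhalted_cycles, ldm_pending_cycles) pairs.
    """
    return [
        index
        for index, (unhalted_cycles, ldm_pending_cycles) in enumerate(samples)
        if ldm_pending_cycles > unhalted_cycles
    ]


# Placeholder numbers for illustration only (not real measurements).
samples = [(1000, 400), (1000, 1003), (2000, 1999)]
print(find_violations(samples))  # [1]
```

Only the second placeholder pair violates the expected ordering, so only its index is reported.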
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING
This applies to all measurements. When there is an L1D cache miss, the load is passed on to the L2 cache, right? So a load that missed L2 also missed L1 earlier. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING, it is probably not that. Are the stalls/cycles somehow being measured separately? But then there is this formula:
%L2_Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (Another issue with this formula is that it should probably use CYCLES instead of STALLS. However, this wouldn't solve the problem described above.) So how can this be explained?
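To make the sign problem concrete, here is a minimal sketch that evaluates the %L2_Bound formula exactly as written above. The function name `l2_bound` and the counter values are hypothetical placeholders chosen only to mirror the observed ordering; they are not real measurements.

```python
def l2_bound(stalls_l1d_pending, stalls_l2_pending, clocks):
    """Evaluate (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS.

    The result is returned as a fraction of CLOCKS, not as a percentage.
    """
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    return (stalls_l1d_pending - stalls_l2_pending) / clocks


# Ordering the formula assumes: L1D_PENDING >= L2_PENDING.
expected_case = l2_bound(stalls_l1d_pending=300, stalls_l2_pending=100, clocks=1000)

# Ordering seen in my measurements: L2_PENDING far larger than L1D_PENDING
# (placeholder values with a ratio of 1000x).
observed_case = l2_bound(stalls_l1d_pending=1, stalls_l2_pending=1000, clocks=10000)

print(expected_case)      # 0.2
print(observed_case < 0)  # True
```

With the ordering I observe, the formula yields a negative share of cycles, which cannot be a meaningful bound.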
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26