Thursday, January 6, 2011

OProfile performance counter events for the Intel Nehalem processor

Some common performance counter events:

Name | Description | Counters usable | Unit mask options
CPU_CLK_UNHALTED | Clock cycles when not halted | all |
UNHALTED_REFERENCE_CYCLES | Unhalted reference cycles | 0, 1, 2 | 0x01: No unit mask
LLC_MISSES | Last level cache demand requests from this core that missed the LLC | all | 0x41: No unit mask
LLC_REFS | Last level cache demand requests from this core | all | 0x4f: No unit mask
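As a rough sketch of how these events might be passed to OProfile, the snippet below builds the `--event=name:count:unitmask:kernel:user` arguments that `opcontrol` accepts. The helper names (`event_spec`, `opcontrol_setup_command`) and the sampling periods are my own assumptions for illustration; check `ophelp` on your machine for the minimum count each event allows.

```python
def event_spec(name, count, unit_mask=None, kernel=1, user=1):
    """Format one event as name:count[:unitmask:kernel:user] for opcontrol --event."""
    if unit_mask is None:
        # No mask given: leave it to OProfile to apply the event's default unit mask.
        return f"{name}:{count}"
    return f"{name}:{count}:0x{unit_mask:02x}:{kernel}:{user}"


# (event name, sampling period, unit mask). The periods are illustrative
# assumptions, not recommended values.
EVENTS = [
    ("CPU_CLK_UNHALTED", 100000, None),
    ("LLC_MISSES", 6000, 0x41),
    ("LLC_REFS", 6000, 0x4F),
]


def opcontrol_setup_command(events):
    """Return the argument list for an 'opcontrol --setup' call (not executed here)."""
    args = ["opcontrol", "--setup", "--no-vmlinux"]
    args += [f"--event={event_spec(n, c, m)}" for n, c, m in events]
    return args


command = opcontrol_setup_command(EVENTS)
print(" ".join(command))
```

The script only prints the command line; running it for real needs root and a kernel with OProfile support.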


(Update) LLC_MISSES is not well documented by Intel. It seems to include L2 cache misses. To count only the retired loads that miss the last level cache, use MEM_LOAD_RETIRED:0x10 instead. In my measurements, LLC_MISSES was as much as ten times larger than MEM_LOAD_RETIRED:0x10.
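To compare two events sampled at different periods, the sample counts first have to be scaled back to estimated event counts (samples times the sampling period). Here is a minimal sketch of that arithmetic; the sample totals and periods are made-up placeholders, not my measurements, and the function names are hypothetical.

```python
def estimated_events(samples, sample_period):
    """Each OProfile sample stands for roughly sample_period occurrences of the event."""
    return samples * sample_period


def count_ratio(numerator, denominator):
    """Ratio of two estimated event counts, guarding against an empty denominator."""
    if denominator == 0:
        raise ValueError("denominator event recorded no samples")
    return numerator / denominator


# Placeholder inputs: sample totals as reported by opreport and the periods
# given to opcontrol. Replace them with values from your own run.
llc_misses = estimated_events(samples=4200, sample_period=6000)
llc_load_misses = estimated_events(samples=1260, sample_period=2000)

ratio = count_ratio(llc_misses, llc_load_misses)
print(f"LLC_MISSES / MEM_LOAD_RETIRED:0x10 = {ratio:.1f}")
```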

Other useful metrics:
MEM_INST_RETIRED:0x01, the number of instructions with an architecturally-visible load retired on the architected path;

MEM_LOAD_RETIRED:0x04, llc_unshared_hit, the number of retired loads that hit their own, unshared lines in the LLC;
MEM_LOAD_RETIRED:0x08, other_core_l2_hit_hitm, the number of retired loads that hit in a sibling core's L2 (on die core);
MEM_LOAD_RETIRED:0x80, dtlb_miss, the number of retired loads that missed the DTLB;

MEM_UNCORE_RETIRED:0x08, remote_cache_local_home_hit, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and hit in a remote socket's cache;
MEM_UNCORE_RETIRED:0x10, remote_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and was remotely homed (dram);
MEM_UNCORE_RETIRED:0x20, local_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and required a local socket memory reference (dram).
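One way to use the three MEM_UNCORE_RETIRED events above is to see how the loads that missed the LLC split between a remote cache, remote DRAM and local DRAM. The sketch below computes each event's share of their combined total. The counts are placeholders, the `locality_shares` helper is hypothetical, and the shares only cover the three unit masks listed here, not every MEM_UNCORE_RETIRED sub-event.

```python
def locality_shares(counts):
    """Return each event's fraction of the combined total of the given counts."""
    total = sum(counts.values())
    if total == 0:
        # No events recorded: report zero shares instead of dividing by zero.
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Placeholder estimated event counts (samples times sampling period).
uncore_counts = {
    "remote_cache_local_home_hit": 100,  # MEM_UNCORE_RETIRED:0x08
    "remote_dram": 300,                  # MEM_UNCORE_RETIRED:0x10
    "local_dram": 600,                   # MEM_UNCORE_RETIRED:0x20
}

shares = locality_shares(uncore_counts)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

A large remote_dram share suggests the memory is allocated on the wrong NUMA node for the thread that reads it.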