Tuesday, March 15, 2011

Where to find multi-core processor core mapping information

My OS is Fedora 13.

/sys/devices/system/cpu/cpu0/topology/core_siblings_list lists the logical CPU IDs that share a physical package (socket) with cpu0;
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list lists the logical CPU IDs that are multithreading/hyperthreading siblings of cpu0 (i.e. virtual processors that share the same physical core).
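
The two files above hold comma-separated CPU lists such as "0-3,8-11". A minimal Python sketch for reading them is given here; the helper names parse_cpu_list and read_siblings are my own, and the parser assumes only plain numbers and simple ranges appear in the file.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # A range "first-last" is inclusive on both ends.
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_siblings(cpu=0, root="/sys/devices/system/cpu"):
    """Return (core_siblings, thread_siblings) for one logical CPU."""
    topology = Path(root) / f"cpu{cpu}" / "topology"
    return tuple(
        parse_cpu_list((topology / name).read_text())
        for name in ("core_siblings_list", "thread_siblings_list")
    )
```

Calling read_siblings(0) on a Linux machine returns two lists: the logical CPUs in the same package as cpu0, and the ones on the same physical core. The root parameter is only there so the function can be pointed at a copy of the sysfs tree.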

/sys/devices/system/cpu/cpu0/cache/index*/ contains all the cache information for cpu0. Take my Intel Nehalem Xeon E5520 as an example:
/sys/devices/system/cpu/cpu0/cache/index0/level shows that this is an L1 cache;
/sys/devices/system/cpu/cpu0/cache/index0/type shows that this is a data cache;
/sys/devices/system/cpu/cpu0/cache/index0/size shows that the cache size is 32 KB.

Similarly,
/sys/devices/system/cpu/cpu0/cache/index1/ describes the L1 instruction cache;
/sys/devices/system/cpu/cpu0/cache/index2/ describes the unified L2 cache;
/sys/devices/system/cpu/cpu0/cache/index3/ describes the unified, shared L3 cache.
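
To dump all of these directories at once, something like the following Python sketch should work. The function name read_caches is hypothetical, and shared_cpu_list is an extra attribute that I am assuming the kernel exposes next to level, type and size; attributes that are missing are simply skipped.

```python
from pathlib import Path


def read_caches(cpu=0, root="/sys/devices/system/cpu"):
    """Return one dict per cache/index* directory of the given logical CPU."""
    cache_dir = Path(root) / f"cpu{cpu}" / "cache"
    caches = []
    for index_dir in sorted(cache_dir.glob("index[0-9]*")):
        entry = {"index": index_dir.name}
        for attr in ("level", "type", "size", "shared_cpu_list"):
            attr_file = index_dir / attr
            # Skip attributes that this kernel does not provide.
            if attr_file.is_file():
                entry[attr] = attr_file.read_text().strip()
        caches.append(entry)
    return caches


def describe(cache):
    """Format one cache entry as a short human-readable line."""
    return "{index}: L{level} {type}, {size}".format(**cache)
```

Running `for c in read_caches(0): print(describe(c))` prints one line per cache visible to cpu0. The values are returned as the raw strings found in sysfs, so the size is not converted to a number.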

Wednesday, February 2, 2011

GNU compiler "-ffloat-store" option

-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
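
To see why excess precision can matter, here is a small Python simulation. It does not invoke GCC; it only models the rounding, using exact fractions, under the assumption that an x87 register keeps a 64-bit significand while a double stored in memory keeps 53 bits. The helpers round_to_bits and add_then_subtract are my own names, and exponent range is ignored.

```python
from fractions import Fraction


def round_to_bits(x, bits):
    """Round a Fraction to `bits` significand bits, ties to even."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    scale = Fraction(2) ** (bits - 1 - e)
    # round() on a Fraction rounds half to even and returns an int.
    return sign * Fraction(round(x * scale)) / scale


def add_then_subtract(a, b, bits):
    """Compute (a + b) - a, rounding each intermediate result to `bits` bits."""
    total = round_to_bits(a + b, bits)
    return round_to_bits(total - a, bits)


a = Fraction(1)
b = Fraction(1, 2 ** 60)
in_register = add_then_subtract(a, b, 64)  # intermediate kept at 64 bits
in_memory = add_then_subtract(a, b, 53)    # intermediate rounded to double
print(in_register == b)  # True
print(in_memory == 0)    # True
```

With 64 bits the small term b survives the addition, while with 53 bits it is rounded away, so the same source expression gives two different answers depending on whether the intermediate value stays in a register or is stored to memory.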

Thursday, January 6, 2011

Oprofile performance counter events for Intel Nehalem processor

Some common performance counter events:

Name                      | Description                                                         | Counters usable | Unit mask options
CPU_CLK_UNHALTED          | Clock cycles when not halted                                        | all             |
UNHALTED_REFERENCE_CYCLES | Unhalted reference cycles                                           | 0, 1, 2         | 0x01: No unit mask
LLC_MISSES                | Last level cache demand requests from this core that missed the LLC | all             | 0x41: No unit mask
LLC_REFS                  | Last level cache demand requests from this core                     | all             | 0x4f: No unit mask

Update: LLC_MISSES is not well documented by Intel, and it seems to include L2 cache misses. To count only the retired loads that miss the last level cache, one can use MEM_LOAD_RETIRED:0x10 instead. In my measurements, LLC_MISSES could be ten times larger than MEM_LOAD_RETIRED:0x10.

Other useful metrics:
MEM_INST_RETIRED:0x01, the number of instructions with an architecturally-visible load retired on the architected path;

MEM_LOAD_RETIRED:0x04, llc_unshared_hit, the number of retired loads that hit their own, unshared lines in the LLC;
MEM_LOAD_RETIRED:0x08, other_core_l2_hit_hitm, the number of retired loads that hit in a sibling core's L2 (on die core);
MEM_LOAD_RETIRED:0x80, dtlb_miss, the number of retired loads that missed the DTLB;

MEM_UNCORE_RETIRED:0x08, remote_cache_local_home_hit, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and hit in a remote socket's cache;
MEM_UNCORE_RETIRED:0x10, remote_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and was remotely homed (DRAM);
MEM_UNCORE_RETIRED:0x20, local_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and required a local socket memory reference (DRAM).
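
The EVENT:0xNN pairs above map onto opcontrol's --event=name:count:unitmask:kernel:user syntax. Below is a rough Python sketch that only builds the command line and does not run it. The helper names are my own, the sample count of 100000 is a placeholder rather than a value from my measurements (oprofile enforces a per-event minimum), and I have not checked whether these three events can all be scheduled on counters at the same time.

```python
SAMPLE_COUNT = 100000  # placeholder sampling interval, not a measured value

# (event name, unit mask) pairs taken from the lists above.
EVENTS = [
    ("MEM_INST_RETIRED", 0x01),    # retired instructions with a load
    ("MEM_LOAD_RETIRED", 0x10),    # retired loads that miss the LLC
    ("MEM_UNCORE_RETIRED", 0x20),  # local_dram
]


def event_spec(name, count, unit_mask, kernel=True, user=True):
    """Format one event as name:count:unitmask:kernel:user."""
    return f"{name}:{count}:0x{unit_mask:02x}:{int(kernel)}:{int(user)}"


def opcontrol_args(events, count=SAMPLE_COUNT):
    """Build an opcontrol argument list with one --event option per event."""
    return ["opcontrol"] + [
        f"--event={event_spec(name, count, mask)}" for name, mask in events
    ]


print(" ".join(opcontrol_args(EVENTS)))
```

The kernel and user flags default to 1 so that both kernel-mode and user-mode activity are profiled; pass kernel=False to event_spec to count user-space only.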