Tuesday, March 15, 2011

Where to find multi-core processor core mapping information

My OS is Fedora 13.

/sys/devices/system/cpu/cpu0/topology/core_siblings_list shows which logical CPU IDs are in the same physical package (socket) as cpu0;
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list shows which logical CPU IDs are multithreading/hyperthreading siblings of cpu0 (i.e. virtual processors that share the same physical core).
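
A minimal Python sketch for reading these two files, in case a script needs the mapping. The function names and the `root` parameter are my own (hypothetical), and it assumes the files hold a comma-separated list of IDs and ranges such as "0-3,8-11":

```python
import os

def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def read_topology(cpu=0, root="/sys/devices/system/cpu"):
    """Return the sibling lists of one logical CPU as {file name: [cpu ids]}.

    `root` is a parameter only so the function can be pointed at a copy
    of the sysfs tree; on a live system the default should be right.
    """
    topo_dir = os.path.join(root, "cpu%d" % cpu, "topology")
    result = {}
    for name in ("core_siblings_list", "thread_siblings_list"):
        with open(os.path.join(topo_dir, name)) as f:
            result[name] = parse_cpu_list(f.read())
    return result
```

Calling read_topology(0) on the machine itself returns both lists for cpu0; looping over all cpuN directories would give the full mapping.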

/sys/devices/system/cpu/cpu0/cache/index*/ contains all the cache information, one directory per cache. Take my Intel Nehalem Xeon E5520 as an example:
/sys/devices/system/cpu/cpu0/cache/index0/level shows this is an L1 cache;
/sys/devices/system/cpu/cpu0/cache/index0/type shows this is a data cache;
/sys/devices/system/cpu/cpu0/cache/index0/size shows the cache size is 32KB.

Similarly,
/sys/devices/system/cpu/cpu0/cache/index1/ describes the L1 Icache;
/sys/devices/system/cpu/cpu0/cache/index2/ depicts the L2 unified cache;
/sys/devices/system/cpu/cpu0/cache/index3/ is for the L3 unified shared cache.
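
The same information can be collected in one go. This is only a sketch: the function name and the `root` parameter are hypothetical, and it reads just the three attributes mentioned above (level, type, size) from every index* directory:

```python
import glob
import os

def read_cache_info(cpu=0, root="/sys/devices/system/cpu"):
    """Return one dict per cache of the given logical CPU.

    Each dict holds the directory name plus the raw contents of the
    level, type and size files (stripped of the trailing newline).
    """
    pattern = os.path.join(root, "cpu%d" % cpu, "cache", "index*")
    caches = []
    # Plain string sort; fine for index0..index9.
    for index_dir in sorted(glob.glob(pattern)):
        entry = {"index": os.path.basename(index_dir)}
        for attr in ("level", "type", "size"):
            with open(os.path.join(index_dir, attr)) as f:
                entry[attr] = f.read().strip()
        caches.append(entry)
    return caches
```

Printing the result of read_cache_info(0) should list the four caches described above, from the L1 data cache to the shared L3.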

Wednesday, February 2, 2011

GNU compiler "-ffloat-store" option

-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
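
To get a feel for why "register" and "memory" copies of the same value can disagree, here is a small Python analogy. It is not the x87 case itself: Python floats are IEEE doubles, so the sketch plays the double as the wider register format and an IEEE single as the narrower stored format. The helper name is hypothetical:

```python
import struct

def store_as_float32(x):
    """Round a Python float (IEEE double) to IEEE single precision,
    the way writing it to a narrower variable in memory would."""
    return struct.unpack("f", struct.pack("f", x))[0]

wide = 1.0 / 3.0                   # value as kept in the wider format
stored = store_as_float32(wide)    # same value after a narrowing store

# The two copies differ, so a comparison gives different answers
# depending on which copy the compiler happens to use.
excess_precision_visible = (wide != stored)
print(excess_precision_visible)
```

A value such as 0.5 is exactly representable in both formats and survives the round trip unchanged; 1/3 does not, which is the kind of difference -ffloat-store is meant to make predictable.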

Thursday, January 6, 2011

Oprofile performance counter events for Intel Nehalem processor

Some common performance counter events:

Name | Description | Counters usable | Unit mask options
CPU_CLK_UNHALTED | Clock cycles when not halted | all | (none listed)
UNHALTED_REFERENCE_CYCLES | Unhalted reference cycles | 0, 1, 2 | 0x01: No unit mask
LLC_MISSES | Last level cache demand requests from this core that missed the LLC | all | 0x41: No unit mask
LLC_REFS | Last level cache demand requests from this core | all | 0x4f: No unit mask


(Update) LLC_MISSES is not well documented by Intel, and it seems to include L2 cache misses. To count only the retired loads that miss the last level cache, one can use MEM_LOAD_RETIRED:0x10 instead. In my measurements, LLC_MISSES was as much as ten times larger than MEM_LOAD_RETIRED:0x10.

Other useful metrics:
MEM_INST_RETIRED:0x01, the number of instructions with an architecturally-visible load retired on the architected path;

MEM_LOAD_RETIRED:0x04, llc_unshared_hit, the number of retired loads that hit their own, unshared lines in the LLC cache;
MEM_LOAD_RETIRED:0x08, other_core_l2_hit_hitm, the number of retired loads that hit in a sibling core's L2 (on die core);
MEM_LOAD_RETIRED:0x80, dtlb_miss, the number of retired loads that missed the DTLB;

MEM_UNCORE_RETIRED:0x08, remote_cache_local_home_hit, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and HIT in a remote socket's cache;
MEM_UNCORE_RETIRED:0x10, remote_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and was remotely homed (dram);
MEM_UNCORE_RETIRED:0x20, local_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and required a local socket memory reference (dram);
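
A small sketch for turning the event/unit-mask pairs above into opcontrol arguments, assuming the usual name:count:unitmask:kernel:user form of --event. The helper name is hypothetical and the sample count of 100000 is an arbitrary placeholder, not a recommended value:

```python
def oprofile_event_arg(name, count, unit_mask, kernel=True, user=True):
    """Format one event as an opcontrol --event argument.

    Assumed layout: name:count:unitmask:kernel:user, where kernel and
    user are 1 or 0 depending on whether that mode is profiled.
    """
    return "--event=%s:%d:0x%02x:%d:%d" % (
        name, count, unit_mask, int(kernel), int(user))

# Event names and unit masks are taken from the lists above;
# the count (samples every N events) is a hypothetical choice.
events = [
    ("MEM_LOAD_RETIRED", 0x10),     # retired loads that miss the LLC
    ("MEM_UNCORE_RETIRED", 0x20),   # local_dram
]
args = [oprofile_event_arg(name, 100000, mask) for name, mask in events]
print(" ".join(["opcontrol"] + args))
```

The printed line is meant to be pasted into a shell; whether a given count is accepted depends on the event's minimum count on the machine.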

Wednesday, December 29, 2010

Instrument an MPI program with the Pin tool

Usually an MPI program is started by "mpirun". If mpirun itself is passed to Pin as the target, Pin will not be able to catch the behavior of each MPI process, and even worse, it can break the run. The solution is to let mpirun launch Pin, so that every MPI process runs under its own Pin instance, for example:

mpirun -np 32 pin -t /usr/pin-2.8/source/tools/Memory/obj-intel64/fp.so -o ./bin/foo.out -- ./bin/foo
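
If the command is generated from a script, the ordering can be kept straight with a small helper. This is only a sketch; the function name and parameters are hypothetical, and it simply rebuilds the command line shown above with mpirun outermost and pin wrapping the program:

```python
def pin_under_mpirun(num_procs, pin_tool, tool_output, program, program_args=()):
    """Build an argv list in which mpirun starts one pin per MPI process.

    Everything after '--' is the program that Pin instruments.
    """
    return (["mpirun", "-np", str(num_procs),
             "pin", "-t", pin_tool, "-o", tool_output,
             "--", program] + list(program_args))

cmd = pin_under_mpirun(
    32,
    "/usr/pin-2.8/source/tools/Memory/obj-intel64/fp.so",
    "./bin/foo.out",
    "./bin/foo",
)
print(" ".join(cmd))
```

The list form can be handed directly to subprocess.call(cmd) if the script should also launch the job.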

Tuesday, December 28, 2010

Ruby file operations

FileUtils is a Ruby module that provides basic file operations, such as copy, remove, rename, etc. However, I ran into the following error, for which I don't have a solution yet. The file was clearly there, but FileUtils.copy didn't work.

/usr/lib/ruby/1.8/FileUtils.rb:1200:in `stat': No such file or directory

Monday, December 20, 2010

64-bit compilation option

I got the following error when compiling ft.C.1 in NPB-3.3:

mpif77 -O3 -m64 -o ../bin/ft.C.1 ft.o ../common/randi8.o ../common/print_results.o ../common/timers.o
ft.o: In function `transpose2_finish_':
ft.f:(.text+0x614): relocation truncated to fit: R_X86_64_PC32 against symbol `procgrid_' defined in COMMON section in ft.o
...

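The reason is that the program's static data (here, Fortran COMMON blocks) is too large to fit in 2GB, so a 32-bit IP-relative relocation (R_X86_64_PC32) can no longer reach it.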


The related GNU compiler option is -mcmodel, and possible values are:
  • small. Tells the compiler to restrict code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing.
  • medium. Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data. Accesses of code can be done with IP-relative addressing, but accesses of data must be done with absolute addressing.
  • large. Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing.
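
The three rules above can be written down as a rough check. This is a simplified sketch: the function name is hypothetical, it only compares total sizes against 2GB, and the array in the example is an illustrative size, not something read from the NPB sources:

```python
TWO_GIB = 2 ** 31  # the 2GB limit mentioned in the list above

def choose_mcmodel(code_bytes, data_bytes):
    """Pick the smallest code model whose restrictions are satisfied.

    small:  code and data together fit in the first 2GB
    medium: code fits in 2GB, data is unrestricted
    large:  no restriction on either
    """
    if code_bytes + data_bytes < TWO_GIB:
        return "small"
    if code_bytes < TWO_GIB:
        return "medium"
    return "large"

# Illustrative only: one 512 x 512 x 512 array of double-precision
# complex numbers (16 bytes each) already occupies exactly 2GB.
array_bytes = 512 ** 3 * 16
print(choose_mcmodel(1 << 20, array_bytes))
```

For a program with a small amount of code and a statically allocated array of that size, the sketch picks medium, which corresponds to adding -mcmodel=medium to the compile and link flags.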