Technical Stuff
Thursday, March 24, 2011
To remove the last character of each line, use sed:
sed 's/.$//'
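For example (a quick check; the input string is just an illustration):
echo "hello" | sed 's/.$//'     # prints "hell"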
Tuesday, March 15, 2011
Where to find multi-core processor core mapping information
My OS is Fedora 13.
/sys/devices/system/cpu/cpu0/topology/core_siblings_list shows which CPU IDs are siblings (i.e. reside in the same physical package);
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list shows which CPU IDs are multithreading/hyperthreading siblings (i.e. virtual processors that share the same physical core).
/sys/devices/system/cpu/cpu0/cache/index*/ contains all the cache information. Take my Intel Nehalem Xeon E5520 as an example:
/sys/devices/system/cpu/cpu0/cache/index0/level shows this is an L1 cache;
/sys/devices/system/cpu/cpu0/cache/index0/type shows it is a data cache;
/sys/devices/system/cpu/cpu0/cache/index0/size shows the cache size is 32 KB.
Similarly,
/sys/devices/system/cpu/cpu0/cache/index1/ describes the L1 instruction cache;
/sys/devices/system/cpu/cpu0/cache/index2/ describes the unified L2 cache;
/sys/devices/system/cpu/cpu0/cache/index3/ describes the unified, shared L3 cache.
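A quick way to dump all of this from a shell (a small sketch that only reads the files described above):
# topology of cpu0
cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# level, type and size of every cache that cpu0 sees
for d in /sys/devices/system/cpu/cpu0/cache/index*; do
    echo "$d: L$(cat $d/level) $(cat $d/type) $(cat $d/size)"
done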
Wednesday, February 2, 2011
GNU compiler "-ffloat-store" option
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000, where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
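As a sketch, the option just goes on the compile line; foo.c here is a placeholder for a program whose results depend on intermediate rounding, and -m32 is used because 32-bit x86 defaults to x87 math, where the excess precision shows up:
gcc -m32 -O2 -ffloat-store foo.c -o foo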
Thursday, January 6, 2011
Oprofile performance counter events for Intel Nehalem processor
Some common performance counter events:
Name | Description | Counters usable | Unit mask options |
CPU_CLK_UNHALTED | Clock cycles when not halted | all | |
UNHALTED_REFERENCE_CYCLES | Unhalted reference cycles | 0, 1, 2 | 0x01: No unit mask |
LLC_MISSES | Last level cache demand requests from this core that missed the LLC | all | 0x41: No unit mask |
LLC_REFS | Last level cache demand requests from this core | all | 0x4f: No unit mask |
(Update) LLC_MISSES is not well documented by Intel. It seems to include L2 cache misses. Instead, one can use MEM_LOAD_RETIRED:0x10 to collect the number of retired loads that miss the last level cache. My measurements showed that LLC_MISSES can be ten times larger than MEM_LOAD_RETIRED:0x10.
Other useful metrics:
MEM_INST_RETIRED:0x01, the number of instructions with an architecturally-visible load retired on the architected path;
MEM_LOAD_RETIRED:0x04, llc_unshared_hit, the number of retired loads that hit their own, unshared lines in the LLC cache;
MEM_LOAD_RETIRED:0x08, other_core_l2_hit_hitm, the number of retired loads that hit in a sibling core's L2 (on die core);
MEM_LOAD_RETIRED:0x80, dtlb_miss, the number of retired loads that missed the DTLB;
MEM_UNCORE_RETIRED:0x08, remote_cache_local_home_hit, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and HIT in a remote socket's cache;
MEM_UNCORE_RETIRED:0x10, remote_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and was remotely homed (dram);
MEM_UNCORE_RETIRED:0x20, local_dram, the number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and required a local socket memory reference (dram);
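As a rough sketch of collecting some of these events with the legacy opcontrol interface (the sample counts of 100000 and the workload name are placeholders):
opcontrol --init
opcontrol --no-vmlinux
opcontrol --setup --event=MEM_LOAD_RETIRED:100000:0x10:0:1 --event=MEM_UNCORE_RETIRED:100000:0x20:0:1
opcontrol --start
./bin/foo            # the workload to profile
opcontrol --shutdown
opreport -l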
Wednesday, December 29, 2010
Instrument an MPI program with the Pin tool
Usually an MPI program is started by "mpirun". If mpirun is passed to Pin as the target, Pin will not be able to catch the behavior of each MPI process, and even worse, it can break the code. The solution is to let mpirun launch Pin, so that every MPI process runs under its own Pin instance, for example:
mpirun -np 32 pin -t /usr/pin-2.8/source/tools/Memory/obj-intel64/fp.so -o ./bin/foo.out -- ./bin/foo
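The general pattern is the standard Pin command-line form placed after the mpirun options (a sketch; everything in angle brackets is a placeholder):
mpirun -np <N> pin [pin options] -t <pintool>.so [tool options] -- <mpi_program> [program args]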
Tuesday, December 28, 2010
Ruby file operations
FileUtils is a Ruby module that provides basic file operations such as copy, remove, rename, etc. But I ran into the following error and have no solution for it. The file was clearly there, yet FileUtils.copy did not work.
/usr/lib/ruby/1.8/FileUtils.rb:1200:in `stat': No such file or directory
Monday, December 20, 2010
64-bit compilation option
Error I met when compiling ft.C.1 in NPB-3.3:
mpif77 -O3 -m64 -o ../bin/ft.C.1 ft.o ../common/randi8.o ../common/print_results.o ../common/timers.o
ft.o: In function `transpose2_finish_':
ft.f:(.text+0x614): relocation truncated to fit: R_X86_64_PC32 against symbol `procgrid_' defined in COMMON section in ft.o
...
The reason is that the program's static data (the COMMON blocks) is too large to fit in the first 2GB of the address space.
The related GNU compiler option is -mcmodel; its possible values are listed below, followed by a sketch of the fix:
- small. Tells the compiler to restrict code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing.
- medium. Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data. Accesses of code can be done with IP-relative addressing, but accesses of data must be done with absolute addressing.
- large. Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing.
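For the error above, a sketch of the fix is to rebuild with the medium code model, assuming mpif77 wraps a GNU compiler that accepts -mcmodel (every object that touches the large COMMON blocks should be recompiled with the same flag; file names follow the NPB build above):
mpif77 -O3 -m64 -mcmodel=medium -c ft.f
mpif77 -O3 -m64 -mcmodel=medium -o ../bin/ft.C.1 ft.o ../common/randi8.o ../common/print_results.o ../common/timers.o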