Using the IBM Hardware Performance Monitor Toolkit

Monitoring Program Performance on the Research SP

Table of Contents

Overview
HPMCOUNT
LIBHPM
HPMVIZ


Overview

The HPM Toolkit was developed at IBM's T.J. Watson Research Center to measure application performance on the IBM SP-P3. It consists of three utilities: an external monitor called hpmcount, a thread-safe library called libhpm, and a graphical user interface called hpmviz. Hpmcount shepherds an application, either serial or parallel, from beginning to end and reports overall statistics that include not only wall-clock time but also a series of hardware performance and utilization statistics. Libhpm is a directive-based instrumentation library that may be used to examine regions of either serial or parallel applications written in FORTRAN, C or C++. Parallel applications may be MPI, threaded or mixed-mode. Hpmviz is a GUI for viewing the output file created by libhpm.


HPMCOUNT

Hpmcount may be used either serially or in the parallel operating environment (poe). Its usage is simple. Either:
 

prompt% hpmcount [-h] [-o <filename>] [-s <set>] [-e ev[,ev]*] <program>

or:

prompt% poe hpmcount [-h] [-o <filename>] [-s <set>] [-e ev[,ev]*] <program>

where:

"-h" prints a help page.

"-o <filename>" directs output to a file. The process ID is appended to the file name, so a parallel run on eight P.E.s generates eight unique output files.

"-s <set>" selects one of four predefined sets of events:

Event set 1        Event set 2        Event set 3        Event set 4
-----------        -----------        -----------        -----------
Cycles             Cycles             Cycles             Cycles
Inst. completed    Inst. completed    Inst. completed    TLB misses
TLB misses         TLB misses         I cache misses     Loads completed
Stores completed   Stores dispatched  FXU0 ops           Stores completed
Loads completed    L1 store misses    FXU1 ops           L2 load misses
FPU0 ops           Loads dispatched   FXU2 ops           L2 store misses
FPU1 ops           L1 load misses     FPU0 ops           Branches
FMAs executed      LSU idle           FPU1 ops           Mispredicted branches

"-e ev[,ev]*" is a list of event numbers, separated by commas. ev<i> corresponds to the event selected for counter <i>.

"<program>" is the program executable.


Note: On the Indiana University IBM SP, hpmcount is installed in "/miscapps/HPMToolkit/bin". If this directory is not in your path, hpmcount must be invoked by its absolute path name.

An example output of our default events:
 

1: PM_CYC (Cycles) : 5212517
1: PM_INST_CMPL (Instructions completed) : 4244988
1: PM_TLB_MISS (TLB misses) : 9921
1: PM_ST_CMPL (Stores completed) : 1094537
1: PM_LD_CMPL (Loads completed) : 843816
1: PM_FPU0_CMPL (FPU 0 instructions) : 343
1: PM_FPU1_CMPL (FPU 1 instructions) : 84
1: PM_EXEC_FMA (FMAs executed) : 65

NOTE: In the above example, the "1:" prefix is the result of setting the environment variable MP_LABELIO to "YES" in a poe (MPI) job. Each processor returns identical categories of statistics, so without this variable set it is impossible to tell which P.E. is reporting which set of statistics.
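
When output from several tasks is collected, the label makes it possible to split the counter lines by task. The following is a minimal sketch that assumes the line layout is exactly that of the sample above ("task: MNEMONIC (description) : value"). The struct and function names are hypothetical and are not part of the HPM Toolkit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One counter line from labeled hpmcount output, for example:
 *   1: PM_CYC (Cycles) : 5212517
 * The layout is assumed from the sample output only. */
struct counter_line {
    int       task;             /* label added when MP_LABELIO is YES */
    char      mnemonic[32];     /* for example PM_CYC */
    char      description[64];  /* for example Cycles */
    long long value;            /* counter value */
};

/* Returns 1 if the line matches the assumed layout, 0 otherwise. */
int parse_counter_line(const char *line, struct counter_line *out)
{
    int fields;

    assert(line != NULL && out != NULL);
    fields = sscanf(line, "%d: %31s (%63[^)]) : %lld",
                    &out->task, out->mnemonic,
                    out->description, &out->value);
    return fields == 4;
}
```

Lines that do not follow this layout, such as the resource usage and summary lines, are rejected by the function, so it can be applied to every line of an output file.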

Also present in the output file is a set of Resource Usage Statistics. The default set of these statistics for a sample job is:

1: ######## Resource Usage Statistics ########
 

1: Total amount of time in user mode : 0.030000 seconds
1: Total amount of time in system mode : 0.040000 seconds
1: Maximum resident set size : 2664 Kbytes
1: Average shared memory use in text segment : 20 Kbytes*sec
1: Average unshared memory use in data segment : 10488 Kbytes*sec
1: Number of page faults without I/O activity : 939
1: Number of page faults with I/O activity : 1
1: Number of times process was swapped out : 0
1: Number of times file system performed INPUT : 0
1: Number of times file system performed OUTPUT : 0
1: Number of IPC messages sent : 0
1: Number of IPC messages received : 0
1: Number of signals delivered : 36
1: Number of voluntary context switches : 394
1: Number of involuntary context switches : 11


And a summary:
 

1: Utilization rate : 0.196 %
1: Avg number of loads per TLB miss : 85.054
1: Load and store operations : 1.938 M
1: Instructions per load/store : 2.190
1: MIPS : 0.600
1: Instructions per cycle : 0.814
1: HW Float points instructions per Cycle : 0.000
1: Floating point instructions + FMAs : 0.000 M
1: Float point instructions + FMA rate : 0.000 Mflip/s
1: FMA percentage : 26.423 %
1: Computation intensity : 0.000


1: Total execution time (wall clock time): 7.077179 seconds
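
Several of the summary values can be reproduced from the raw counters. The sketch below uses formulas inferred from the sample numbers in this section, not taken from IBM documentation; the struct and function names are hypothetical. Applied to the sample counters, the functions agree with the summary values shown above to three decimal places.

```c
#include <assert.h>
#include <stddef.h>

/* Raw values taken from one hpmcount report (event set 1). */
struct hpm_counts {
    double cycles;          /* PM_CYC */
    double inst_completed;  /* PM_INST_CMPL */
    double tlb_misses;      /* PM_TLB_MISS */
    double stores;          /* PM_ST_CMPL */
    double loads;           /* PM_LD_CMPL */
};

/* "Instructions per cycle" */
double instructions_per_cycle(const struct hpm_counts *c)
{
    assert(c != NULL && c->cycles > 0.0);
    return c->inst_completed / c->cycles;
}

/* "Avg number of loads per TLB miss" */
double loads_per_tlb_miss(const struct hpm_counts *c)
{
    assert(c != NULL && c->tlb_misses > 0.0);
    return c->loads / c->tlb_misses;
}

/* "Load and store operations", in millions */
double load_store_millions(const struct hpm_counts *c)
{
    assert(c != NULL);
    return (c->loads + c->stores) / 1.0e6;
}

/* "Instructions per load/store" */
double instructions_per_load_store(const struct hpm_counts *c)
{
    assert(c != NULL && c->loads + c->stores > 0.0);
    return c->inst_completed / (c->loads + c->stores);
}

/* "MIPS": millions of instructions per second of wall clock time */
double mips(const struct hpm_counts *c, double wall_seconds)
{
    assert(c != NULL && wall_seconds > 0.0);
    return c->inst_completed / wall_seconds / 1.0e6;
}
```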
 


LIBHPM

The libhpm library instruments selected sections of FORTRAN, C and C++ applications. In the words of the developers:

Libhpm supports multiple instrumentation points, nested instrumentation, and each instrumented point can be called multiple times. When nested instrumentation is used, exclusive duration is generated for the outer points. Average and standard deviation is provided when an instrumented point is activated multiple times.

This library set supports both POSIX threads and OpenMP, as well as both 32-bit and 64-bit programs. Care should be taken to link the proper libraries with the proper applications. This will be discussed further below. One important consideration is that statistical information is collected during the run. Thus, an excessive number of instrumented regions, or an instrumented region that is entered repeatedly (such as one placed inside an inner loop with a high iteration count), will cause a very noticeable degradation of the application's performance.
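
The cost of repeated calls can be illustrated by counting how often a section is opened. In the sketch below, hpmStart and hpmStop are stand-in stubs that only count calls; they are NOT the real libhpm functions, and the global counter and both summation functions are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in stubs (not the real libhpm): they only count how often a
 * section is opened. When building against the toolkit, remove them
 * and include "libhpm.h" instead. */
long section_starts = 0;

static void hpmStart(int instID, const char *label)
{
    (void)instID;
    (void)label;
    ++section_starts;
}

static void hpmStop(int instID)
{
    (void)instID;
}

/* Costly placement: one start/stop pair per loop iteration. */
double sum_instrumented_inside(const double *a, int n)
{
    double s = 0.0;
    int i;

    assert(a != NULL);
    for (i = 0; i < n; ++i) {
        hpmStart(1, "inner");
        s += a[i];
        hpmStop(1);
    }
    return s;
}

/* Cheaper placement: a single start/stop pair around the whole loop. */
double sum_instrumented_outside(const double *a, int n)
{
    double s = 0.0;
    int i;

    assert(a != NULL);
    hpmStart(2, "outer");
    for (i = 0; i < n; ++i) {
        s += a[i];
    }
    hpmStop(2);
    return s;
}
```

Both functions return the same sum, but the first opens the section n times while the second opens it once. With the real library, each of those calls also collects statistics.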

Libhpm uses the same counters and event sets that were defined for hpmcount. The event set is read from either the environment variable LIBHPM_EVENT_SET or a user-created file named libHPMevents. The file takes precedence over the environment variable. The environment variable chooses among the four predefined event sets (the default is the first one). If the file is used, each of the eight counters must be designated manually, and each may be used only once. The format for this file is:

Counter number (numbered 0 thru 7) ; Event number; Event Mnemonic (Ex.: PM_CYC#) ; Description (Ex.: Cycles#)

A libHPMevents example file is included below (note that the counter numbers are not required to be in order). This is the definition for the default event set (1):
 

3 1 PM_CYC# Cycles#
4 5 PM_FPU0_CMPL# FPU 0 instructions#
1 35 PM_FPU1_CMPL# FPU 1 instructions#
0 5 PM_IC_MISS# I cache misses#
2 5 PM_LD_MISS_L1# Load misses in L1#
7 0 PM_TLB_MISS# TLB misses#
5 5 PM_CBR_DISP# Branches#
6 3 PM_MPRED_BR# Misspredicted branches#


It is assumed that if the file is used, the user has considerable experience with hardware counters. Otherwise, it would be best to use the pre-defined set.
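
The rule that each of the eight counters is designated exactly once can be checked before a run. The following is a minimal sketch that assumes each line of the file has the layout of the example above: two integers, then a mnemonic and a description that each end with "#". The struct and function names are hypothetical and are not part of libhpm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define HPM_NUM_COUNTERS 8

/* One line of a libHPMevents file, for example:
 *   3 1 PM_CYC# Cycles#
 * The layout is assumed from the example file only. */
struct event_entry {
    int  counter;           /* counter number, 0 through 7 */
    int  event;             /* event number */
    char mnemonic[64];      /* text before the first '#' */
    char description[128];  /* text before the second '#' */
};

/* Returns 1 if the line matches the assumed layout, 0 otherwise. */
int parse_event_line(const char *line, struct event_entry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%d %d %63[^#]# %127[^#]#",
                  &out->counter, &out->event,
                  out->mnemonic, out->description) == 4;
}

/* Returns 1 when every counter 0..7 appears exactly once. */
int counters_used_once(const struct event_entry *entries, int n)
{
    int seen[HPM_NUM_COUNTERS] = { 0 };
    int i;

    assert(entries != NULL);
    if (n != HPM_NUM_COUNTERS) {
        return 0;
    }
    for (i = 0; i < n; ++i) {
        int c = entries[i].counter;
        if (c < 0 || c >= HPM_NUM_COUNTERS || seen[c]) {
            return 0;
        }
        seen[c] = 1;
    }
    return 1;
}
```

The check only covers the counter numbers. Whether a given event number is valid for a given counter is hardware specific and is not verified here.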

Function Calls

There are three pairs of function calls, and each opening call must have a corresponding closing call. Each pair comes in two forms, one for FORTRAN and one for C/C++.

For FORTRAN programs:

f_hpminit( taskID, progName )
f_hpmterminate( taskID )

f_hpmstart( instID, label )
f_hpmstop( instID )

f_hpmtstart( instID, label )
f_hpmtstop( instID )


And for C/C++:

hpmInit( taskID, progName )
hpmTerminate( taskID )

hpmStart( instID, label )
hpmStop( instID )

hpmTstart( instID, label )
hpmTstop( instID )


Examples

FORTRAN Coding:
 

Instrumentation should begin with declaring:
#include "f_hpm.h"

The first call to libhpm is to initialize it:
call f_hpminit( taskID, progName )

Each section under investigation would then be bracketed by a unique start and stop:
call f_hpmstart( instID, label )
do work
call f_hpmstop( instID )

Finally, a call to terminate the libhpm monitor is required:
call f_hpmterminate( taskID )

NOTES: "progName" and "label" are both character strings, while "instID" and "taskID" are integers.
The include file "f_hpm.h" requires the C preprocessor. Either giving the source file a ".F" suffix or adding
"-qsuffix=cpp=f" to the command line will enable it.
 

C and C++ Coding:
 
Instrumentation should begin with declaring:
(for C) #include "libhpm.h"
(for C++) #include "libhpm.H"

The first call to libhpm is to initialize it:
hpmInit( taskID, progName );

Each section under investigation would then be bracketed by a unique start and stop:
hpmStart(1, "MPI section");
{ Work is done }
hpmStop( 1);

Finally, a call to terminate the libhpm monitor is required:
hpmTerminate( taskID );

NOTES: Individual hpmStart and hpmStop calls do not have to be nested with each other, but they must all fall between hpmInit and hpmTerminate.
Instrumentation will affect runtimes. C++ programs are instrumented the same way as C programs, except as noted above.
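
The calls above can be put together as follows. This is a minimal sketch, not code from the toolkit: HAVE_LIBHPM is a hypothetical build macro, and when it is not defined the four calls are replaced by stand-in stubs so that the file compiles without the library. The stubs also enforce the rule that hpmStart and hpmStop must fall between hpmInit and hpmTerminate. The function name run_sections and the labels are illustrative.

```c
#include <assert.h>

#ifdef HAVE_LIBHPM   /* hypothetical macro: define it when linking libhpm */
#include "libhpm.h"
#else
/* Stand-in stubs (not the real libhpm). They only check that sections
 * are opened and closed between hpmInit and hpmTerminate. */
static int hpm_ready = 0;

static void hpmInit(int taskID, const char *progName)
{
    (void)taskID;
    (void)progName;
    hpm_ready = 1;
}

static void hpmTerminate(int taskID)
{
    (void)taskID;
    hpm_ready = 0;
}

static void hpmStart(int instID, const char *label)
{
    (void)instID;
    (void)label;
    assert(hpm_ready);
}

static void hpmStop(int instID)
{
    (void)instID;
    assert(hpm_ready);
}
#endif

#define N 100

/* Fills an array and sums it, with one instrumented section each. */
long run_sections(int taskID)
{
    long a[N];
    long total = 0;
    int i;

    hpmInit(taskID, "sections_sketch");

    hpmStart(1, "fill");
    for (i = 0; i < N; ++i) {
        a[i] = i;
    }

    hpmStart(2, "sum");   /* opened while section 1 is still active */
    hpmStop(1);           /* sections may overlap rather than nest */
    for (i = 0; i < N; ++i) {
        total += a[i];
    }
    hpmStop(2);

    hpmTerminate(taskID);
    return total;
}
```

In an MPI program, taskID would normally be the rank of the calling task, so that each task writes its own output files.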
 


Compilation and output:

A sample FORTRAN compile line (for a threaded program):
xlf_r -qfree -I /miscapps/HPMToolkit/include hpmlib.F -L /miscapps/HPMToolkit/lib -l hpm_r -lpmapi -lm -o hpmtest

The instrumented source code is available in arraytest.F.

A sample C compile line (for an MPI program):
mpcc -o jacob jacob.c -I /miscapps/HPMToolkit/include -lm -L /miscapps/HPMToolkit/lib -l hpm -lpmapi

The instrumented source code is available in jacob.c.

Running the program:
The program is run normally and the output files are generated automatically.

Output file listing:
-rw-r--r-- 1 rsheppar uits 4055 May 13 12:26 hpm0000_libhpm_test_30254.viz
-rw-r--r-- 1 rsheppar uits 5322 May 13 12:26 perfhpm0000.30254

Output file contents (excerpts):

    Total summary:

    libhpm (Version 2.3.1) summary - running on POWER3-II

    Total execution time of instrumented code (wall time): 253.422638 seconds
    ######## Resource Usage Statistics ########

    Total amount of time in user mode : 243.110000 seconds
    Total amount of time in system mode : 1.060000 seconds
    Maximum resident set size : 234972 Kbytes
    Average shared memory use in text segment : 1561432 Kbytes*sec
    Average unshared memory use in data segment : 2147483647 Kbytes*sec
    Number of page faults without I/O activity : 58762
    Number of page faults with I/O activity : 14
    Number of times process was swapped out : 0
    Number of times file system performed INPUT : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent : 0
    Number of IPC messages received : 0
    Number of signals delivered : 0
    Number of voluntary context switches : 12
    Number of involuntary context switches : 24521

    ####### End of Resource Statistics ########

    One section given at each label:

    Instrumented section: 1 - Label: Initialize - process: 0
    file: hpmlib.F, lines: 42 <--> 59
    Count: 1
    Wall Clock Time: 9.301887 seconds
    Total time in user mode: 0.000384374248119012 seconds

    Instrumented section: 2 - Label: load_arrays - process: 0
    file: hpmlib.F, lines: 64 <--> 88
    Count: 1
    Wall Clock Time: 2.753696 seconds
    Total time in user mode: 2.25279468087226 seconds
     

    Instrumented section: 3 - Label: Do_work - process: 0
    file: hpmlib.F, lines: 93 <--> 114
    Count: 1
    Wall Clock Time: 241.366528 seconds
    Total time in user mode: 240.371758281325 seconds

    PM_CYC (Cycles) : 90136692995
    PM_INST_CMPL (Instructions completed) : 84014371208
    PM_TLB_MISS (TLB misses) : 210418
    PM_ST_CMPL (Stores completed) : 5384970219
    PM_LD_CMPL (Loads completed) : 30689881204
    PM_FPU0_CMPL (FPU 0 instructions) : 28586822295
    PM_FPU1_CMPL (FPU 1 instructions) : 6153407802
    PM_EXEC_FMA (FMAs executed) : 14623630290

    Utilization rate : 99.588 %
    Avg number of loads per TLB miss : 145851.977
    Load and store operations : 36074.851 M
    Instructions per load/store : 2.329
    MIPS : 348.078
    Instructions per cycle : 0.932
    HW Float points instructions per Cycle : 0.385
    Floating point instructions + FMAs : 49363.860 M
    Float point instructions + FMA rate : 204.518 Mflip/s
    FMA percentage : 59.248 %
    Computation intensity : 1.368
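
The floating point lines of the summary can be reproduced from the counters of the "Do_work" section. The formulas in this sketch are inferred from the sample numbers above, not taken from IBM documentation, and the struct and function names are hypothetical. With the counters, wall clock time and user time of section 3, the functions agree with the values shown to three decimal places.

```c
#include <assert.h>
#include <stddef.h>

/* Counter values for one instrumented section (event set 1). */
struct fp_counts {
    double cycles;  /* PM_CYC */
    double stores;  /* PM_ST_CMPL */
    double loads;   /* PM_LD_CMPL */
    double fpu0;    /* PM_FPU0_CMPL */
    double fpu1;    /* PM_FPU1_CMPL */
    double fma;     /* PM_EXEC_FMA */
};

/* "Floating point instructions + FMAs", as a raw count */
double fp_plus_fma(const struct fp_counts *c)
{
    assert(c != NULL);
    return c->fpu0 + c->fpu1 + c->fma;
}

/* "Float point instructions + FMA rate", in Mflip/s */
double mflips(const struct fp_counts *c, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return fp_plus_fma(c) / wall_seconds / 1.0e6;
}

/* "FMA percentage": each FMA counts as two floating point operations */
double fma_percentage(const struct fp_counts *c)
{
    assert(fp_plus_fma(c) > 0.0);
    return 100.0 * 2.0 * c->fma / fp_plus_fma(c);
}

/* "Computation intensity": floating point work per load or store */
double computation_intensity(const struct fp_counts *c)
{
    assert(c->loads + c->stores > 0.0);
    return fp_plus_fma(c) / (c->loads + c->stores);
}

/* "HW Float points instructions per Cycle" */
double hw_fp_per_cycle(const struct fp_counts *c)
{
    assert(c != NULL && c->cycles > 0.0);
    return (c->fpu0 + c->fpu1) / c->cycles;
}

/* "Utilization rate": user time as a percentage of wall clock time */
double utilization_percent(double user_seconds, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return 100.0 * user_seconds / wall_seconds;
}
```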


HPMVIZ

Hpmviz is the graphical user interface developed to view the performance files generated by the libhpm library.

This GUI is started with the command "hpmviz file_name.viz". The file "file_name.viz" is created at run time. Each instrumented section is reported on the left. A left click on a section highlights the corresponding source code in the listing on the right. The GUI may be exited through the "QUIT" button in the "FILE" menu.

Example hpmviz screen:

A right click will bring up the instrumentation screen. This screen is exited with a "CTRL W" or clicking the "CLOSE" button.

Example instrumentation screen: