perfmon2
   the hardware-based performance monitoring interface for Linux
opensource.hp.com Link to Linux and HP web site  
Pfmon documentation

Pfmon is a performance monitoring tool for Linux. Pfmon can be used to count or sample on unmodified programs or across an entire system. Pfmon supports the performance counters of all Itanium processors, Intel Pentium M, P6, Pentium 4/Xeon, Intel Core Duo/Solo, Intel Core 2 Duo, AMD X86-64, various MIPS and Power processors.

This document is an attempt at providing some documentation on how to use pfmon. The content covers pfmon version 3.9.

Table of contents
  1. Introduction
  2. Command line options (UPDATED)
  3. Getting event information
  4. Basic counting
    1. Simple examples
    2. Privilege levels
    3. Counter formats
    4. Saving counts
    5. Hiding program output
    6. Getting timing information
    7. Testing event combinations
    8. Using event sets
    9. Using event unit masks
  5. Monitoring running tasks(UPDATED)
  6. System wide sessions
    1. Selecting CPUs to monitor
    2. Delimiting a session
    3. Results aggregation
    4. Delayed start
    5. Saving results in files
    6. Printing intermediate counts
  7. Monitoring multiple processes/threads
    1. Basic monitoring
    2. Saving results in files
    3. Separating execution between fork()/vfork() and exec()
    4. Selectively monitoring processes
    5. Aggregating results from multiple processes/threads
  8. Triggering monitoring at specific locations
    1. Basic code triggers
    2. Data triggers
    3. Repeating triggers
    4. Inheritance of triggers
    5. Dynamic code stop triggers
  9. Dealing with symbols
    1. Symbol table sources
    2. Symbol lookups
    3. Symbols and the --follow-* options
    4. Symbols resolution and sampling
    5. Symbols resolution examples
  10. Sampling with pfmon
    1. Principles
    2. Sampling periods
    3. Sampling modules
    4. Basic sampling examples
    5. Sampling and symbol resolution example
    6. Controlling sampling output
    7. Writing samples to a file
    8. Blocking on overflow notifications
    9. Sampling and the --follow-* options
    10. Sampling in system wide sessions
    11. Randomization of sampling periods
    12. Miscellaneous sampling options
  11. Getting support

For model-specific options, refer to the following manuals:

1. Introduction

Pfmon can be used to monitor unmodified binaries in its per-thread mode and it can also be used to run system wide monitoring sessions. Such sessions are active across all threads executing on a given CPU. Pfmon can launch a system wide session on a dedicated CPU or set of CPUs in parallel.

Pfmon can monitor activities happening at the user and/or kernel level for both type of sessions.

Pfmon can be used to collect basic event counts. It can also be used to sample program or system execution.

In per-thread mode, pfmon can monitor the first (master) thread of a process but it can also follow fork, vfork and pthread-create by dynamically creating additional monitoring sessions.

Pfmon is based on a generic helper library called libpfm which is not included in this package but can be downloaded and installed separately.

In the remainder of this document, we describe the key options and features of pfmon which are available on all CPU (PMU) models. Please refer to the model specific documentation for advanced features.

This documentation is for pfmon-3.9 ONLY.

2. Pfmon command line options

The set of command line options provided by pfmon depends on the host PMU. It is possible to compile pfmon for more than one PMU model and then it will auto-detect the host PMU and provide the corresponding set of options.

The options common to all PMU models are as follows:

-h, --helpdisplay the list of supported options
-V, --versiondisplay pfmon version information
-l[regex], --show-event-list[=regex]show list of supported events by host PMU
-L, --long-show-events[=regex]display matching events names and unit masks
-i event, --event-info=eventget information about a particular event
-u, -3, --user-levelmonitor at the user level for all events
-k, -0, --kernel-levelmonitor at the kernel level for all events
-2monitor at privilege level 2
-1monitor at privilege level 1
-e, --events=ev1,ev2,...select events to monitor in one set
-I,--infoshow pfmon information and detect host CPU
--verboseprint more information during execution
--debugenable debug prints
--outfile=filenamesave counts in a file
--appendwhen used with --outfile, will open the file in append mode
--overflow-blockblock monitored program on overflow notifications
--system-wide create a system wide monitoring session
--cpu-list=x,y-z,..list the cpu (numbers) to include in the system-wide monitoring session
-S format, --smpl-module-info=formatdisplay information about a sampling module
-t secs, --session-timeout=secs duration of the session in seconds
--smpl-outfile=filenamesave samples in a file
--long-smpl-periods=val1,val2,...set long sampling periods for each event
--short-smpl-periods=val1,val2,... set short sampling period for each event
--with-headergenerate a machine description header with counts and samples
--aggregate-resultsaggregate counts and sampling buffer for system-wide sessions
--priv-levels=lvl1,lvl2,... set privilege level per event
--show-time show real,user, and system time for the executed command
--us-counter-formatprint counters using commas (1,024)
--eu-counter-formatprint counters using points (1.024)
--hex-counter-formatprint counters in hexadecimal (0x400)
--smpl-module=module_namespecify the sampling module to use (default: autodetect)
--long-show-events[=regex]display single-line detailed information about events
--symbol-file=filenameuse the ELF archive filename to look for symbols
--sysmap-file=filenameuse the System.map filename to look for symbols
--check-events-onlycheck that the event combination is valid and exit (no measurement)
--smpl-periods-random=mask1:seed1,...randomize both the short and long sampling period
--resolve-addressesresolve addresses to symbols in sampling output
--smpl-print-countsprint counter results when sampling (default: off)
--reset-non-smpl-periodsreset counters not use as sampling period for each sample
--saturate-smpl-buffercollect samples only up to the limit of the buffer (default: off)
--no-cmd-outredirect executed command output to /dev/nul
--attach-task=pidattach monitoring to the thread specified by pid
--trigger-code-start-address=addr,..start monitoring when execution reaches addr
--trigger-code-stop-address=addr,...stop monitoring when execution reaches addr
--trigger-data-start-address=addr,..start monitoring when data location addr is accessed
--trigger-data-stop-address=addr,...stop monitoring when data location addr is accessed
--trigger-code-repeatrepeat code trigger start/stop (default: off)
--trigger-data-repeatrepeat data trigger start/stop (default: off)
--trigger-data-rodata triggers activated only on read only access (default: read-write)
--trigger-data-wodata triggers activated only on write only access (default: read-write)
--trigger-code-followcode trigger start/stop continues across fork,vfork,pthread_create (default: off)
--trigger-data-followdata trigger start/stop continues across fork,vfork,pthread_create (default: off)
--follow-forkmonitoring continues across fork, i.e., in the new process (default: off)
--follow-vforkmonitoring continues across vfork, i.e., in the new process (default: off)
--follow-pthreadmonitoring continues across pthread_create, i.e., in the new thread (default: off)
--follow-allunion of --follow-fork, --follow-vfork, --follow-pthread, --follow-exec (default: off)
--follow-exec[=regex]monitoring will continue across exec, when regular expression is provided, will follow
--follow-exec-exclude[=regex]similar to --follow-exec, except that a match will stop the monitoring (default: off)
--exec-split-resultsdifferentiate monitoring sessions before and after a call to exec() (default: off)
--pin-commandpin executed command on the same processors as specified in --cpu-list (system-wide only)
--dont-startdo not start monitoring (per-thread only)
--switch-timeout=millisecondsnumber of milliseconds before switching to the next event set
--cpu-set-relativeCPU identification in --cpu-list relatie to cpu_set affinity (default: off)
--extra-smpl-pmdsspecify list of extra PMDs to include in samples
--print-interval=millisecondsprint intermediate deltas of counts in system-wide
--smpl-show-top=nprint only top n lines of the profile
--smpl-per-functioncollapse samples per function (default: per-addresses)
--smpl-cum-thres=pprint samples up to p percent of cumulative samples (default: 100)
--smpl-eager-savesave sampling output when session ends(default: when all sessions end)
--no-una-detectdo not try to detect unavailable registers(default: off)
--smpl-compactgenerate row/column text sampling output (default: off) - NEW in 3.9
--smpl-rawgenerate raw(binary) sampling output (default: off) - NEW in 3.9

For historical reasons, many generic examples were run on Itanium 2. They work similarly on other processor architectures.


3. Getting event information

The list of events supported by pfmon depends on the host PMU. You can get the list of supported events using the following pfmon option:

   % pfmon -l
   CPU_CYCLES
   IA64_INST_RETIRED
   IA64_TAGGED_INST_RETIRED_PMC8
   IA64_TAGGED_INST_RETIRED_PMC9
   INST_DISPERSED
   EXPL_STOPBITS
   ALL_STOPS_DISPERSED
   IA32_INST_RETIRED
   ISA_TRANSITIONS
   NOPS_RETIRED
   ....

Using the -L option, it is possible to list all events and unit masks (sub-event). On each line, pfmon prints an event name, colon, one of its unit masks, if any. Note that sometimes, unit masks can be combined, but this is not shown by this option. Here is an example on Intel Core (click here for more on unit masks):

  $ pfmon -L
   ....
   L1D_CACHE_LD:MESI
   L1D_CACHE_LD:I_STATE
   L1D_CACHE_LD:S_STATE
   L1D_CACHE_LD:E_STATE
   L1D_CACHE_LD:M_STATE
   L1D_CACHE_ST:MESI
   ....

If you specify an argument to the -l option (no space between l and the argument), it is interpreted as a regular expression and all matching events will be listed:

   % pfmon -ll1d
   L1D_READ_FORCED_MISSES_RETIRED
   L1D_READ_MISSES_RETIRED
   L1D_READS_RETIRED
   PIPELINE_FLUSH_L1D_WAYMP_FLUSH

You can get more detailed information about each event using the following option (shown for Itanium2):

   % pfmon -i nops_retired
   Name   : NOPS_RETIRED
   VCode  : 0x50
   Code   : 0x50
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 6  (Threshold [0-5])
   Qual   : [Instruction Address Range] [OpCode Match] 
   Group  : None
   Set    : None
   Desc   : Retired NOP Instructions

Pfmon is case insensitive for event names. Here you see some details about the event. The first 4 lines are generic and provided on all PMU models even though the codes may vary:

  • Code is the event code used by the PMU.
  • Vcode is a libpfm internal event code which encapsulates the event code and other information describing the type of the event. For simple events, the two codes are usually identical.
  • PMD/PMC: list the counting monitors on which this event can be programmed. Not all events can necessarily be programmed on all available counting monitors. This constraint is taken care of by the libpfm library.

The remaining information is specific to the Itanium 2 PMU.

Even with the -i option, you can use a regular expression for the event:

   % pfmon -i'writes$'
   Name   : L2_DATA_REFERENCES_L2_DATA_WRITES
   VCode  : 0x200b2
   Code   : 0xb2
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0010
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 4  (Threshold [0-3])
   Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range] 
   Group  : L2 Cache
   Set    : 1
   Desc   : Data Read/Write Access to L2 -- count only data write and semaphore operations

On some PMU models, the events information contains a text description of the event. Events can be specified using their code:

   % pfmon -i 0x45
   Name   : L2_INST_PREFETCHES
   VCode  : 0x45
   Code   : 0x45
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 1  (Threshold 0)
   Qual   : [Instruction Address Range] 
   Group  : None
   Set    : None
   Desc   : L2 Instruction Prefetch Requests

For some PMU models, the unit masks are clearly spelled out and can be specified and conbined. For example, on an Intel Core 2 Duo processor, the following event:

% pfmon -iPAGE_WALKS
Name     : PAGE_WALKS
Code     : 0xc
Counters : [ 0 1 ]
Desc     : Number of page-walks executed
Umask-00 : 0x01 : [COUNT] : Number of page-walks executed
Umask-01 : 0x02 : [CYCLES] : Duration of page-walks in core cycles
PEBS     : No

has two unit masks. See here for more information on how to specify unit masks on the command line

Information about what each event measures can be found in the relevant CPU model specific micro-architecture documentation.


4. Basic counting with pfmon

No instrumentation of the program is required to monitor the system or a single thread.

4.1 Simple examples

To collect counts on a specific command, you just need to launch it via pfmon, just like you would do with the time or strace command:

   % pfmon ls /dev/null
   /dev/null
       620458 CPU_CYCLES

When invoked with no particular event, pfmon uses a default event which depends on the PMU model. It is typically measuring elapsed cycles.

To monitor specific events, you can type:

   % pfmon -e cpu_cycles,IA64_inst_Retired ls /dev/null
   /dev/null
                    436368 CPU_CYCLES       
                    513437 IA64_INST_RETIRED

As you can see, pfmon is not case sensitive with regards to event names. More than one event can be measured at a time using a comma separated list of events. You MUST not have space after the comma.

If the command you want to run takes options, you can clearly distinguish the options of pfmon from the options of your command using the '--' separator:

   % pfmon -e ia64_inst_retired -- ls -ial /dev/null
   3679352 crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                    557392 IA64_INST_RETIRED

Otherwise, pfmon will interpret your command's options as pfmon options.

4.2 Privilege levels

By default, pfmon monitors only what is going at the user level (application level). This is true for both per-thread and system wide mode.

The number of privilege levels depends on the processor architecture. Most processors support at least 2 levels (user and kernel). It is also possible to monitor at several levels at the same time by specifying more than one level. On processors with only 2 privilege levels, options -2 and -1 are ignored are not supported.

The levels can be specified for all events or on a per-event basis.

To affect all events, you can use any combinations of -k (-0), -1, -2, -u (or -3). To set the level for each event, the --priv-levels option must be used.

By default, pfmon only measures at the user level:

   % pfmon -e nops_retired ls /dev/null

counts the number of NOPS_RETIRED when ls is running at the user level only (equivalent to specifying -u or -3).

   % pfmon -k -e nops_retired ls /dev/null

counts the number of NOPS_RETIRED when ls is running at the kernel level only.

   % pfmon -k -u -e nops_retired ls /dev/null

counts the number of NOPS_RETIRED when ls is running at the kernel level or user level, i.e. all the time.

It is possible to refine the settings on a per event basis using the --priv-levels option.

   % pfmon -e loads_retired,nops_retired ls /dev/null

Both events are measured at the user level only.

   % pfmon --priv-level=u,k -e loads_retired,nops_retired ls /dev/null

LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the kernel level only.

   % pfmon --priv-level=,uk -e loads_retired,nops_retired ls /dev/null

LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the user and kernel levels.

   % pfmon -k --priv-level=uk -e loads_retired,nops_retired ls /dev/null

LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the kernel level only.

4.3 Counter formats

Pfmon can display the final counts in various formats. There are 4 formats defined. The default one is shown in the example above. To make is easier to read large numbers or to feed the number to other programs.

By default, you get the raw number:

   % pfmon ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724882 CPU_CYCLES

The --us-counter-format option has the thousands, millions, billions are separated with commas (US and UK style):

   % pfmon --us-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724,882 CPU_CYCLES

The --eu-counter-format option has the thousands, millions, billions are separated with points (European style):

   % pfmon --eu-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724.822 CPU_CYCLES

The --hex-counter-format options shows counts in hexadecimal format:

   % pfmon --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                    0x00000000000b0f56 CPU_CYCLES
4.4 Saving counts

By default, the counts are printed on the controlling tty. However it is possible to save them in a file using the --outfile option:

   % pfmon --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   % cat b
        0x00000000000b7cad CPU_CYCLES

It is possible to include a header with the results using the --with-header option. It will be printed on the controlling tty or saved in the output file. The header contains detailed information about the configuration of the host machine and on the monitoring session:

   % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   % cat b
   #
   # date: Thu Dec 22 07:57:46 2005
   #
   # hostname: caldera.hpl.hp.com
   #
   # kernel version: Linux 2.6.15-rc6 #8 SMP Wed Dec 21 08:47:10 PST 2005
   #
   # pfmon version: 3.2
   # kernel perfmon version: 2.2
   #
   #
   # page size: 16384 bytes
   #
   # CLK_TCK: 1000 ticks/second
   #
   # CPU configured: 2
   # CPU online    : 2
   #
   # physical memory          : 4170711040 bytes (3977.5 MB)
   # physical memory available: 3002400768 bytes (2863.3 MB)
   #
   # host CPUs:  2-way 1500MHz/6.0MB Itanium 2 (Madison, B1)
   #	PAL_A: 0.5.37
   #	PAL_B: 0.5.65
   #	Cache levels: 3 Unique caches: 4
   #	L1D:    16384 bytes, line  64 bytes, load_lat   1, store_lat   3
   #	L1I:    16384 bytes, line  64 bytes, load_lat   1, store_lat   0
   #	L2 :   262144 bytes, line 128 bytes, load_lat   5, store_lat   7
   #	L3 :  6291456 bytes, line 128 bytes, load_lat  14, store_lat   7
   #
   #
   # information for set0:
   #	number of runs : 1
   #	active duration: 2043539 cycles
   #	PMD4: CPU_CYCLES = user
   #
   #
   #
   # instruction sets for set0:
   #	PMD4: CPU_CYCLES = ia32/ia64
   #
   #
   # pfmon command line: ./pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
   #
   # monitoring mode  : per-thread
   # command line     : ls -l /dev/null 
   #
   # process id       : 19259
   # thread  id       : 19259
   # parent process id: -1
   # parent thread  id: -1
   # exec count       : 1
   #
   #
   #
   0x00000000000e8203 CPU_CYCLES
4.5 Hiding program output

It is possible to redirect the output of the programs you execute under pfmon to /dev/null. This is desireable if you are interested in keeping a clean output and are not using an output file. By default, programs share the same terminal output as pfmon. You can use the --no-cmd-output option to eliminate the output.

Normal output, we see the result of the "ls" command:

   % pfmon --us-counter-format ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                   724,882 CPU_CYCLES

Using the option, you get:

   % pfmon --no-cmd-output --us-counter-format ls -l /dev/null
                   724,882 CPU_CYCLES
4.6 Getting timing information

It is possible to get a time breakdown of the execution of the monitored command for both per-thread and system-wide mode using the --show-time option. The output is similar to the time(1) command. For instance:

   % pfmon --show-time -e nops_retired ls /dev/null
   /dev/null
                    145158 NOPS_RETIRED
   real 0h00m00.001s user 0h00m00.000s sys 0h00m00.000s
4.7 Testing event combinations

Sometimes it is handy to check if some events can be measured simultaneously without actually starting the monitoring session. The --check-events-only option of pfmon allows this mode of operation. It checks that the combination is valid and then exits. If the combination is invalid, it prints out the reason and pfmon exits with a return code of 1, otherwise a value of 0 is returned. On Itanium2, for instance, you can try:

   % pfmon --check-events-only -e loads_retired,stores_retired
   event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
   % echo $?
   1

Note that in this mode, you do not need to specify a command to execute.

4.8 Using event sets

You need to run on a kernel with perfmon v2.2 or higher to get this feature.

Pfmon can monitor more events than there are actual physical counters on the PMU. It does this by leveraging the event sets exposed by the perfmon interface. Each set can have as many events as there are counters in the PMU. You do not need to fully populate a set before defining another. In fact, you may not always be able to have full sets if certain events cannot be measured at the same time. In that case, you can place the conflicting events into different sets. Pfmon does not re-order events among sets to resolve any combination conflicts.

The order in which sets are specified determines the switching order. Sets are multiplexed onto the actual PMU by the kernel. The rate of switching is determined by a timeout which must be specified with the --switch-timeout option.

   % pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=100 ls /dev/null
   /dev/null
   set1 was not activated, set1 counts set to zero. Try decreasing the switch timeout.
   911397 CPU_CYCLES
   122096 LOADS_RETIRED
        0 NOPS_RETIRED
        0 STORES_RETIRED

In this example, the command ran for too short to get a chance to switch to the 2nd set. You can try decreasing the timeout:

   % pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=1 ls /dev/null
    646412 CPU_CYCLES
    107934 LOADS_RETIRED
    378464 NOPS_RETIRED
     29832 STORES_RETIRED

When using multiple sets, pfmon scales the counts based on the number of runs of each set. it is VERY important to understand that results are just estimates of what the real result would be if we had monitored all events at the same time. Depending on the switching rate, results may be off.

Note also that the granularity of the underlying timer is limited and depends on the hardware platform and kernel compile options. A 10ms granularity is common on many platforms. That means that any timeout is rounded up to the closest multiple of 10ms. Pfmon does print a warning is the difference between the requested and effective switch timeout is bigger than 2%:

   % pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=1 ls /dev/null
  ...
  requested switch timeout 1 msecs
  effective switch timeout 10 msecs (+900.00%)
  Warning: host kernel timer granularity only allows the following timeout:
        requested switch timeout 1 msecs
        effective switch timeout 10 msecs (+900.00% offset)
  ...

Event related options such as --priv-levels can be specified per set. They always apply to the set that precedes them on the command line. Here is an example:

   % pfmon -ecpu_cycles,loads_retired --prev-level=u,k -enops_retired,stores_retired --switch-timeout=1 ls /dev/null

Here the privilege level of the events will be as follows:

  • set0: cpu_cycles = user
  • set0: loads_retired = kernel
  • set1: nops_retired = user (default priv level)
  • set1: stores_retired = user (default priv level)

Pfmon does not support using multiple event sets when sampling.

4.9 Using events with unit masks

For some PMU models, events can have sub-events called unit masks. There may be more than one unit mask possible and they can oftentimes be combined. This way one can measure multiple events using only one counter. The libpfm library supports unit masks on X86 processors and for some events of the dual-core Itanium 2 processor.

To figure out the list of available unit mask for a given event, you can use the -i option of pfmon. Here is an example for an Intel Core 2 Duo processor:

   % pfmon -i BUS_DATA_RCV
   Name     : BUS_DATA_RCV
   Code     : 0x64
   Counters : [ 0 1 ]
   Desc     : Bus cycles while processor receives data
   Umask-00 : 0x40 : [SELF] : This core
   Umask-01 : 0xc0 : [BOTH_CORES] : Both cores
   PEBS     : No

If an event has unit masks, pfmon fails when at least of unit mask is not specified:

   % pfmon -ebus_data_rcv date
   event bus_data_rcv needs at least a unit mask among: [SELF] [BOTH_CORES]

Unit masks can be specified on the command line of pfmon using a simple event_name:umask1:umask2... notation:

   % pfmon -ebus_data_rcv:both_cores ls /dev/null
   /dev/null
   33931 BUS_DATA_RCV:BOTH_CORES

Multiple unit mask can be combined for a given event:

   % pfmon -isimd_inst_retired
   Name     : SIMD_INST_RETIRED
   Code     : 0xc7
   Counters : [ 0 1 ]
   Desc     : Retired Streaming SIMD Extensions (SSE) packed-single instructions
   Umask-00 : 0x01 : [PACKED_SINGLE] : Retired Streaming SIMD Extensions (SSE) packed-single instructions
   Umask-01 : 0x02 : [SCALAR_SINGLE] : Retired Streaming SIMD Extensions (SSE) scalar-single instructions
   Umask-02 : 0x04 : [PACKED_DOUBLE] : Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions
   Umask-03 : 0x08 : [SCALAR_DOUBLE] : Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions
   Umask-04 : 0x10 : [VECTOR] : Retired Streaming SIMD Extensions 2 (SSE2) vector integer instructions
   Umask-05 : 0x1f : [ANY] : Retired Streaming SIMD instructions (precise event)
   PEBS     : [ANY]

To count PACKED_SINGLE and PACKED_DOUBLE, you need to specify the event as follows:

   % pfmon -esimd_inst_retired:PACKED_SINGLE:PACKED_DOUBLE -- foo
   849346660 SIMD_INST_RETIRED:PACKED_SINGLE:PACKED_DOUBLE

To verify that this actually combined the two unit mask, you can measure them in separate counters:

   % pfmon -esimd_inst_retired:PACKED_DOUBLE,simd_inst_retired:PACKED_SINGLE -- foo
   419430400 SIMD_INST_RETIRED:PACKED_DOUBLE
   429916260 SIMD_INST_RETIRED:PACKED_SINGLE

5. Monitoring running tasks

This version of pfmon can "attach" to an already running thread, monitor for a while and eventually "detach" from the thread. This is very useful when monitoring system daemons or workload that are very long to start. The thread to attach to is specified using the --attach-task option with a thread id as reported by the ps command or the gettid() system call. You must have permission to access the process for this option to work. It is not possible to attach to kernel threads, such as kswapd, migration, khubd, kjournald even if you are root.

Let us look at some examples. Suppose we have this long running program called noploop and we want to monitor it once it is started:

   % noploop 1000000000000000 &
   % ps ax | fgrep noploop
   22731 pts/7    R      1:33 nueh/noploop 1000000000000000
   % pfmon --attach-task=22731
   session interrupted by user
              120106020603 CPU_CYCLES

This pfmon command attaches to thread 22731 and collects CPU_CYCLES in user level until the thread terminates. It is possible to interrupt if you want to stop before the thread terminates by simply pressing CTRL-C (control C) on your keyboard. This prints the counts so far and detaches from the program which will continue to run.

It is possible to use a session timeout to delimit the duration of such measurement using the --session-timeout or -t option:

   % pfmon -t5 --attach-task=22731
   session timed out
                4988897963 CPU_CYCLES

By default, pfmon attaching to a multi-threaded program, only the designated thread is monitored. However, pfmon prints a warning indicating more threads exist:

   $ ps -L ax
   6017  6017 pts/0    Rl+    0:02 ./pthreadtest 100000000000 4
   6017  6018 pts/0    Rl+    0:02 ./pthreadtest 100000000000 4
   6017  6019 pts/0    Rl+    0:02 ./pthreadtest 100000000000 4
   6017  6020 pts/0    Rl+    0:02 ./pthreadtest 100000000000 4
   6017  6021 pts/0    Rl+    0:02 ./pthreadtest 100000000000 4

   $ pfmon --attach-task=6018 -t5
   [6018] is multi-threaded(5 threads), attached only to thread [6018], use --follow-pthread to attach to all threads
   session terminated by timeout expiration
   9486380168 UNHALTED_CORE_CYCLES

As indicated, by using the --follow-pthread option, pfmon will automatically attach to all existing threads, and will follow any newly created thread thereafter:

   $ pfmon --attach-task=6018 --follow-pthread -t5
   session terminated by timeout expiration
    8109083977 UNHALTED_CORE_CYCLES /home/eranian/perfmon/pfmon/tests/pthreadtest (6017,6020,17754)
    9338362249 UNHALTED_CORE_CYCLES /home/eranian/perfmon/pfmon/tests/pthreadtest (6017,6019,17754)
    9417537655 UNHALTED_CORE_CYCLES /home/eranian/perfmon/pfmon/tests/pthreadtest (6017,6018,17754)
    9077620072 UNHALTED_CORE_CYCLES /home/eranian/perfmon/pfmon/tests/pthreadtest (6017,6017,17754)
   11716960509 UNHALTED_CORE_CYCLES /home/eranian/perfmon/pfmon/tests/pthreadtest (6017,6021,17754)

6. System wide sessions

When the --system-wide option is used, pfmon operates in system wide mode. This means that it does not monitor a specific program anymore but instead all the processes that execute on a specific set of CPUs. In this mode, you do no need to specify a command. You do NOT need to be root to create a system wide session. It is possible to exclude the idle task from a system-wide measurement.

A system wide session cannot co-exist with any per-thread sessions. But a system wide session can run concurrently with other system wide sessions as long as they do not monitor the same set of CPUs.

6.1 Selecting CPUs to monitor

This version of pfmon supports up to 2048 processors. By default, pfmon monitors on ALL online CPUs for a system-wide measurement. It is possible to restrict the set of CPUs using the --cpu-list option.

The CPU are identified by their number. By default numbers are relative to their global logical number as seen by the kernel on boot. For example, on a 2-way machine with 2 online CPUs:

   %  pfmon -v --system-wide
   ...
   selected CPUs (2 CPU in set, 2 CPUs online): CPU0 CPU1
   ...
   <press ENTER to stop session>
   CPU0                         85724 CPU_CYCLES
   CPU1                         63743 CPU_CYCLES

If the shell that pfmon is run from is part of a CPU set for which includes only a subset of the online CPUs, then pfmon will only monitor on the processors part of that set:

   % cat /proc/self/cpuset
   /test1
   % cat /dev/cpuset/test1/cpus
   1
   % pfmon -v --system-wide
   ...
   selected CPUs (1 CPU in set, 2 CPUs online): CPU1
   ...
   <press ENTER to stop session>
   CPU1    11873 CPU_CYCLES

Notice that by default pfmon uses the global CPU number. Here CPU1 is CPU number one for the 2-way system, not for the set.

It may be beneficial to use a set relative CPU numbering scheme when collecting performance results. For this it is necessary to use the --cpu-set-relative option:

   % pfmon -v --system-wide --cpu-set-relative
   ...
   selected CPUs (1 CPU in set, 2 CPUs online): CPU0
   ...
   <press ENTER to stop session>
   CPU0    11873 CPU_CYCLES

If you want to restrict to a specific CPU (global or set-relative), you can use the --cpu-list command:

   % pfmon --system-wide --cpu-list=1 -u -e cpu_cycles,ia64_inst_retired
   <press ENTER to stop session>
   CPU1    3757 CPU_CYCLES
   CPU1    1636 IA64_INST_RETIRED

The CPU are numbered starting at 0. You can specify multiple CPUs using a comma separated list withous spaces:

   % pfmon --system-wide --cpu-list=0,1 -u -e cpu_cycles,ia64_inst_retired
   <press ENTER to stop session>
   CPU0                        146933 CPU_CYCLES
   CPU0                        134885 IA64_INST_RETIRED
   CPU1                          6189 CPU_CYCLES
   CPU1                          4624 IA64_INST_RETIRED

It is also possible to specify ranges of CPUs:

   % pfmon --system-wide --cpu-list=0-1 -u -e cpu_cycles,ia64_inst_retired
   <press ENTER to stop session>
   CPU0                        105186 CPU_CYCLES
   CPU0                         87627 IA64_INST_RETIRED
   CPU1                       5713933 CPU_CYCLES
   CPU1                       7066832 IA64_INST_RETIRED

Ranges and individual specification can be combined in a single comma-separated list:

   % pfmon --system-wide --cpu-list=0,2-3 -u -e cpu_cycles,ia64_inst_retired

This above example monitors on CPU0, CPU2, and CPU3.

6.2 Delimiting a system wide session

There are three ways to delimit a system wide session. By default, the session will terminate when the user press the <Enter> key. It is also possible to use a timeout expressed in seconds. Finally, the session can also be delimited by the execution of a command. It will start when the command starts and stops when it terminates. Here are some examples.

Monitor CPU cycles and instructions retired on the first two CPUs at both user and kernel levels and wait for a key-press to stop:

   % pfmon --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <press ENTER to stop session>
   CPU0                   325728352 CPU_CYCLES       
   CPU0                   853076156 IA64_INST_RETIRED
   CPU1                   325869036 CPU_CYCLES       
   CPU1                   853618095 IA64_INST_RETIRED

Monitor cpu_cycles and instruction retired on the first two CPUs at both user and kernel levels for 10 seconds, use US counter format:

   % pfmon --us-c --session-timeout=10 --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
   <session to end in 10 seconds>
   CPU0                15,001,469,784 CPU_CYCLES
   CPU0                44,896,922,932 IA64_INST_RETIRED
   CPU1                15,001,483,004 CPU_CYCLES
   CPU1                43,518,135,055 IA64_INST_RETIRED

Monitor CPU cycles and instructions retired on the first two CPUs at the user level only during the execution of the ls command (here obviously run on CPU0):

   % pfmon --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   CPU0                      802960 CPU_CYCLES       
   CPU0                      581288 IA64_INST_RETIRED
   CPU1                       44363 CPU_CYCLES       
   CPU1                       10666 IA64_INST_RETIRED

To make the measurement meaningful, it is likely that the user will need the command to be run on the same subset of processors that pfmon is measuring. To ensure that this happens, it is necessary to use the --pin-command option. It changes the processor affinity of the executed command to be equal to the set of processors being monitored. Of course, it is still possible for the command to modify its own affinity later on, therefore the option must be used with caution.

6.3 Results aggregation

It is possible to aggregate counts when monitoring more than one CPU:

   % pfmon --system-wide -k -e cpu_cycles,ia64_inst_retired --session-timeout=10
   <session to end in 10 seconds>
   CPU0                   15001363310 CPU_CYCLES
   CPU0                   44933161824 IA64_INST_RETIRED
   CPU1                   15001462725 CPU_CYCLES
   CPU1                   44967038560 IA64_INST_RETIRED

   % pfmon --system-wide -k -e cpu_cycles,ia64_inst_retired --session-timeout=10 --aggr
   <session to end in 10 seconds>
   30002030938 CPU_CYCLES
   89897861578 IA64_INST_RETIRED

Aggregation supports event set multiplexing:

   % pfmon --system-wide -k -e cpu_cycles -eia64_inst_retired --switch-timeout=100 --session-timeout=10 --aggr
   <session to end in 10 seconds>
   29710925962 CPU_CYCLES
   90797169643 IA64_INST_RETIRED

In which case, the per CPU results are summed. Pfmon does not allow different events to be monitored on different CPUs. For this you can run separate instances of pfmon with a different CPU list, using a command lines similar to:

   % pfmon --session-timeout=10 --cpu-list=0 --system-wide -k -e cpu_cycles &
   % pfmon --session-timeout=10 --cpu-list=1 --system-wide -k -e ia64_inst_retired &
6.4 Delayed start

Sometimes, it may be useful to delay the activation of monitoring until a certain point in the execution is reached. This is the case when the initialization must not be included in the counts.

In system-wide mode only, pfmon provides the --trigger-delay-start option which can be used to specify a delay in second before monitoring is activated:

   % pfmon --system-wide --trigger-start-delay=5  --session-timeout=10

Wait for 5 seconds before activating a 10s monitoring session on all CPUs.

6.5 Saving results in files

Results can be saved into file(s) for system-wide session using the same --outfile option as for per-thread sessions. When results are not aggregated you get one file per cpu. The filename provided via the option is prepended to .cpuX where X is the CPU number:

   % pfmon --outfile=b --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
   crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
   % cat b.cpu0
                       774233 CPU_CYCLES       
                    557904 IA64_INST_RETIRED
   % cat b.cpu1
                        54799 CPU_CYCLES       
                     32728 IA64_INST_RETIRED

You can add a header in each file by specifying the --with-header option.

6.6 Printing intermediate counts

In system-wide mode, when counting, you may want to see intermediate results. With the --print-interval option, pfmon print deltas of events at regular intervals. The --print-interval option takes the number of milliseconds between prints. Deltas are generated per event and per processor. There is no final counts printed in this mode. The --print-interval option is not available when aggregating results (--aggregate-results), when sampling and in per-thread mode.

Here is a simple example on 2-way machine where we measure for 6s and print every 2s:

   % pfmon --system-wide --us-c --print-interval=2000 -t6
   <session to end in 6 seconds>
   #
   CPU0                        91,528 CPU_CYCLES
   CPU1                       176,120 CPU_CYCLES
   #
   CPU0                         1,024 CPU_CYCLES
   CPU1                       294,496 CPU_CYCLES
   #
   CPU0                         2,050 CPU_CYCLES
   CPU1                       103,525 CPU_CYCLES

7. Monitoring multiple processes/threads

This feature of pfmon is only meaningful in per-thread mode.

7.1 Basic monitoring of multiple processes/threads

When monitoring complicated workloads it is useful to be able to monitor more than just the first process or thread which typically only do the initialization the bulk of the work being carried out by worker processes/threads. To illustrate the various options offered by pfmon we use the example of the GNU C compiler gcc invoked to compile a trivial "hello.c" program. Several processes are created as part of the compilation:

   % gcc -v hello.c -o hello -O2 -static
   Reading specs from /usr/lib/gcc-lib/ia64-linux/2.96/specs
   gcc version 2.96 20000731 (Debian GNU/Linux IA64 experimental)
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -v -D__GNUC__=2 -D__GNUC_MINOR__=96...
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccM0VFYv.i -mb-step -quiet -dumpbase hello.c...
   ...
   as -x -o /tmp/ccMo7exq.o /tmp/ccas6Y5Q.s
   ...
   /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/lib/gcc-lib/...

The edited output reveals that gcc uses a total of 5 processes to compile hello.c.

Now, let's look at how pfmon can monitor all or some of the processes involved here.

By default only the first process is monitored:

   % pfmon --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                 1,123,406 IA64_INST_RETIRED

How do we know that only one process is monitored? You can enable the verbose mode of pfmon:

   % pfmon --verbose --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   symbol file for gcc not found
   measuring at user privilege level ONLY
   [pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   [pmd4 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 rand=n]
   exec-pattern=*
   [10950] started task: gcc hello.c -o hello -O2 -static 
   follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
   [10950] monitoring gcc hello.c -o hello -O2 -static ...
   [10950] activating monitoring
   measurements started at Thu Oct 30 17:10:58 2003
   
   [10950] task exited
                    1,123,406 IA64_INST_RETIRED
   [10950] detached
   created tasks        : 1
   maximum tasks        : 1
   maximum active tasks : 1
   measurements completed at Thu Oct 30 17:10:59 2003

Now if we want to monitor all processes/threads, we can use some of the --follow-* option that pfmon offers:

  • --follow-fork : pfmon monitors the child process created via a regular fork(). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the child process.
  • --follow-vfork : pfmon monitors the child process created via a vfork(). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the child process.
  • --follow-pthread: pfmon monitors the a new pthread is created via pthread_create(). This works for both the old LinuxThreads library and the New Posix Thread Library (NPTL). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the new thread.
  • --follow-exec : pfmon monitors the process across the execve() system call. No distinct "session" is created. By default, pfmon does not monitoring across exec().
  • --follow-all : combines all of the above options.

In case you are not sure which variation of the fork to use, it is best to use --follow-all. Now we run our gcc test case with the --follow-all option:

   % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
	 48,406,141 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (21927,21927,21926)
	  5,438,463 IA64_INST_RETIRED as (21928,21928,21926)
	410,621,894 IA64_INST_RETIRED /usr/bin/ld (21930,21930,21929)
	    570,537 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (21929,21929,21926)
	  1,277,759 IA64_INST_RETIRED gcc (21926,21926,-1)

This time, pfmon reports 5 different results for 5 different processes. Let us examine a line of the output:

	         5,438,463 IA64_INST_RETIRED as (21928,21928,21926)
		 ^         ^                 ^   ^     ^     ^
                 |         |                 |   |     |     |
		 |         |                 |   |     |     parent process id
		 |         |                 |   |     thread id (tid)
		 |         |                 |   process id (pid)
		 |         |                 command name
		 |         event name
		 total count for the process

A parent process id of -1 indicates the first process.

It is important to keep in mind that for child processes, by default, the "total count" does include execution up to the initial fork, i.e., before the exec:

    m   | ->     vfork() ----> parent goes to wait()
    o   |          |
    n   |          |
    i   |          |
    t   |        execve()
    o   |          |
    r   |        execution of the command
    e   |          |
    d   |->       exit()

The pid vs. tid distinction is only relevant for threads AND when using NPTL. Otherwise both numbers should be identical. The pid, tid, parent pid can be used to re-create the process tree involved with the execution of gcc. Results are reported in order of termination of the process. This is why "gcc" is reported last. The entire command line is not reported, only the first argument, which is the command name.

You can trace what is going on with the --verbose option:

   % pfmon --verbose --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   loaded 14150 text symbols from /proc/kallsyms
   loaded 3008 data symbols from /proc/kallsyms
   symbol file for gcc not found
   measuring at user privilege level ONLY
   1 event sets defined
   long  sampling periods(val/mask/seed): 0/0x0/0
   short sampling periods(val/mask/seed): 0/0x0/0
   [pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   pmd setup for event set0:
   [pmd4 set=0 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 randomize=n]
   exec-pattern=*
   [22131] started task: gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static
   follow_exec=y follow_vfork=y follow_fork=y follow_pthread=y
   [22131] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [22131] activating monitoring
   measurements started at Wed Jan  4 06:27:29 2006
   
   [22131] vfork [22132]
   [22132] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [22132] activating monitoring
   [22132] exec /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet /home/eranian/nueh/...
   [22132] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet /home/eranian...
   [22132] task exited
   set0 runs=1 duration=32122704
                   48,406,141 IA64_INST_RETIRED
   /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (22132,22132,22131)
   [22132] detached
   [22131] vfork [22133]
   [22133] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [22133] activating monitoring
   [22133] exec as -x -o /tmp/ccy0rssc.o /tmp/ccA8krGb.s ...
   [22133] monitoring as -x -o /tmp/ccy0rssc.o /tmp/ccA8krGb.s ...
   [22133] task exited
   set0 runs=1 duration=5629233
                    5,438,463 IA64_INST_RETIRED as (22133,22133,22131)
   [22133] detached
   [22131] vfork [22134]
   [22134] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [22134] activating monitoring
   [22134] exec /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp/hello...
   [22134] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp...
   [22134] vfork [22135]
   [22135] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [22135] activating monitoring
   [22135] exec /usr/bin/ld -static -o /tmp/hello /usr/lib/gcc/ia64-linux-gnu/4....
   [22135] monitoring /usr/bin/ld -static -o /tmp/hello /usr/lib/gcc/ia64-linux-...
   [22135] task exited
   set0 runs=1 duration=274328917
                  410,625,058 IA64_INST_RETIRED /usr/bin/ld (22135,22135,22134)
   [22135] detached
   [22134] task exited
   set0 runs=1 duration=1687796
                      570,537 IA64_INST_RETIRED
   /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (22134,22134,22131)
   [22134] detached
   [22131] task exited
   set0 runs=1 duration=2196384
                    1,277,759 IA64_INST_RETIRED gcc (22131,22131,-1)
   [22131] detached
   created tasks        : 5
   maximum tasks        : 3
   maximum active tasks : 3
   measurements completed at Wed Jan  4 06:27:29 2006
 

The trace reveals that vfork() is used to create the new processes, therfore we could have used --follow-vfork and --follow-exec to obtain the same results.

7.2 Saving results in files

It is possible to use the --outfile option when you are using any of the --follow-* options. There will be one file per process/thread, really per task in Linux terms unless you decide to aggregate the results using the --aggregate-results options.

Using our gcc compilation example, we can now do:

   % pfmon --outfile=result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   $ ls -l result.*
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1931.1931
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1933.1933
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1934.1934
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1935.1935
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1936.1936
   -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1937.1937

There is one file per process created. The naming convention used by pfmon is as follows:

       result.1937.1937
       ^      ^    ^
       |      |    |
       |      |    thread id
       |      process id
       base filename

Unless you are using threads (LinuxThreads or NPTL) both the process id and the thread id should be identical. You can use the --with-header option to get detailed header in each output file.

   % pfmon --with-header --outfile=result --follow-all --us-c --no-cmd-out \
     -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   % ls -l result.*
   -rw-r--r--    1 eranian  users        1416 Nov  4 15:32 result.2791.2791
   -rw-r--r--    1 eranian  users        1816 Nov  4 15:32 result.2793.2793
   -rw-r--r--    1 eranian  users        1497 Nov  4 15:32 result.2794.2794
   -rw-r--r--    1 eranian  users        1428 Nov  4 15:32 result.2795.2795
   -rw-r--r--    1 eranian  users        1790 Nov  4 15:32 result.2796.2796
   -rw-r--r--    1 eranian  users        1760 Nov  4 15:32 result.2797.2797
   % cat result.2797.2797
   ...
   #
   # command line     : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/../../../crt1.o ...
   #
   # process id       : 2797
   # thread  id       : 2797
   # parent process id: 2796
   # parent thread  id: 2796
   # exec count       : 1

The "command line" line shows the program for this task with represents the invocation of the linker, ld. The following set of lines shows the pid/tid and parent information.

7.3 Separating execution between fork()/vfork() and exec()

Pfmon can differentiate the execution that happens between the fork/vfork and exec using the --exec-split-result option.

If we use our gcc example from the previous section, here is what happens:

Without the option:

   % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
		48,406,141 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (5278,5278,5277)
                 5,438,463 IA64_INST_RETIRED as (5279,5279,5277)
	       410,617,535 IA64_INST_RETIRED /usr/bin/ld (5281,5281,5280)
	           570,537 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (5280,5280,5277)
		 1,277,759 IA64_INST_RETIRED gcc (5277,5277,-1)

With the option:

   % pfmon --exec-split-result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
	      1,190 IA64_INST_RETIRED gcc (19682,19682,19681)
	 48,404,951 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19682,19682,19681)
	      4,128 IA64_INST_RETIRED gcc (19683,19683,19681)
	  5,434,335 IA64_INST_RETIRED as (19683,19683,19681)
	         91 IA64_INST_RETIRED gcc (19684,19684,19681)
	      1,170 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19685,19685,19684)
	410,620,724 IA64_INST_RETIRED /usr/bin/ld (19685,19685,19684)
	    570,446 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19684,19684,19681)
	  1,277,759 IA64_INST_RETIRED gcc (19681,19681,-1)

This time we have more counts reported, but that does not mean we had more processes created just because of our measurement. It simply shows a finer grain breakdown of the same measurement. It is fairly easy to rebuild the process tree:

                 gcc (19681,19681,-1)
       vfork ->        gcc (19682,19682,19681)
       exec  ->        /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19682,19682,19681)

       vfork ->        gcc (19683,19683,19681)
       exec  ->        as (19683,19683,19681)

       vfork ->        gcc (19684,19684,19681)
       exec  ->        /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19684,19684,19681)
       vfork ->            /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19685,19685,19684)
       exec ->        	   /usr/bin/ld (19685,19685,19684)

When results are saved into a file, pfmon automatically adds an extra field to the file name when the --exec-split-results option is used. This is necessary to distinguish between the various parts of the execution of the task because the pid and the tid remain the same across exec:

   % pfmon --exec-split-result --outfile=qq --follow-all --us-c --no-cmd-out \
     -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
   % ls -l qq.*
   -rw-r--r--  1 eranian users 28 Jan  4 03:50 qq.19708.19708.1
   -rw-r--r--  1 eranian users 24 Jan  4 03:50 qq.19709.19709.0
   -rw-r--r--  1 eranian users 29 Jan  4 03:50 qq.19709.19709.1
   -rw-r--r--  1 eranian users 24 Jan  4 03:50 qq.19710.19710.0
   -rw-r--r--  1 eranian users 28 Jan  4 03:50 qq.19710.19710.1
   -rw-r--r--  1 eranian users 21 Jan  4 03:50 qq.19711.19711.0
   -rw-r--r--  1 eranian users 26 Jan  4 03:50 qq.19711.19711.1
   -rw-r--r--  1 eranian users 24 Jan  4 03:50 qq.19712.19712.0
   -rw-r--r--  1 eranian users 30 Jan  4 03:50 qq.19712.19712.1

Notice, the third field in the filenames:

       qq.19709.19709.1
       ^  ^     ^     ^
       |  |     |     |
       |  |     |     exec number
       |  |     thread id
       |  process id
       base filename

The process 19709 did exec once. The qq.19709.19709.0 contains the results for the initial portion of the execution of the process, i.e., before the first exec. The file qq.19709.1 contains the results for the portion of execution after the first exec.

As for the initial process, gcc in this case, it shows up as qq.19708.19708.1. The exec count is always 1 for this program because we count the initial exec from the pfmon tool (after the vfork).

It is very helpful to use the --with-header option to record detailed information about the results. For instance, in the example above:

   % cat qq.19727.19727.0
   ...
   # monitoring mode  : per-thread
   # command line     : gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static
   #
   # process id       : 19727
   # thread  id       : 19727
   # parent process id: 19726
   # parent thread  id: 19726
   # exec count       : 0
   #
   ...

Looking at the other file:

   % cat qq.19727.19727.1
   ...
   # monitoring mode  : per-thread
   # command line     : /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet ...
   #
   # process id       : 19727
   # thread  id       : 19727
   # parent process id: 19726
   # parent thread  id: 19726
   # exec count       : 1
   ...
7.4 Selectively monitoring processes

When using the --follow-exec option it is possible to specify a regular expression pattern to filter out some part of execution. This is best explain with an example. If we use our gcc example once again and suppose we only care about the results for the assembler, "as", then we can use the following:

   % pfmon --follow-all  --follow-exec="^as" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                    5,496,093 IA64_INST_RETIRED as (11065,11065,11061)

The "^" before as is to ensure that we match the pattern "as" only at the beginning of the command line.

Pfmon only measures and reports execution of the "as process". Of course, here one has to be careful by what is meant by "as process". As we saw in the previous section, the "as process" starts its life as a "gcc process" that eventually execs the assembler, as. In this particular example, pfmon did not have monitoring enabled at the time the "gcc process" is created, therefore monitoring ONLY starts at the exec.

If you want to include more than one command and cannot easily express this with a regular expression, you can use the opposite command to --follow-exec which is --follow-exec-exclude. Commands that match are not monitored. In our gcc example, suppose we want to measure everything except gcc:

 
   % pfmon --follow-all  --follow-exec-exclude="^gcc" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
	 48,404,951 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19747,19747,19746)
	  5,434,335 IA64_INST_RETIRED as (19748,19748,19746)
	410,621,894 IA64_INST_RETIRED /usr/bin/ld (19750,19750,19749)
	    570,446 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19749,19749,19746)

These options may seem frivolous at first because a simple grep of the full output would yield the same results but they become particularly useful when pfmon is used for sampling. Also by not monitoring when not necessary we limit the overhead.

7.5 Aggregating results from multiple processes/threads

By default, pfmon produces output for every actively monitored task followed. You may want to only get the aggregate results. In this case, you simply need to use the --aggregate-results and pfmon will add up all the counts per event for all tasks.

Let us take our compilation example and use the LOADS_RETIRED event to gain some more stability in the counts:

   % pfmon --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
	 4482486 LOADS_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19772,19772,19771)
	  540310 LOADS_RETIRED as (19773,19773,19771)
	42851738 LOADS_RETIRED /usr/bin/ld (19775,19775,19774)
	   66243 LOADS_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19774,19774,19771)
	  122887 LOADS_RETIRED gcc (19771,19771,-1)

Now with aggregation:

   % pfmon --aggregate-results --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
        48063561 LOADS_RETIRED

8. Triggering monitoring at specific locations

Pfmon allows you to control where monitoring is started and stopped using code or data trigger addresses. By default pfmon starts monitoring at the first instruction of a program and stop when the program terminates, i.e., the process exits. This feature exists on all platform and does not use the PMU hardware but the debug registers. As such it introduces more overhead and should be use carefully, i.e. the section of the execution where monitoring is active must be long enough compared to the overhead.

This ability to control where monitoring starts/stops is only available in per-thread mode.

8.1 Basic code triggers

You can start or stop monitoring when execution reaches a certain instruction in your program. The instruction is identified by its address. There is no particular checks made on the address, pfmon does not try to validate the address inside the process user level address space. However pfmon may reject addresses referring to kernel code. This feature cannot be used to trigger monitoring for kernel level execution. This limitation is due to the fact that triggers are implemented using the debug registers and it is illegal to set a breakpoint in the kernel.

The number of code triggers depends on the processor. If you have 4 triggers, then, you can have two start triggers and two stop triggers, but they do not need to be evenly balanced, i.e., one start and three stops is perfectly legal. You specify the start triggers using --trigger-code-start followed by a comma separated list of addresses or symbol names. Do not use space between the elements of the list. Similarly stops are expressed using --trigger-code-stop as a comma-separated list of addresses or symbols.

Supposing we set the start trigger at the beginning of a function, and if this function calls others, monitoring will stay active until we come across a stop trigger or the process terminates.

By default start and stop triggers are "executed" only once.

Let us look at a simple example with a stupid noploop program:

   % nm noploop | fgrep "T main"
   4000000000000450 T main

Count the number of load retired instruction for the entire execution:

   % pfmon --no-cmd-out -eloads_retired noploop 100000
   1714 LOADS_RETIRED

Count the number of load retired instruction starting at main():

   % pfmon --trigger-code-start=0x4000000000000450 --no-cmd-out -eloads_retired noploop 100000
   538 LOADS_RETIRED

Monitoring starts at the first instruction in main() and stop when noploop executes its last user level instruction (we only monitor at the user level here). From the two runs, we can figure out the number of loads executed by the startup code: 1714-538 = 1176. We can verify this by simply stopping when we reach main():

   % pfmon --trigger-code-stop=0x4000000000000450 --no-cmd-out -eloads_retired noploop 100000
   1176 LOADS_RETIRED

As you can see here, you do not need to match a start with a stop.

Instead of using addresses, you can use symbols names given that the symbol table is still available in the binary. So our last example can be expressed as follows:

   % pfmon --trigger-code-stop=main --no-cmd-out -eloads_retired noploop 100000
   1176 LOADS_RETIRED
8.2 Data triggers

Data triggers work very much like code triggers but they apply only to data accesses, i.e, load, store, synchronization instructions. By default the trigger operates on read or write access. The start triggers are specified with the --trigger-data-start option and the stop triggers with the --trigger-data-stop option. Both are comma-separated lists of data addresses or symbols. The trigger is activated when the byte pointed to by the address is accessed as part of any size of access.

When using symbols, the names refer to global variables in the main program when linked dynamically or for the entirety of the program when linked statically, i.e., no dynamic symbol lookup is performed.

It is possible to refine the type of access that we active the triggers. By default they are activated on read OR write. You can restrict this with the following options if supported by underlying processors:

  • --trigger-data-ro : activate on read access only
  • --trigger-data-wo : activate on write access only
8.3 Repeating triggers

In certain cases it is useful to repeat the start/stop trigger sequence each time the program comes across the trigger points. In the case of a code trigger, that is how you could accumulate a measurement specific to a function, for instance. By default pfmon only uses each trigger once. You can repeat the code triggers using --trigger-code-repeat and ALL code start/stop trigger will be repeated. Similarly using --trigger-data-repeat will keep the data trigger installed throughout the execution.

8.4 Inheritance of triggers

By default, triggers are not inherited across fork,vfork,pthread_create. Triggers are NEVER inherited across exec(). By inheritance, we mean are not reinstalled into the newly created task. Trigger inheritance is possible using the --trigger-code-follow and --trigger-data-follow options.

8.5 Dynamic code stop triggers

Many times, code triggers are used to measure only one function and the ones it calls. However many functions can be called from different places and it is often hard to locate the return point where the stop trigger should be placed. Pfmon can dynamically figure out the return point of a function. When the code start and stop trigger are identical, pfmon interprets the options as a request to monitor when execution comes into a function until it leaves the function, i.e, return from the function. Pfmon does this using a best-effort approach that is not 100% guaranteed. Pfmon does not check that you place the start trigger at the beginning of a function. To determine the location where to place the stop trigger, pfmon uses the value stored in the return pointer at the time execution reaches the start trigger. Hence, if the start trigger is placed at the first instruction of the function, then the return pointers contains the place where the function will eventually return to. Note that this location is not necessarily just after where the call instruction, especially for leaf procedures. This feature is not 100% guaranteed and as such must be used by people who understand their code very well.

Here is a simple example using our noploop program:

   % pfmon --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
                            0 LOADS_RETIRED

To figure out what is going on, you can turn on verbose mode:

   % pfmon --verb --no-cmd-out --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
   loaded 14150 text symbols from /proc/kallsyms
   loaded 3008 data symbols from /proc/kallsyms
   table_size=1440 effective=945 min_addr=0x4000000000000140
   loaded 945 text symbols from ELF file /home/eranian/nueh/noploop
   table_size=1440 effective=408 min_addr=0x40000000000a5b60
   loaded 408 data symbols from ELF file /home/eranian/nueh/noploop
   measuring at user privilege level ONLY
   1 event sets defined
   long  sampling periods(val/mask/seed): 0/0x0/0
   short sampling periods(val/mask/seed): 0/0x0/0
   start code trigger @0x4000000000000390
   stop  code trigger @0x4000000000000390
   [pmc5=0x00cd28 thres=0 es=0xcd plm=8 umask=0x0 pm=0 ism=0x0 oi=1]
   LOADS_RETIRED
   [pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
   pmd setup for event set0:
   [pmd5 set=0 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0
   randomize=n]
   exec-pattern=*
   [21108] started task: /home/eranian/nueh/noploop 100000000
   follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
   [21108] monitoring /home/eranian/nueh/noploop 100000000 ...
   results are on terminal
   requested switch timeout 0s 0ms 0ns
   effective switch timeout 0s 0ms 0ns
   [21108] installed start code breakpoint at 0x4000000000000390
   [21108] monitoring not activated
   measurements started at Wed Jan  4 05:23:43 2006

   [21108] reached start code breakpoint @0x4000000000000390
   [21108] install br.ret breakpoint @0x40000000000004e0
   [21108] activating monitoring at trigger start
   [21108] clearing code breakpoint @0x4000000000000390
   [21108] reached stop  code breakpoint @0x40000000000004e0
   [21108] stopping monitoring at trigger stop
   [21108] clearing code breakpoint @0x40000000000004e0
   [21108] task exited
   set0 runs=1 duration=133782861
   0 LOADS_RETIRED
   [21108] detached
   created tasks        : 1
   maximum tasks        : 1
   maximum active tasks : 1
   measurements completed at Wed Jan  4 05:23:43 2006

The key line in the trace is the one that shows the dynamically extracted stop trigger:

      [21108] install br.ret breakpoint @0x40000000000004e0

When the start trigger is reached at 0x4000000000000390, i.e., the beginning of noploop(), pfmon queries the kernel to get the return pointer, and it gets 0x40000000000004e0 which is the address in main() where noploop() will return to.


9. Dealing with symbols

Whenever an option takes an address (code or data) as argument, it is possible to directly use a symbol name rather than an address. For instance, this is true for the --trigger-code-*, --trigger-data-* options. Other examples include the Itanium2 options, --drange, --irange options.

9.1 Symbol table sources

Pfmon extracts symbol tables from ELF images on the filesystem. Kernel symbols are extracted them from /proc/kallsyms.

9.2 Symbol lookups

In per-thread mode, the symbols are automatically extracted from the monitored program. That includes the main program, shared libraries as well as libraries explictely loaded via dlopen(). In system-wide mode, the program name is provided whenever possible.

In version 3.8, the --symbol-file option is deprecated.

In version 3.8, the --sysmap-file option is deprecated.

For the kernel, pfmon automatically loads the content of /proc/kallsyms. It should be noted that /proc/kallsyms does not provide the size of symbols. Thus, pfmon approximates by assuming that symbols do not overlap and that there is no gap.

9.3 Symbols and the --follow-* options

In per-thread mode, symbol lookups are done before the command is actually launched therefore they pertain only to the initial program (including its shared libraries). Thus, it is not possible to refer to the name of a symbol from a program that will eventually be executed. For instance, it is not possible to name a symbol from the assembler "as" when monitoring a compilation with GNU C (gcc).

9.4 Symbol resolution when sampling

It is possible to resolve code and data addresses present in samples when using the --resolve-addresses option.

With this option, pfmon resolves code and/or data addresses it finds in each sample. Resolution occurs for kernel level symbols and user applications if the symbol table is found. When no matching symbol exists, the raw address is printed. The symbol name is followed by an offset in hexadecimal representing the number of bytes from the beginning of the symbol. The location of the symbol, either kernel or application (app) is appended to the symbol name.

In system-wide, only kernel symbols are resolved.

9.5 Symbol resolution examples

Here are a few examples on Itanium or Itanium2 and the model specific --checkpoint-func option:

Count the number of time main() is called in the noploop program:

    % file noploop
    noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
    % pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop 10000
                         3 IA64_INST_RETIRED

Here the symbol information for main() is directly extracted from noploop itself. You must divide by 3 (because 3 instructions were executed in the first bundle) to get the actual result.

Count the number of times main() is called in the noploop-stripped program:

    % file noploop-stripped
    noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped
    % pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop-=stripped 10000
    symbol not found: main

Let us provide the non-stripped version for the symbol table:

    % pfmon --no-cmd-out --symbol-file=noploop --checkpoint-func=main -e ia64_inst_retired noploop-stripped 1000
                             3 IA64_INST_RETIRED

Here noploop and noploop-stripped are the same programs except that the latter does not have the symbol table anymore.

Count the number of times sys_getpid() is called during the execution of the noploop program:

    % pfmon -k --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000
                             0 IA64_INST_RETIRED

The -k is required to make sure we monitor at the kernel level where sys_getpid() is located. The results shows that noploop never calls getpid(). We can easily verify that this is correct using the strace program:

    % strace noploop 1000
    execve("./noploop", ["noploop", "100000"], [/* 24 vars */]) = 0
    uname({sys="Linux", node="tuff.hpl.hp.com", ...}) = 0
    brk(0)                                  = 0x6000000000010000
    brk(0x6000000000034000)                 = 0x6000000000034000
    fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
    mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000000000
    write(1, "143663 cycles, 1.44 cycles/loop\n", 32143663 cycles, 1.44 cycles/loop) = 32
    munmap(0x2000000000000000, 65536)       = 0
    exit(0)                                 = ?

But if we use a program that does getpid() in a loop:

    % pfmon --no-cmd-out -k --checkpoint-func=sys_getpid -e ia64_inst_retired getpid 1000
                      3003 IA64_INST_RETIRED

This one does 1000+1 (3003/3) calls, obviously a 1000 from the loop plus an extra one somewhere else.


10. Sampling with pfmon

Pfmon has support for sampling on any events or combination of events. Samples are collected into a buffer which can then be written to a file or simply on the screen. Sampling is supported for both system-wide mode and per-thread mode including when following fork, vfork, and pthread_create.

Pfmon can sample is many different ways depending on the underlying CPU or what is being measured. For a given sampling measurement, pfmon can present the information differently. For instance it can be presented in clear text with decoding of certain timings, or samples can be dumped as a table with the values uninterpreted. What controls the output is called a pfmon sampling module and we detail what they do in the next sections, but first we need to introduce what sampling with pfmon means.

10.1 Principles

Sampling is a technique to record certain information at some interval. The information is recorded into a sample which is stored into a sampling buffer. The interval can be time-based or more likely event-based and the length can be fixed or variable. A typical example is the case where you want to record where you are (IP) each time you have executed one million instructions.

Each sample is composed of two parts, a fixed size header which contains general information about the sample and a variable size body which consists of a set of 64-bit values each one corresponding to a PMD register representing the other events being monitored. All samples record the same set of PMDs, this set is determined by pfmon based on what is requested by the user.

The sampling buffer is controlled by the kernel but its size is configurable. Depending on the sampling module, the default size of the buffer can vary. The common module size is 2048 entries. Each sampling module may offer options to change that size.

Sampling works as follows:

  1. the user specifies what needs to be recorded
  2. the user specifies the sampling period and optional randomization parameters
  3. at the end of a period, a sample is recorded into the buffer by the kernel
  4. a new sampling period is reloaded and execution/monitoring resumes. we go back to step 3
  5. if the sampling buffer becomes full, pfmon is notified
  6. pfmon processes the buffer, i.e., prints and/or saves the buffer depending on the sampling module
  7. pfmon then notifies the kernel that it is done
  8. the kernel reload new sampling periods and execution/monitoring resumes. we go back to 3

Pfmon and the kernel perfmon subsystem allow monitoring sessions with multiple sampling periods. In case where more than one sampling periods expire at the same time, there is one sample per expired period recorded, thereby multiple samples recorded, subject to the space left in the buffer. No partial sample is ever recorded.

10.2 Sampling periods

Pfmon (and the perfmon interface) uses two sampling periods instead of just one. The first one is called short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used in step 4, this is when the sampling buffer is not full after writing a sample. The long-smpl-period is used in step 8 when the reload occurs after the buffer became full.

But why do we need 2 periods?

As you might imagine there is some overhead when recording a sample. This overhead is increased even more when pfmon needs to get involved to drain the buffer. This operation can take some time and will inevitably introduce some "noise" in the measurements in the form of TLB and/or cache pollution. To try and hide this "noise", it is sometimes beneficial to adjust the sampling period, i.e., make it bigger to ensure that the next sample will not record an event that is the consequence of the overhead generated by the monitoring but rather a normal event for the program/system being monitored. It is expected that the long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to having only one sampling period. Note that the long-smpl-period is only used for the first sample recorded after the buffer is marked as empty again (step 7). Pfmon also use the long period first the initial period, i.e., for the first sample of the measurement.

The long sampling period is specified via the --long-smpl-periods options and similarly --short-smpl-period is used for the short sampling period.

   % pfmon --long-smpl-periods=5000 --short-smpl-period=2000 -e l3_misses ....

In the command line above, we take a sample every 2000 L3 misses at the user level. When the buffer becomes full, pfmon processes the data and restarts monitoring. The next sample will come after 5000 L3 misses, subsequent samples will be 2000 misses apart.

Pfmon supports multiple sampling periods. Each time one period elapses a sample is recorded. Hence with multiple sampling periods you can, in effect, create an OR condition: after N occurrences of event A OR M occurrences of event B record a sample. The number of sampling periods is limited by the number of counters supported by the underlying PMU. On Itanium and Itanium2 the limit is 4 counters, thus 4 sampling periods.

When multiple sampling periods are used, they must be specified as a comma-separated list (no space). They apply in the same order as the events are specified with the -e option:

   % pfmon --long-smpl-periods=4000,3000 --short-smpl-periods=2000,1000 -el2_misses,l3_misses ...

Here, the long and short periods and l2_misses are respectively 4000 and 2000. For l3_misses they are 300 and 1000 respectively.

At a minimum, You must specify either one of the two types of sampling periods. Pfmon automatically adjusts the other one to be equal in that case. Hence the following:

   % pfmon --long-smpl-periods=4000 -el2_misses ...

implies that the short period for l2_misses is 4000. You can see this with the --verbose option:

   % pfmon --verbose --long-smpl-periods=4000 -el2_misses ...
   loaded 10806 symbols from /proc/kallsyms
   loaded 1353 symbols from ELF file noploop
   measuring at user privilege level ONLY
   using detailed-itanium2 sampling module
   long  sampling periods(val/mask/seed): 2000/0x0/0
   short sampling periods(val/mask/seed): 2000/0x0/0
   ...

Sampling is not supported when multiple event sets are specified.

10.3 Sampling modules

There are many ways in which the samples can be saved or printed on the screen. Pfmon offers a choice of sampling module depending on the underlying CPU and what is being measured.

To find out the list of builtin modules, simply use the -I option:

   % pfmon -I
   detected host CPUs:  4-way 1600MHz/4.0MB -- Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (stepping 11)
   detected pfmon  PMU model: Intel Core
   detected libpfm PMU model: Intel Core
   max counters/set: 5
   supported pfmon PMU models: [AMD64] [Pentium 4] [Intel Core] [Intel Atom] [Intel Nehalem] [Intel architectural PMU] 
   supported sampling modules: [inst-hist] [detailed] [compact] [raw] [pebs] 
   pfmlib version: 3.9
   kernel perfmon version: 2.82
   kernel clock resolution: 1ns (1000000000Hz)
   host kernel architecture: x86_64

The first module listed, here [inst-hist], is selected by default.

Because some modules are dependent on the underlying PMU model, the choices may vary from one processor to another. The example below is for an Intel Dual-Core Itanium 2 processor:

   % pfmon -I
   detected host CPUs:  4-way 1666MHz/9.0MB Dual-Core Itanium 2 (rev 1)
   detected PMU model: dual-core Itanium 2
   max counters/set: 12
   supported PMU models: [Itanium] [Itanium2] [dual-core Itanium 2] [generic IA-64] 
   supported sampling modules: [inst-hist] [detailed] [compact] [raw] [dear-hist] 
   pfmlib version: 3.5
   kernel perfmon version: 2.81
   kernel clock resolution: 1000000ns (1000Hz)
   host kernel architecture: ia64

A sampling module is called when the sampling buffer becomes full and needs to be drained (dumped). Each module is free to process the data in any way it wants. Some of them can be specialized in ONLY a certain type of measurements.

The following formats are available for all processors:

[inst-hist] (default) : prints an instruction pointer based histogram. This module works for all CPU models. It is possible to specify multiple sampling periods, samples counts for each events are printed.
[detailed] prints detailed information about each sample. The header and body are decoded. Each PMD is detailed. No histrogram is generated.
[compact]Obsolete since version 3.9, use --smpl-compact instead.
[raw]Obsolete since version 3.9, use --smpl-raw instead

By default, pfmon auto-detects the host CPU and picks the first matching sampling module. On all PMU models, the inst-hist module is chosen by default. This can be changed using the the --smpl-module option:

   % pfmon --no-cmd-output --smpl-module=inst-hist --smpl-compact --long-smpl-periods=1000000 noploop 10000000
   0           11345    11345  3 0x400280 0x0001b4b738f07b06  17     0         -1000000 
   1           11345    11345  3 0x400280 0x0001b4b738f64a91  17     0         -1000000 
   2           11345    11345  3 0x400280 0x0001b4b738fc0ae9  17     0         -1000000 
   3           11345    11345  3 0x400280 0x0001b4b73901cad5  17     0         -1000000 
   4           11345    11345  3 0x400280 0x0001b4b739079e20  17     0         -1000000 
   5           11345    11345  3 0x400280 0x0001b4b7390d5eae  17     0         -1000000 
   ...

You can find out more about each module by using the -S or --smpl-module-info option:

   %  pfmon -S dear-hist
   name        : dear-hist
   description : Data EAR-based cache/tlb misses histograms
   PMU models  : [itanium] [itanium2] [dual-core Itanium 2] 
   options     :
        --smpl-inst-view                Show instruction address based histogram (default).
        --smpl-data-view                Show data address based histogram.
        --smpl-level-view               Show cache/tlb level based histogram.
        --smpl-sort-byvalue             Sort samples by their value.
        --smpl-sort-bylevel             Sort samples by cache/tlb hit level.
        --smpl-save-raw                 Save samples in binary format for offline processing.

The name of the module along with a short description is printed. Then, the list of supported PMU models. Finally, module-specific options are listed.

Each module can have module-specific options. Those options become known to pfmon ONLY once the module is known. Thus, you can specify module options ONLY after the module is specified with --smpl-module. this restriction does not apply to the default module (inst-hist):

   % pfmon -ecpu_cycles --smpl-module=dear-hist --smpl-data-view ....

You can list all the modules with all their options simply by typing:

   % pfmon --help
   ....
   options for "inst-hist" sampling format:
        none
   options for "detailed" sampling format:
        none
   options for "compact" sampling format:
        none
   options for "raw" sampling format:
        none
   options for "dear-hist" sampling format:
        --smpl-inst-view                Show instruction address based histogram (default).
        --smpl-data-view                Show data address based histogram.
        --smpl-level-view               Show cache/tlb level based histogram.
        --smpl-sort-byvalue             Sort samples by their value.
        --smpl-sort-bylevel             Sort samples by cache/tlb hit level.
        --smpl-save-raw                 Save samples in binary format for offline processing.
10.4 Basic sampling examples

Suppose you want to record a sample every 1,000,000 cycles. In each sample, you are interested in knowing where the program is, i.e., record the instruction pointer (IP-based sampling). The cycle event is the default event used by pfmon, so all you have to do is specify the sampling period(s):

   % pfmon --long-smpl-period=1000000 -- foo
   #                    event00 
   #     counts   %self    %cum          code addr 
          11470  69.00%  69.00% 0x00000000004005e9 
           4527  27.23%  96.23% 0x00000000004005f1 
            534   3.21%  99.44% 0x00000000004005e0 
             68   0.41%  99.85% 0x00000000004005f4 
             12   0.07%  99.92% 0x00000000004005e6 
              6   0.04%  99.96% 0x00000000004005ed 
              5   0.03%  99.99% 0x00000000004005fc 
              1   0.01%  99.99% 0xffffffff8021e49a 
              1   0.01% 100.00% 0x00000000004005c8 

The example shows shows the output you get with the inst-hist sampling module. The samples are sorted by the number of hits, i.e., the number of times the same instruction address was encountered. So here, the address 0x00000000004005e9 had 11,470 samples. This is where a lot of the time was actually spent. The second column shows the importance of the number of samples for a particular address relative to all the samples collected. Here the 11,470 samples represent 69% of all samples collected during the run. The third column print the cumulative percentage all the way to 100%. The fourth column is the code address.

We can sample on multiple events at the same time, by specifying multiple events and sampling periods:

   % pfmon --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- foo
   #                 event00                    event01 
   #  counts   %self    %cum     counts   %self    %cum          code addr 
       12220  75.43%  75.43%      83451  44.87%  44.87% 0x00000000004005e9 
        3447  21.28%  96.71%      85539  45.99%  90.85% 0x00000000004005f1 
         468   2.89%  99.60%      15283   8.22%  99.07% 0x00000000004005e0 
          31   0.19%  99.79%       1134   0.61%  99.68% 0x00000000004005f4 
          10   0.06%  99.85%        175   0.09%  99.77% 0x00000000004005e6 
          10   0.06%  99.91%         91   0.05%  99.82% 0x00000000004005ed 
           5   0.03%  99.94%        196   0.11%  99.93% 0x00000000004005fc 
           5   0.03%  99.98%          4   0.00%  99.93% 0xffffffff80235325 
           2   0.01%  99.99%          6   0.00%  99.93% 0x00000000004005c8 
  ...

Here we are sampling on elapsed cycles and data TLB misses. A sample is recorded every 1,000,000 cycles OR every 20,000 data TLB misses. There is now a second group of 3 columns for the second event. Columns are oredered based on event order. The example shows a correlation between the instruction where we spend a lot of time and the data TLB misses. Samples are always sorted based on the counts of the FIRST event

10.5 Sampling and symbol resolution example

It is possible to get instruction address resolution using the --resolve-addresses option:

 
   % pfmon --resolv --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- foo
   #                    event00                    event01
   #     counts   %self    %cum     counts   %self    %cum          code addr symbol
           6976  68.41%  68.41%       4684  49.96%  49.96% 0x00000000004005e9 mat_add_num+0x29<mcol>
           2754  27.01%  95.41%       4220  45.01%  94.97% 0x00000000004005f1 mat_add_num+0x31<mcol>
            376   3.69%  99.10%        301   3.21%  98.18% 0x00000000004005e0 mat_add_num+0x20<mcol>
             46   0.45%  99.55%         61   0.65%  98.83% 0x00000000004005ed mat_add_num+0x2d<mcol>
             34   0.33%  99.88%         86   0.92%  99.74% 0x00000000004005f4 mat_add_num+0x34<mcol>
              7   0.07%  99.95%          2   0.02%  99.77% 0x00000000004005e6 mat_add_num+0x26<mcol>
              2   0.02%  99.97%          0   0.00%  99.77% 0x000000000040076b main+0x10b<mcol>
              2   0.02%  99.99%         11   0.12%  99.88% 0x00000000004005fc mat_add_num+0x3c<mcol>
              1   0.01% 100.00%          0   0.00%  99.88% 0xffffffff8021e49a do_page_fault+0x292<kernel>
              0   0.00% 100.00%          8   0.09%  99.97% 0x0000000000400600 mat_add_num+0x40<mcol>
              0   0.00% 100.00%          2   0.02%  99.99% 0x00000000004005c8 mat_add_num+0x8<mcol>
              0   0.00% 100.00%          1   0.01% 100.00% 0x0000000000400603 mat_add_num+0x43<mcol>

The symbolic name of the function is printed followed by an offset expressed in bytes. It is followed by the name of the ELF file where the function is located. The full path is printed for libraries. When the symbol is not found, the raw address is repeated in the symbol column.

10.6 Controlling sampling output

By default, the profile is printed for each instruction address. Using the --smpl-per-function, it is possible to collapse the output by function:

   % pfmon --resolv --smpl-per-func --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
   #                    event00                    event01
   #     counts   %self    %cum     counts   %self    %cum          code addr symbol
          10273  99.98%  99.98%       9309 100.00% 100.00% 0x00000000004005c0 mat_add_num<mcol>
              1   0.01%  99.99%          0   0.00% 100.00% 0xffffffff802352d2 __do_softirq<kernel>
              1   0.01% 100.00%          0   0.00% 100.00% 0xffffffff8021e208 do_page_fault<kernel>

It is also possible to limit the number of samples printed on the screen using the following options:

  • --smpl-show-top=n : print at most n lines of samples
  • --smpl-cum-threshold=n: stop printing samples when the first event cumulative percentage reaches n%

Both options can be used simultaneously, printing stop when anyone limit is reached. Let us look at a few examples. First let's try to limit the number of lines to the top 5:

   % pfmon --resolv --smpl-show-top=5 --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
   #                    event00                    event01
   #     counts   %self    %cum     counts   %self    %cum          code addr symbol
           6973  67.81%  67.81%       4581  49.16%  49.16% 0x00000000004005e9 mat_add_num+0x29<mcol>
           2852  27.74%  95.55%       4245  45.55%  94.71% 0x00000000004005f1 mat_add_num+0x31<mcol>
            346   3.36%  98.91%        298   3.20%  97.91% 0x00000000004005e0 mat_add_num+0x20<mcol>
             59   0.57%  99.48%         95   1.02%  98.93% 0x00000000004005ed mat_add_num+0x2d<mcol>
             37   0.36%  99.84%         89   0.96%  99.88% 0x00000000004005f4 mat_add_num+0x34<mcol>

Second, let's try limiting output when we reach 96% of samples for the first event:

   % pfmon --resolv --smpl-cum-thresh=96 --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
   #                    event00                    event01
   #     counts   %self    %cum     counts   %self    %cum          code addr symbol
           6983  68.31%  68.31%       4502  48.51%  48.51% 0x00000000004005e9 mat_add_num+0x29<mcol>
           2755  26.95%  95.27%       4330  46.65%  95.16% 0x00000000004005f1 mat_add_num+0x31<mcol>
10.7 Writing samples to a file

By default, all sampling modules print to the controlling terminal. This may be overwhelming at times. Therefore, it is strongly recommended that samples be redirected to a file using the --smpl-outfile option. This option is different from the --outfile option. The latter is where the final counts are saved, whereas the former is where samples are saved. By default, no counts are printed in sampling mode.

The filename specified for the --smpl-outfile is just the basename. A suffix is added by pfmon. The suffix depends on the type of monitoring session. For a system-wide session, you get one file per cpu and the .cpuXX suffix is added with XX representing the CPU number, starting at zero. If aggregation is used then nothing is appended. In per-thread mode, the filename is also the basename. In case the follow-* option is used, you get one file per monitored thread.

10.8 Blocking on overflow notifications

Whenever the sampling buffer becomes full and pfmon is notified, you have the option of either letting the monitored program continue or block it. In both cases, monitoring is off during the processing of the sampling buffer. By default, pfmon lets the program continue its execution. It is possible to block the program using the --overflow-block option. Blocking the program ensures pfmon sees the entire execution. Keeping the program running ensures that the caches and TLB are kept somewhat warm, i.e., with some state belonging to the running process, especially on SMP systems. The option is supported in all types of per-thread sessions including the ones which are following execution through fork(), vfork(), and pthread_create(). This options is not supported in system-wide mode.

IMPORTANT:: this option may have some side effects for programs using signals. Signals may cause the blocking period to be terminated early, yet monitoring would not resume. Thus it is possible to get much fewer samples than expected. Use this option carefully.

10.9 Sampling and the --follow-* options

Sampling is supported across fork(),vfork(), and pthread_create(). It is also supported when sampling across exec().

By default, all sampling modules write to the controlling terminal, hence it may be difficult to distinguish which samples come from which task and output may be interleaved. However, pfmon does serialize terminal output. This means that samples output is batched per thread.

The best way to handle multi process/thread sampling sessions is to redirect the output to a file using the --smpl-outfile option. The filename passed to the option serves as a basename very much like what is done for the --outfile option that we described above.

Let us look at our compilation example:

   % pfmon --follow-all --no-cmd-out --smpl-outfile=gcc-smpl --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
   % ls -l gcc-smpl.*
   2174981 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24410.24410
   2175304 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24411.24411
   2175313 -rw-r--r--    1 eranian  users        1046 Nov  4 17:31 gcc-smpl.24412.24412
   2175323 -rw-r--r--    1 eranian  users         380 Nov  4 17:31 gcc-smpl.24413.24413
   2175327 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24414.24414
   2175337 -rw-r--r--    1 eranian  users       14634 Nov  4 17:31 gcc-smpl.24415.24415

Just like for --outfile, the process id and the thread id are appended to the provided filename. The thread id differs from the process id only when following across threads. For processes, the two numbers are identical:

 
       gcc-smpl.24410.24410
       ^        ^     ^
       |        |     |
       |        |     thread id
       |        process id
       base filename

You can get detailed information included in each file using the --with-header option, just like for --outfile.

The --exec-split-results and --follow-exec-* options have the same effects as on --outfile. The exec count is appended to the filename:

   % pfmon -exec-split --follow-all --no-cmd-out --smpl-outfile=gcc-smpl \
     --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
   % ls -l gcc-smpl.*
   2174981 -rw-r--r--    1 eranian  users        1683 Nov  4 17:38 gcc-smpl.25727.25727.1
   2175304 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25728.25728.0
   2175312 -rw-r--r--    1 eranian  users        1990 Nov  4 17:38 gcc-smpl.25728.25728.1
   2175317 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25729.25729.0
   2175323 -rw-r--r--    1 eranian  users        2717 Nov  4 17:38 gcc-smpl.25729.25729.1
   2175332 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25730.25730.0
   2175333 -rw-r--r--    1 eranian  users        1982 Nov  4 17:38 gcc-smpl.25730.25730.1
   2175336 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25731.25731.0
   2175337 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25731.25731.1
   2175341 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25732.25732.0
   2175342 -rw-r--r--    1 eranian  users       16568 Nov  4 17:38 gcc-smpl.25732.25732.1

If we look at the header of the largest file, we find out that this is the linker, which is exec'ed from the collect2 process:

   # command line     : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/...
   #
   # process id       : 25732
   # thread  id       : 25732
   # parent process id: 25731
   # parent thread  id: 25731
   # exec count       : 1
10.10 Sampling in system wide mode

Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the controlling terminal. When sampling on more than one CPU at a time, samples for each CPU are printed separately. When sampling results are redirected into a file with the --smpl-outfile option, you get one file per CPU. If the file is called 'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1, and so on:

   % pfmon --system-wide --smpl-outfile=sysout --long-smpl-periods=1000000 --session-timeout=3
   <session to end in 3 seconds>
   % ls -l sysout*
   sysout.cpu0  sysout.cpu1
   % ls -l sysout.cpu*
   -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu0
   -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu1

The --aggregate-results option changes the way samples are saved to files. When this option is used, samples are merged into a single file. In our example, they would go into 'myresults':

10.11 Randomization of sampling periods

Pfmon supports randomization of sampling periods. This is useful to minimize the risk of getting biased samples. This is especially true when sampling on events that happen very frequently such as taken branches. The user must supply a bitmask and a seed value using the --smpl-periods-random option. The seed value is ignored with perfmon v2.2 or higher. The same mask applies to both the long and short period for each event.

10.12 Miscellaneous sampling options

When sampling, the counts printed at the end of the session are not very useful, especially for the counters used as sampling periods. By default, pfmon does not print them at the end of a sampling session. You can change this by using the --smpl-print-counts option.

For certain measurements, it is interesting to put the sampling buffer in "saturation" mode. This means that monitoring stops as soon as the buffer fill up. Pfmon is not notified at all. In the end, we collect the first N samples, where N is determine by the size of the sampling buffer. The processes and session continues until normal termination, but no monitoring is active. You can turn on this mode using the --saturate-smpl-buffer.

By default, pfmon includes the PMD register values for all but the overflowed PMD in a sample. For instance, if you do the following:

   % pfmon -ecpu_cycles,ia64_inst_retired --long-smpl-periods=15000 ls

Assuming cpu_cycles is counted by PMD4 and ia64_inst_retired by PMD5, in each sample created by the overflow of PMD4, the value of PMD5 is saved in the body. In some configurations, it may be interesting to include other PMD register values in the body of a sample, e.g. some PMD may be directly be related to an event and thus cannot be specified on the command line. For these configurations, it is possible to use the --extra-smpl-pmds option:

   % pfmon -ecpu_cycles,ia64_inst_retired --extra-smpl-pmds=0,2-3 --long-smpl-periods=15000 ls

In this example, each sample will include the value of PMD5 (representing ia64_inst_retired) and also the values of PMD0, PMD2, PMD3. PMD register values are always stored in increasing index order. Note that the PMC registers, if any, corresponding to those extra PMD registers are not touched. As such they inherit their default values.


11. Getting support

The perfmon project (includes the kernel subsystem and related tools) has a web site here. The web site for earlier versions of pfmon is here.

There exists a mailing dedicated to perfmon related discussions. Go to the perfmon web site for more details.



Last updated 10/18/2007