|
Pfmon is a performance monitoring tool uniquely designed for Linux.
Pfmon can be used to count or sample on unmodified program or across an entire system.
Pfmon supports the performance counters of all Itanium processors, Intel Pentium M, P6, Pentium 4/Xeon,
Intel Core Duo/Solo, Intel Core 2 Duo, as well as AMD X86-64 processors.
This document is an attempt at providing some documentation on how to use pfmon.
The content covers pfmon version 3.2 only.
Pfmon version 3.2 requires
a Linux 2.6.x kernel. It does not work on 2.4.x kernels. For non Itanium platforms, it is necessary
to apply the kernel patch available on our download page.
Table of contents
- Introduction
- Command line options
- Getting event information
- Basic counting
- Simple examples
- Privilege levels
- Counter formats
- Saving counts
- Hiding program output
- Getting timing information
- Testing event combinations
- Using event sets
- Using event unit masks
- Monitoring running tasks
- System wide sessions
- Selecting CPUs to monitor
- Delimiting a session
- Results aggregation
- Delayed start
- Saving results in files
- Excluding the idle task
- Printing intermediate counts (NEW)
- Monitoring multiple processes/threads
- Basic monitoring
- Saving results in files
- Separating execution between fork()/vfork() and exec()
- Selectively monitoring processes
- Aggregating results from multiple processes/threads
- Triggering monitoring at specific locations
- Basic code triggers
- Data triggers
- Repeating triggers
- Inheritance of triggers
- Dynamic code stop triggers
- Dealing with symbols
- Symbol table sources
- Symbol lookups
- Symbols and the --follow-* options
- Symbols resolution and sampling
- Symbols resolution examples
- Sampling with pfmon (UPDATED)
- Principles
- Sampling periods
- Sampling modules
- Basic sampling examples
- Sampling and symbol resolution example
- Controlling sampling output
- Writing samples to a file
- Blocking on overflow notifications
- Sampling and the --follow-* options
- Sampling in system wide sessions
- Randomization of sampling periods
- Miscellaneous sampling options
- Getting support
For model-specific options, refer to the following manuals:
- Itanium 2 processor Support
- Dual-Core Itanium 2 (Montecito) processor Support
- AMD X86-64 processor Support
- Intel Pentium M/P6 processor Support
- Intel Core-based processor Support(NEW)
1. Introduction
Pfmon can be used to monitor unmodified binaries in its per-thread mode and it can
also be used to run system wide monitoring sessions. Such sessions are active across
all threads executing on a given CPU. Pfmon can launch a system wide session
on a dedicated CPU or set of CPUs in parallel.
Pfmon can monitor activities happening at the user and/or kernel level for both
type of sessions.
Pfmon can be used to collect basic event counts. It can also be used to sample
program or system execution.
In per-thread mode, pfmon can monitor the first (master) thread of a process but
it can also follow fork, vfork and pthread-create by dynamically creating additional monitoring
sessions.
Pfmon is based on a generic helper library called libpfm which is not included in this
package but can be downloaded and installed separately. This version of pfmon requires
libpfm-3.2 or higher.
In the remainder of this document, we describe the key options and features of pfmon
which are available on all CPU (PMU) models. Please refer to the model specific
documentation for advanced features.
This documentation is for pfmon-3.2 ONLY.
Pfmon version 3.2, requires
a Linux-2.6.x series kernel. It does not work on 2.4.x kernels. For non Itanium platforms, it is necessary
to apply the kernel patch available on our download page.
2. Pfmon command line options
The set of command line options provided by pfmon depends on the host PMU. It is possible
to compile pfmon for more than one PMU model and then it will auto-detect
the host PMU and provide the corresponding set of options.
The options common to all PMU models are as follows:
| -h, --help | display the list of supported options |
| -V, --version | display pfmon version information |
| -l[regex], --show-event-list[=regex] | show list of supported events by host PMU |
| -i event, --event-info=event | get information about a particular event |
| -u, -3, --user-level | monitor at the user level for all events |
| -k, -0, --kernel-level | monitor at the kernel level for all events |
| -2 | monitor at privilege level 2 |
| -1 | monitor at privilege level 1 |
| -e, --events=ev1,ev2,... | select events to monitor in one set |
| -I,--info | show pfmon information and detect host CPU |
| --verbose | print more information during execution |
| --debug | enable debug prints |
| --outfile=filename | save counts in a file |
| --append | when used with --outfile, will open the file in append mode |
| --overflow-block | block monitored program on overflow notifications |
| --system-wide | create a system wide monitoring session |
| --cpu-list=x,y-z,.. | list the cpu (numbers) to include in the system-wide monitoring session |
| -S format, --smpl-module-info=format | display information about a sampling module |
| -t secs, --session-timeout=secs | duration of the session in seconds |
| --smpl-outfile=filename | save samples in a file |
| --long-smpl-periods=val1,val2,... | set long sampling periods for each event |
| --short-smpl-periods=val1,val2,... | set short sampling period for each event |
| --with-header | generate a machine description header with counts and samples |
| --aggregate-results | aggregate counts and sampling buffer for system-wide sessions |
| --priv-levels=lvl1,lvl2,... | set privilege level per event |
| --show-time | show real,user, and system time for the executed command |
| --us-counter-format | print counters using commas (1,024) |
| --eu-counter-format | print counters using points (1.024) |
| --hex-counter-format | print counters in hexadecimal (0x400) |
| --smpl-module=module_name | specify the sampling module to use (default: autodetect) |
| --long-show-events[=regex] | display single-line detailed information about events |
| --symbol-file=filename | use the ELF archive filename to look for symbols |
| --sysmap-file=filename | use the System.map filename to look for symbols |
| --check-events-only | check that the event combination is valid and exit (no measurement) |
| --smpl-periods-random=mask1:seed1,... | randomize both the short and long sampling period |
| --resolve-addresses | resolve addresses to symbols in sampling output |
| --demangle-cpp | OBSOLETE option |
| --demangle-java | OBSOLETE option |
| --smpl-print-counts | print counter results when sampling (default: off) |
| --reset-non-smpl-periods | reset counters not use as sampling period for each sample |
| --saturate-smpl-buffer | collect samples only up to the limit of the buffer (default: off) |
| --no-cmd-out | redirect executed command output to /dev/nul |
| --attach-task=pid | attach monitoring to the thread specified by pid |
| --trigger-code-start-address=addr,.. | start monitoring when execution reaches addr |
| --trigger-code-stop-address=addr,... | stop monitoring when execution reaches addr |
| --trigger-data-start-address=addr,.. | start monitoring when data location addr is accessed |
| --trigger-data-stop-address=addr,... | stop monitoring when data location addr is accessed |
| --trigger-code-repeat | repeat code trigger start/stop (default: off) |
| --trigger-data-repeat | repeat data trigger start/stop (default: off) |
| --trigger-data-ro | data triggers activated only on read only access (default: read-write) |
| --trigger-data-wo | data triggers activated only on write only access (default: read-write) |
| --trigger-code-follow | code trigger start/stop continues across fork,vfork,pthread_create (default: off) |
| --trigger-data-follow | data trigger start/stop continues across fork,vfork,pthread_create (default: off) |
| --follow-fork | monitoring continues across fork, i.e., in the new process (default: off) |
| --follow-vfork | monitoring continues across vfork, i.e., in the new process (default: off) |
| --follow-pthread | monitoring continues across pthread_create, i.e., in the new thread (default: off) |
| --follow-all | union of --follow-fork, --follow-vfork, --follow-pthread, --follow-exec (default: off) |
| --follow-exec[=regex] | monitoring will continue across exec, when regular expression is provided, will follow |
| --follow-exec-exclude[=regex] | similar to --follow-exec, except that a match will stop the monitoring (default: off) |
| --exec-split-results | differentiate monitoring sessions before and after a call to exec() (default: off) |
| --pin-command | pin executed command on the same processors as specified in --cpu-list (system-wide only) |
| --dont-start | do not start monitoring (per-thread only) |
| --switch-timeout=milliseconds | number of milliseconds before switching to the next event set |
| --cpu-set-relative | CPU identification in --cpu-list relatie to cpu_set affinity (default: off) |
| --extra-smpl-pmds | specify list of extra PMDs to include in samples |
| --print-interval=milliseconds | print intermediate deltas of counts in system-wide(NEW) |
| --smpl-show-top=n | print only top n lines of the profile (NEW) |
| --smpl-per-function | collapse samples per function (default: per-addresses)(NEW) |
| --smpl-cum-thres=p | print samples up to p percent of cumulative samples (default: 100) (NEW) |
For historical reasons, all generic examples were run on Itanium 2. They work similarly on other
processor architectures.
3. Getting event information
The list of events supported by pfmon depends on the host PMU. You can get the list
of supported events using the following pfmon option:
% pfmon -l
CPU_CYCLES
IA64_INST_RETIRED
IA64_TAGGED_INST_RETIRED_PMC8
IA64_TAGGED_INST_RETIRED_PMC9
INST_DISPERSED
EXPL_STOPBITS
ALL_STOPS_DISPERSED
IA32_INST_RETIRED
ISA_TRANSITIONS
NOPS_RETIRED
....
If you specify an argument to the -l option (no space between l and the
argument), it is interpreted as a regular expression and all matching events
will be listed:
% pfmon -ll1d
L1D_READ_FORCED_MISSES_RETIRED
L1D_READ_MISSES_RETIRED
L1D_READS_RETIRED
PIPELINE_FLUSH_L1D_WAYMP_FLUSH
You can get more detailed information about each event using the following
option (shown for Itanium2):
% pfmon -i nops_retired
Name : NOPS_RETIRED
VCode : 0x50
Code : 0x50
PMD/PMC: [ 4 5 6 7 ]
Umask : 0000
EAR : No (N/A)
BTB : No
MaxIncr: 6 (Threshold [0-5])
Qual : [Instruction Address Range] [OpCode Match]
Group : None
Set : None
Desc : Retired NOP Instructions
Pfmon is case insensitive for event names. Here you see some details about the event.
The first 4 lines are generic and provided on all PMU models even though the codes may
vary:
- Code is the event code used by the PMU.
- Vcode is a libpfm internal event code which encapsulates the event code and other
information describing the type of the event. For simple events, the two codes are
usually identical.
- PMD/PMC: list the counting monitors on which this event can be programmed. Not
all events can necessarily be programmed on all available counting
monitors. This constraint is taken care of by the libpfm library.
The remaining information is specific to the Itanium 2 PMU.
Even with the -i option, you can use a regular expression for the event:
% pfmon -i'writes$'
Name : L2_DATA_REFERENCES_L2_DATA_WRITES
VCode : 0x200b2
Code : 0xb2
PMD/PMC: [ 4 5 6 7 ]
Umask : 0010
EAR : No (N/A)
BTB : No
MaxIncr: 4 (Threshold [0-3])
Qual : [Instruction Address Range] [OpCode Match] [Data Address Range]
Group : L2 Cache
Set : 1
Desc : Data Read/Write Access to L2 -- count only data write and semaphore operations
On some PMU models, the events information contains a text description of the event.
Events can be specified using their code:
% pfmon -i 0x45
Name : L2_INST_PREFETCHES
VCode : 0x45
Code : 0x45
PMD/PMC: [ 4 5 6 7 ]
Umask : 0000
EAR : No (N/A)
BTB : No
MaxIncr: 1 (Threshold 0)
Qual : [Instruction Address Range]
Group : None
Set : None
Desc : L2 Instruction Prefetch Requests
For some PMU models, the unit masks are clearly spelled out and can be specified and
conbined. For example, on an Intel Core 2 Duo processor, the following event:
% pfmon -iPAGE_WALKS
Name : PAGE_WALKS
Code : 0xc
Counters : [ 0 1 ]
Desc : Number of page-walks executed
Umask-00 : 0x01 : [COUNT] : Number of page-walks executed
Umask-01 : 0x02 : [CYCLES] : Duration of page-walks in core cycles
PEBS : No
has two unit masks. See here for more information on how to specify unit masks on the
command line
Information about what each event measures can be found in the relevant CPU model specific
micro-architecture documentation.
4. Basic counting with pfmon
No instrumentation of the program is required to monitor the system or a
single thread.
4.1 Simple examples
To collect counts on a specific command, you just need to launch it via pfmon, just like
you would do with the time or strace command:
% pfmon ls /dev/null
/dev/null
620458 CPU_CYCLES
When invoked with no particular event, pfmon uses a default event which depends on the PMU model.
It is typically measuring elapsed cycles.
To monitor specific events, you can type:
% pfmon -e cpu_cycles,IA64_inst_Retired ls /dev/null
/dev/null
436368 CPU_CYCLES
513437 IA64_INST_RETIRED
As you can see, pfmon is not case sensitive with regards to event names. More than one event
can be measured at a time using a comma separated list of events. You MUST not have space
after the comma.
If the command you want to run takes options, you can clearly distinguish the options of
pfmon from the options of your command using the '--' separator:
% pfmon -e ia64_inst_retired -- ls -ial /dev/null
3679352 crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
557392 IA64_INST_RETIRED
Otherwise, pfmon will interpret your command's options as pfmon options.
4.2 Privilege levels
By default, pfmon monitors only what is going at the user level
(application level). This is true for both per-thread and system wide
mode.
The number of privilege levels depends on the processor architecture. Most
processors support at least 2 levels (user and kernel). It is also possible to
monitor at several levels at the same time by specifying more than one level.
On processors with only 2 privilege levels, options -2 and -1 are ignored
are not supported.
The levels can be specified for all events or on a per-event basis.
To affect all events, you can use any combinations of -k (-0), -1, -2, -u (or -3).
To set the level for each event, the --priv-levels option must be used.
By default, pfmon only measures at the user level:
% pfmon -e nops_retired ls /dev/null
counts the number of NOPS_RETIRED when ls is running at the user level only
(equivalent to specifying -u or -3).
% pfmon -k -e nops_retired ls /dev/null
counts the number of NOPS_RETIRED when ls is running at the kernel level only.
% pfmon -k -u -e nops_retired ls /dev/null
counts the number of NOPS_RETIRED when ls is running at the kernel level
or user level, i.e. all the time.
It is possible to refine the settings on a per event basis using the
--priv-levels option.
% pfmon -e loads_retired,nops_retired ls /dev/null
Both events are measured at the user level only.
% pfmon --priv-level=u,k -e loads_retired,nops_retired ls /dev/null
LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
kernel level only.
% pfmon --priv-level=,uk -e loads_retired,nops_retired ls /dev/null
LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
user and kernel levels.
% pfmon -k --priv-level=uk -e loads_retired,nops_retired ls /dev/null
LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the
kernel level only.
4.3 Counter formats
Pfmon can display the final counts in various formats. There are 4 formats
defined. The default one is shown in the example above. To make is easier
to read large numbers or to feed the number to other programs.
By default, you get the raw number:
% pfmon ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
724882 CPU_CYCLES
The --us-counter-format option has the thousands, millions, billions are separated
with commas (US and UK style):
% pfmon --us-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
724,882 CPU_CYCLES
The --eu-counter-format option has the thousands, millions, billions are separated
with points (European style):
% pfmon --eu-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
724.822 CPU_CYCLES
The --hex-counter-format options shows counts in hexadecimal format:
% pfmon --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
0x00000000000b0f56 CPU_CYCLES
4.4 Saving counts
By default, the counts are printed on the controlling tty. However it is
possible to save them in a file using the --outfile option:
% pfmon --outfile=b --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
% cat b
0x00000000000b7cad CPU_CYCLES
It is possible to include a header with the results using the --with-header
option. It will be printed on the controlling tty or saved in the output
file. The header contains detailed information about the configuration of
the host machine and on the monitoring session:
% pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
% cat b
#
# date: Thu Dec 22 07:57:46 2005
#
# hostname: caldera.hpl.hp.com
#
# kernel version: Linux 2.6.15-rc6 #8 SMP Wed Dec 21 08:47:10 PST 2005
#
# pfmon version: 3.2
# kernel perfmon version: 2.2
#
#
# page size: 16384 bytes
#
# CLK_TCK: 1000 ticks/second
#
# CPU configured: 2
# CPU online : 2
#
# physical memory : 4170711040 bytes (3977.5 MB)
# physical memory available: 3002400768 bytes (2863.3 MB)
#
# host CPUs: 2-way 1500MHz/6.0MB Itanium 2 (Madison, B1)
# PAL_A: 0.5.37
# PAL_B: 0.5.65
# Cache levels: 3 Unique caches: 4
# L1D: 16384 bytes, line 64 bytes, load_lat 1, store_lat 3
# L1I: 16384 bytes, line 64 bytes, load_lat 1, store_lat 0
# L2 : 262144 bytes, line 128 bytes, load_lat 5, store_lat 7
# L3 : 6291456 bytes, line 128 bytes, load_lat 14, store_lat 7
#
#
# information for set0:
# number of runs : 1
# active duration: 2043539 cycles
# PMD4: CPU_CYCLES = user
#
#
#
# instruction sets for set0:
# PMD4: CPU_CYCLES = ia32/ia64
#
#
# pfmon command line: ./pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
#
# monitoring mode : per-thread
# command line : ls -l /dev/null
#
# process id : 19259
# thread id : 19259
# parent process id: -1
# parent thread id: -1
# exec count : 1
#
#
#
0x00000000000e8203 CPU_CYCLES
4.5 Hiding program output
It is possible to redirect the output of the programs you execute under pfmon to /dev/null.
This is desireable if you are interested in keeping a clean output and are not using an
output file. By default, programs share the same terminal output as pfmon. You can use the
--no-cmd-output option to eliminate the output.
Normal output, we see the result of the "ls" command:
% pfmon --us-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
724,882 CPU_CYCLES
Using the option, you get:
% pfmon --no-cmd-output --us-counter-format ls -l /dev/null
724,882 CPU_CYCLES
4.6 Getting timing information
It is possible to get a time breakdown of the execution of the monitored command for
both per-thread and system-wide mode using the --show-time option. The output is similar
to the time(1) command. For instance:
% pfmon --show-time -e nops_retired ls /dev/null
/dev/null
145158 NOPS_RETIRED
real 0h00m00.001s user 0h00m00.000s sys 0h00m00.000s
4.7 Testing event combinations
Sometimes it is handy to check if some events can be measured simultaneously without actually
starting the monitoring session. The --check-events-only option of pfmon allows this mode of
operation. It checks that the combination is valid and then exits. If the combination is
invalid, it prints out the reason and pfmon exits with a return code of 1, otherwise a value
of 0 is returned. On Itanium2, for instance, you can try:
% pfmon --check-events-only -e loads_retired,stores_retired
event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
% echo $?
1
Note that in this mode, you do not need to specify a command to execute.
4.8 Using event sets
You need to run on a kernel with perfmon v2.2 or higher to get this feature.
Pfmon can monitor more events than there are actual physical counters on the PMU. It does
this by leveraging the event sets exposed by the perfmon interface. Each set can have as
many events as there are counters in the PMU. You do not need to fully populate a set before defining
another. In fact, you may not always be able to have full sets if certain events cannot be
measured at the same time. In that case, you can place the conflicting events into different
sets. Pfmon does not re-order events among sets to resolve any combination conflicts.
The order in which sets are specified determines the switching order. Sets are multiplexed
onto the actual PMU by the kernel. The rate of switching is determined by a timeout which
must be specified with the --switch-timeout option.
% pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=100 ls /dev/null
/dev/null
set1 was not activated, set1 counts set to zero. Try decreasing the switch timeout.
911397 CPU_CYCLES
122096 LOADS_RETIRED
0 NOPS_RETIRED
0 STORES_RETIRED
In this example, the command ran for too short to get a chance to switch to the 2nd set. You can
try decreasing the timeout:
% pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=1 ls /dev/null
646412 CPU_CYCLES
107934 LOADS_RETIRED
378464 NOPS_RETIRED
29832 STORES_RETIRED
When using multiple sets, pfmon scales the counts based on the number of runs of each set.
it is VERY important to understand that results are just estimates
of what the real result would be if we had monitored all events at the same time. Depending on the
switching rate, results may be off.
Note also that the granularity of the underlying timer is limited and depends on the hardware platform
and kernel compile options. A 10ms granularity is common on many platforms. That means that any timeout
is rounded up to the closest multiple of 10ms. Pfmon does print a warning is the difference between the
requested and effective switch timeout is bigger than 2%:
% pfmon -ecpu_cycles,loads_retired -enops_retired,stores_retired --switch-timeout=1 ls /dev/null
...
requested switch timeout 1 msecs
effective switch timeout 10 msecs (+900.00%)
Warning: host kernel timer granularity only allows the following timeout:
requested switch timeout 1 msecs
effective switch timeout 10 msecs (+900.00% offset)
...
Event related options such as --priv-levels can be specified per set. They always apply to the set
that precedes them on the command line. Here is an example:
% pfmon -ecpu_cycles,loads_retired --prev-level=u,k -enops_retired,stores_retired --switch-timeout=1 ls /dev/null
Here the privilege level of the events will be as follows:
- set0: cpu_cycles = user
- set0: loads_retired = kernel
- set1: nops_retired = user (default priv level)
- set1: stores_retired = user (default priv level)
Pfmon does not support using multiple event sets when sampling.
4.9 Using events with unit masks
For some PMU models, events can have sub-events called unit masks. There may be more than one
unit mask possible and they can oftentimes be combined. This way one can measure multiple events using only one
counter. The libpfm library supports unit masks on X86 processors and for some events of the dual-core Itanium 2 processor.
To figure out the list of available unit mask for a given event, you can use the -i option of pfmon. Here
is an example for an Intel Core 2 Duo processor:
% pfmon -i BUS_DATA_RCV
Name : BUS_DATA_RCV
Code : 0x64
Counters : [ 0 1 ]
Desc : Bus cycles while processor receives data
Umask-00 : 0x40 : [SELF] : This core
Umask-01 : 0xc0 : [BOTH_CORES] : Both cores
PEBS : No
If an event has unit masks, pfmon fails when at least of unit mask is not specified:
% pfmon -ebus_data_rcv date
event bus_data_rcv needs at least a unit mask among: [SELF] [BOTH_CORES]
Unit masks can be specified on the command line of pfmon using a simple event_name:umask1:umask2... notation:
% pfmon -ebus_data_rcv:both_cores ls /dev/null
/dev/null
33931 BUS_DATA_RCV:BOTH_CORES
Multiple unit mask can be combined for a given event:
% pfmon -isimd_inst_retired
Name : SIMD_INST_RETIRED
Code : 0xc7
Counters : [ 0 1 ]
Desc : Retired Streaming SIMD Extensions (SSE) packed-single instructions
Umask-00 : 0x01 : [PACKED_SINGLE] : Retired Streaming SIMD Extensions (SSE) packed-single instructions
Umask-01 : 0x02 : [SCALAR_SINGLE] : Retired Streaming SIMD Extensions (SSE) scalar-single instructions
Umask-02 : 0x04 : [PACKED_DOUBLE] : Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions
Umask-03 : 0x08 : [SCALAR_DOUBLE] : Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions
Umask-04 : 0x10 : [VECTOR] : Retired Streaming SIMD Extensions 2 (SSE2) vector integer instructions
Umask-05 : 0x1f : [ANY] : Retired Streaming SIMD instructions (precise event)
PEBS : [ANY]
To count PACKED_SINGLE and PACKED_DOUBLE, you need to specify the event as follows:
% pfmon -esimd_inst_retired:PACKED_SINGLE:PACKED_DOUBLE -- foo
849346660 SIMD_INST_RETIRED:PACKED_SINGLE:PACKED_DOUBLE
To verify that this actually combined the two unit mask, you can measure them in separate
counters:
% pfmon -esimd_inst_retired:PACKED_DOUBLE,simd_inst_retired:PACKED_SINGLE -- foo
419430400 SIMD_INST_RETIRED:PACKED_DOUBLE
429916260 SIMD_INST_RETIRED:PACKED_SINGLE
5. Monitoring running tasks
This version of pfmon can "attach" to an already running thread, monitor for a while
and eventually "detach" from the thread. This is very useful when monitoring system
daemons or workload that are very long to start. The thread to attach to is specified
using the --attach-task option with a thread id as reported by the ps command or the
gettid() system call. You must have permission to access the process for this option to work.
It is not possible to attach to kernel threads, such as kswapd, migration, khubd, kjournald
even if you are root.
Let us look at some examples. Suppose we have this long running program called noploop and
we want to monitor it once it is started:
% noploop 1000000000000000 &
% ps ax | fgrep noploop
22731 pts/7 R 1:33 nueh/noploop 1000000000000000
% pfmon --attach-task=22731
session interrupted by user
120106020603 CPU_CYCLES
This pfmon command attaches to thread 22731 and collects CPU_CYCLES in user level until the thread
terminates. It is possible to interrupt if you want to stop before the thread terminates by
simply pressing CTRL-C (control C) on your keyboard. This prints the counts so far and detaches
from the program which will continue to run.
It is possible to use a session timeout to delimit the duration of such measurement using the
--session-timeout option:
% pfmon --session-timeout=5 --attach-task=22731
session timed out
4988897963 CPU_CYCLES
It is not possible to use the --session-timeout option with any of the -follow-* option is used.
But the --attach-task option can certainly be combined with any of the --follow-* options making
it very easy to monitor daemons that fork off dedicated processes such as xinetd.
6. System wide sessions
When the --system-wide option is used, pfmon operates in system wide mode. This means that
it does not monitor a specific program anymore but instead all the processes that execute
on a specific set of CPUs. In this mode, you do no need to specify a command. You do NOT
need to be root to create a system wide session. It is possible to exclude the idle task
from a system-wide measurement.
A system wide session cannot co-exist with any per-thread sessions. But a system wide session
can run concurrently with other system wide sessions as long as they do not monitor the same
set of CPUs.
6.1 Selecting CPUs to monitor
This version of pfmon supports up to 2048 processors. By default, pfmon monitors on ALL online
CPUs for a system-wide measurement. It is possible to restrict the set of CPUs using the
--cpu-list option.
The CPU are identified by their number. By default numbers are relative to
their global logical number as seen by the kernel on boot.
For example, on a 2-way machine with 2 online CPUs:
% pfmon -v --system-wide
...
selected CPUs (2 CPU in set, 2 CPUs online): CPU0 CPU1
...
<press ENTER to stop session>
CPU0 85724 CPU_CYCLES
CPU1 63743 CPU_CYCLES
If the shell that pfmon is run from is part of a CPU set for which includes
only a subset of the online CPUs, then pfmon will only monitor on the
processors part of that set:
% cat /proc/self/cpuset
/test1
% cat /dev/cpuset/test1/cpus
1
% pfmon -v --system-wide
...
selected CPUs (1 CPU in set, 2 CPUs online): CPU1
...
<press ENTER to stop session>
CPU1 11873 CPU_CYCLES
Notice that by default pfmon uses the global CPU number. Here CPU1 is CPU
number one for the 2-way system, not for the set.
It may be beneficial to use a set relative CPU numbering scheme when
collecting performance results. For this it is necessary to use the
--cpu-set-relative option:
% pfmon -v --system-wide --cpu-set-relative
...
selected CPUs (1 CPU in set, 2 CPUs online): CPU0
...
<press ENTER to stop session>
CPU0 11873 CPU_CYCLES
If you want to restrict to a specific CPU (global or set-relative),
you can use the --cpu-list command:
% pfmon --system-wide --cpu-list=1 -u -e cpu_cycles,ia64_inst_retired
<press ENTER to stop session>
CPU1 3757 CPU_CYCLES
CPU1 1636 IA64_INST_RETIRED
The CPU are numbered starting at 0. You can specify multiple CPUs using a comma
separated list withous spaces:
% pfmon --system-wide --cpu-list=0,1 -u -e cpu_cycles,ia64_inst_retired
<press ENTER to stop session>
CPU0 146933 CPU_CYCLES
CPU0 134885 IA64_INST_RETIRED
CPU1 6189 CPU_CYCLES
CPU1 4624 IA64_INST_RETIRED
It is also possible to specify ranges of CPUs:
% pfmon --system-wide --cpu-list=0-1 -u -e cpu_cycles,ia64_inst_retired
<press ENTER to stop session>
CPU0 105186 CPU_CYCLES
CPU0 87627 IA64_INST_RETIRED
CPU1 5713933 CPU_CYCLES
CPU1 7066832 IA64_INST_RETIRED
Ranges and individual specification can be combined in a single comma-separated list:
% pfmon --system-wide --cpu-list=0,2-3 -u -e cpu_cycles,ia64_inst_retired
This above example monitors on CPU0, CPU2, and CPU3.
6.2 Delimiting a system wide session
There are three ways to delimit a system wide session. By default, the
session will terminate when the user press the <Enter> key. It is also
possible to use a timeout expressed in seconds. Finally, the session can
also be delimited by the execution of a command. It will start when the
command starts and stops when it terminates. Here are some examples.
Monitor CPU cycles and instructions retired on the first two CPUs at both
user and kernel levels and wait for a key-press to stop:
% pfmon --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
<press ENTER to stop session>
CPU0 325728352 CPU_CYCLES
CPU0 853076156 IA64_INST_RETIRED
CPU1 325869036 CPU_CYCLES
CPU1 853618095 IA64_INST_RETIRED
Monitor cpu_cycles and instruction retired on the first two CPUs at both
user and kernel levels for 10 seconds, use US counter format:
% pfmon --us-c --session-timeout=10 --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
<session to end in 10 seconds>
CPU0 15,001,469,784 CPU_CYCLES
CPU0 44,896,922,932 IA64_INST_RETIRED
CPU1 15,001,483,004 CPU_CYCLES
CPU1 43,518,135,055 IA64_INST_RETIRED
Monitor CPU cycles and instructions retired on the first two CPUs at the
user level only during the execution of the ls command (here obviously run
on CPU0):
% pfmon --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
CPU0 802960 CPU_CYCLES
CPU0 581288 IA64_INST_RETIRED
CPU1 44363 CPU_CYCLES
CPU1 10666 IA64_INST_RETIRED
To make the measurement meaningful, it is likely that the user will need the
command to be run on the same subset of processors that pfmon is measuring.
To ensure that this happens, it is necessary to use the --pin-command option.
It changes the processor affinity of the executed command to be equal to the
set of processors being monitored. Of course, it is still possible for the
command to modify its own affinity later on, therefore the option must be
used with caution.
6.3 Results aggregation
It is possible to aggregate counts when monitoring more than one CPU:
% pfmon --system-wide -k -e cpu_cycles,ia64_inst_retired --session-timeout=10
<session to end in 10 seconds>
CPU0 15001363310 CPU_CYCLES
CPU0 44933161824 IA64_INST_RETIRED
CPU1 15001462725 CPU_CYCLES
CPU1 44967038560 IA64_INST_RETIRED
% pfmon --system-wide -k -e cpu_cycles,ia64_inst_retired --session-timeout=10 --aggr
<session to end in 10 seconds>
30002030938 CPU_CYCLES
89897861578 IA64_INST_RETIRED
Aggregation supports event set multiplexing:
% pfmon --system-wide -k -e cpu_cycles -eia64_inst_retired --switch-timeout=100 --session-timeout=10 --aggr
<session to end in 10 seconds>
29710925962 CPU_CYCLES
90797169643 IA64_INST_RETIRED
In which case, the per CPU results are summed. Pfmon does not allow different events to be
monitored on different CPUs. For this you can run separate instances of pfmon with a different
CPU list, using a command lines similar to:
% pfmon --session-timeout=10 --cpu-list=0 --system-wide -k -e cpu_cycles &
% pfmon --session-timeout=10 --cpu-list=1 --system-wide -k -e ia64_inst_retired &
6.4 Delayed start
Sometimes, it may be useful to delay the activation of monitoring until
a certain point in the execution is reached. This is the case when the
initialization must not be included in the counts.
In system-wide mode only, pfmon provides the --trigger-delay-start option
which can be used to specify a delay in second before monitoring is activated:
% pfmon --system-wide --trigger-start-delay=5 --session-timeout=10
Wait for 5 seconds before activating a 10s monitoring session on all CPUs.
6.5 Saving results in files
Results can be saved into file(s) for system-wide session using the same
--outfile option as for per-thread sessions. When results are not aggregated
you get one file per cpu. The filename provided via the option is prepended
to .cpuX where X is the CPU number:
% pfmon --outfile=b --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Dec 6 2002 /dev/null
% cat b.cpu0
774233 CPU_CYCLES
557904 IA64_INST_RETIRED
% cat b.cpu1
54799 CPU_CYCLES
32728 IA64_INST_RETIRED
You can add a header in each file by specifying the --with-header option.
6.6 Excluding the idle task
You need perfmon v2.3 or higher to get this feature.
As of perfmon v2.3, the kernel will systematically exclude the useless execution of the
idle thread on each processor. Useless execution is defined as the portion of the execution of the idle thread
where all it does is wait for an event, such as an interrupt.
6.7 Printing intermediate counts
In system-wide mode, when counting, you may want to see intermediate results. With the
--print-interval option, pfmon print deltas of events at regular intervals.
The --print-interval option takes the number of milliseconds between prints. Deltas are generated per event and per processor. There is no final counts printed in this mode. The --print-interval option is not available when aggregating results (--aggregate-results), when sampling and in per-thread mode.
Here is a simple example on 2-way machine where we measure for 6s and print every 2s:
% pfmon --system-wide --us-c --print-interval=2000 -t6
<session to end in 6 seconds>
#
CPU0 91,528 CPU_CYCLES
CPU1 176,120 CPU_CYCLES
#
CPU0 1,024 CPU_CYCLES
CPU1 294,496 CPU_CYCLES
#
CPU0 2,050 CPU_CYCLES
CPU1 103,525 CPU_CYCLES
7. Monitoring multiple processes/threads
This feature of pfmon is only meaningful in per-process mode.
7.1 Basic monitoring of multiple processes/threads
When monitoring complicated workloads it is useful to be able to monitor
more than just the first process or thread which typically only do the initialization
the bulk of the work being carried out by worker processes/threads. To illustrate the
various options offered by pfmon we use the example of the GNU C compiler gcc invoked
to compile a trivial "hello.c" program. Several processes are created as part of the
compilation:
% gcc -v hello.c -o hello -O2 -static
Reading specs from /usr/lib/gcc-lib/ia64-linux/2.96/specs
gcc version 2.96 20000731 (Debian GNU/Linux IA64 experimental)
...
/usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -v -D__GNUC__=2 -D__GNUC_MINOR__=96...
...
/usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccM0VFYv.i -mb-step -quiet -dumpbase hello.c...
...
as -x -o /tmp/ccMo7exq.o /tmp/ccas6Y5Q.s
...
/usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/lib/gcc-lib/...
The edited output reveals that gcc uses a total of 5 processes to compile hello.c.
Now, let's look at how pfmon can monitor all or some of the processes involved here.
By default only the first process is monitored:
% pfmon --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
1,123,406 IA64_INST_RETIRED
How do we know that only one process is monitored? You can enable the verbose mode of pfmon:
% pfmon --verbose --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
symbol file for gcc not found
measuring at user privilege level ONLY
[pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
[pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
[pmd4 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 rand=n]
exec-pattern=*
[10950] started task: gcc hello.c -o hello -O2 -static
follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
[10950] monitoring gcc hello.c -o hello -O2 -static ...
[10950] activating monitoring
measurements started at Thu Oct 30 17:10:58 2003
[10950] task exited
1,123,406 IA64_INST_RETIRED
[10950] detached
created tasks : 1
maximum tasks : 1
maximum active tasks : 1
measurements completed at Thu Oct 30 17:10:59 2003
Now if we want to monitor all processes/threads, we can use some of the --follow-* option
that pfmon offers:
- --follow-fork : pfmon monitors the child process created via a regular fork(). Pfmon creates
a distinct "session" dedicated to the newly created process. By default pfmon
does not monitor the child process.
- --follow-vfork : pfmon monitors the child process created via a vfork(). Pfmon creates
a distinct "session" dedicated to the newly created process. By default pfmon
does not monitor the child process.
- --follow-pthread: pfmon monitors the a new pthread is created via pthread_create(). This works for both the
old LinuxThreads library and the New Posix Thread Library (NPTL). Pfmon creates
a distinct "session" dedicated to the newly created process. By default pfmon
does not monitor the new thread.
- --follow-exec : pfmon monitors the process across the execve() system call. No distinct "session" is created.
By default, pfmon does not monitoring across exec().
- --follow-all : combines all of the above options.
In case you are not sure which variation of the fork to use, it is best to use --follow-all. Now we run our gcc
test case with the --follow-all option:
% pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
48,406,141 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (21927,21927,21926)
5,438,463 IA64_INST_RETIRED as (21928,21928,21926)
410,621,894 IA64_INST_RETIRED /usr/bin/ld (21930,21930,21929)
570,537 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (21929,21929,21926)
1,277,759 IA64_INST_RETIRED gcc (21926,21926,-1)
This time, pfmon reports 5 different results for 5 different processes.
Let us examine a line of the output:
5,438,463 IA64_INST_RETIRED as (21928,21928,21926)
^ ^ ^ ^ ^ ^
| | | | | |
| | | | | parent process id
| | | | thread id (tid)
| | | process id (pid)
| | command name
| event name
total count for the process
A parent process id of -1 indicates the first process.
It is important to keep in mind that for child processes, by default, the "total count" does
include execution up to the initial fork, i.e., before the exec:
m | -> vfork() ----> parent goes to wait()
o | |
n | |
i | |
t | execve()
o | |
r | execution of the command
e | |
d |-> exit()
The pid vs. tid distinction is only relevant for threads AND when using NPTL. Otherwise both
numbers should be identical. The pid, tid, parent pid can be used to re-create the process
tree involved with the execution of gcc. Results are reported in order of termination of the
process. This is why "gcc" is reported last. The entire command line is not reported, only
the first argument, which is the command name.
You can trace what is going on with the --verbose option:
% pfmon --verbose --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
loaded 14150 text symbols from /proc/kallsyms
loaded 3008 data symbols from /proc/kallsyms
symbol file for gcc not found
measuring at user privilege level ONLY
1 event sets defined
long sampling periods(val/mask/seed): 0/0x0/0
short sampling periods(val/mask/seed): 0/0x0/0
[pmc4=0x000828 thres=0 es=0x08 plm=8 umask=0x0 pm=0 ism=0x0 oi=1] IA64_INST_RETIRED
[pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
pmd setup for event set0:
[pmd4 set=0 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0 randomize=n]
exec-pattern=*
[22131] started task: gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static
follow_exec=y follow_vfork=y follow_fork=y follow_pthread=y
[22131] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[22131] activating monitoring
measurements started at Wed Jan 4 06:27:29 2006
[22131] vfork [22132]
[22132] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[22132] activating monitoring
[22132] exec /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet /home/eranian/nueh/...
[22132] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet /home/eranian...
[22132] task exited
set0 runs=1 duration=32122704
48,406,141 IA64_INST_RETIRED
/usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (22132,22132,22131)
[22132] detached
[22131] vfork [22133]
[22133] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[22133] activating monitoring
[22133] exec as -x -o /tmp/ccy0rssc.o /tmp/ccA8krGb.s ...
[22133] monitoring as -x -o /tmp/ccy0rssc.o /tmp/ccA8krGb.s ...
[22133] task exited
set0 runs=1 duration=5629233
5,438,463 IA64_INST_RETIRED as (22133,22133,22131)
[22133] detached
[22131] vfork [22134]
[22134] monitoring gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static ...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[22134] activating monitoring
[22134] exec /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp/hello...
[22134] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp...
[22134] vfork [22135]
[22135] monitoring /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 -static -o /tmp...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[22135] activating monitoring
[22135] exec /usr/bin/ld -static -o /tmp/hello /usr/lib/gcc/ia64-linux-gnu/4....
[22135] monitoring /usr/bin/ld -static -o /tmp/hello /usr/lib/gcc/ia64-linux-...
[22135] task exited
set0 runs=1 duration=274328917
410,625,058 IA64_INST_RETIRED /usr/bin/ld (22135,22135,22134)
[22135] detached
[22134] task exited
set0 runs=1 duration=1687796
570,537 IA64_INST_RETIRED
/usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (22134,22134,22131)
[22134] detached
[22131] task exited
set0 runs=1 duration=2196384
1,277,759 IA64_INST_RETIRED gcc (22131,22131,-1)
[22131] detached
created tasks : 5
maximum tasks : 3
maximum active tasks : 3
measurements completed at Wed Jan 4 06:27:29 2006
The trace reveals that vfork() is used to create the new processes, therfore we could have used
--follow-vfork and --follow-exec to obtain the same results.
7.2 Saving results in files
It is possible to use the --outfile option when you are using any of the --follow-* options.
There will be one file per process/thread, really per task in Linux terms unless you decide
to aggregate the results using the --aggregate-results options.
Using our gcc compilation example, we can now do:
% pfmon --outfile=result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
$ ls -l result.*
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1931.1931
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1933.1933
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1934.1934
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1935.1935
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1936.1936
-rw-r--r-- 1 eranian users 45 Nov 4 15:28 result.1937.1937
There is one file per process created. The naming convention used by pfmon is as follows:
result.1937.1937
^ ^ ^
| | |
| | thread id
| process id
base filename
Unless you are using threads (LinuxThreads or NPTL) both the process id and the thread id should be identical.
You can use the --with-header option to get detailed header in each output file.
% pfmon --with-header --outfile=result --follow-all --us-c --no-cmd-out \
-eia64_inst_retired -- gcc hello.c -o hello -O2 -static
% ls -l result.*
-rw-r--r-- 1 eranian users 1416 Nov 4 15:32 result.2791.2791
-rw-r--r-- 1 eranian users 1816 Nov 4 15:32 result.2793.2793
-rw-r--r-- 1 eranian users 1497 Nov 4 15:32 result.2794.2794
-rw-r--r-- 1 eranian users 1428 Nov 4 15:32 result.2795.2795
-rw-r--r-- 1 eranian users 1790 Nov 4 15:32 result.2796.2796
-rw-r--r-- 1 eranian users 1760 Nov 4 15:32 result.2797.2797
% cat result.2797.2797
...
#
# command line : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/../../../crt1.o ...
#
# process id : 2797
# thread id : 2797
# parent process id: 2796
# parent thread id: 2796
# exec count : 1
The "command line" line shows the program for this task with represents the invocation of the linker, ld. The
following set of lines shows the pid/tid and parent information.
7.3 Separating execution between fork()/vfork() and exec()
Pfmon can differentiate the execution that happens between the fork/vfork
and exec using the --exec-split-result option.
If we use our gcc example from the previous section, here is what happens:
Without the option:
% pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
48,406,141 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (5278,5278,5277)
5,438,463 IA64_INST_RETIRED as (5279,5279,5277)
410,617,535 IA64_INST_RETIRED /usr/bin/ld (5281,5281,5280)
570,537 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (5280,5280,5277)
1,277,759 IA64_INST_RETIRED gcc (5277,5277,-1)
With the option:
% pfmon --exec-split-result --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
1,190 IA64_INST_RETIRED gcc (19682,19682,19681)
48,404,951 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19682,19682,19681)
4,128 IA64_INST_RETIRED gcc (19683,19683,19681)
5,434,335 IA64_INST_RETIRED as (19683,19683,19681)
91 IA64_INST_RETIRED gcc (19684,19684,19681)
1,170 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19685,19685,19684)
410,620,724 IA64_INST_RETIRED /usr/bin/ld (19685,19685,19684)
570,446 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19684,19684,19681)
1,277,759 IA64_INST_RETIRED gcc (19681,19681,-1)
This time we have more counts reported, but that does not mean we had more processes created just because of
our measurement. It simply shows a finer grain breakdown of the same measurement. It is fairly easy to rebuild
the process tree:
gcc (19681,19681,-1)
vfork -> gcc (19682,19682,19681)
exec -> /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19682,19682,19681)
vfork -> gcc (19683,19683,19681)
exec -> as (19683,19683,19681)
vfork -> gcc (19684,19684,19681)
exec -> /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19684,19684,19681)
vfork -> /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19685,19685,19684)
exec -> /usr/bin/ld (19685,19685,19684)
When results are saved into a file, pfmon automatically adds an extra field to the file name when the
--exec-split-results option is used. This is necessary to distinguish between the various parts of the
execution of the task because the pid and the tid remain the same across exec:
% pfmon --exec-split-result --outfile=qq --follow-all --us-c --no-cmd-out \
-eia64_inst_retired -- gcc hello.c -o hello -O2 -static
% ls -l qq.*
-rw-r--r-- 1 eranian users 28 Jan 4 03:50 qq.19708.19708.1
-rw-r--r-- 1 eranian users 24 Jan 4 03:50 qq.19709.19709.0
-rw-r--r-- 1 eranian users 29 Jan 4 03:50 qq.19709.19709.1
-rw-r--r-- 1 eranian users 24 Jan 4 03:50 qq.19710.19710.0
-rw-r--r-- 1 eranian users 28 Jan 4 03:50 qq.19710.19710.1
-rw-r--r-- 1 eranian users 21 Jan 4 03:50 qq.19711.19711.0
-rw-r--r-- 1 eranian users 26 Jan 4 03:50 qq.19711.19711.1
-rw-r--r-- 1 eranian users 24 Jan 4 03:50 qq.19712.19712.0
-rw-r--r-- 1 eranian users 30 Jan 4 03:50 qq.19712.19712.1
Notice, the third field in the filenames:
qq.19709.19709.1
^ ^ ^ ^
| | | |
| | | exec number
| | thread id
| process id
base filename
The process 19709 did exec once. The qq.19709.19709.0 contains the results for the initial portion of
the execution of the process, i.e., before the first exec. The file qq.19709.1 contains the results
for the portion of execution after the first exec.
As for the initial process, gcc in this case, it shows up as qq.19708.19708.1. The exec count is always
1 for this program because we count the initial exec from the pfmon tool (after the vfork).
It is very helpful to use the --with-header option to record detailed information about the results.
For instance, in the example above:
% cat qq.19727.19727.0
...
# monitoring mode : per-thread
# command line : gcc /home/eranian/nueh/hello.c -o /tmp/hello -O2 -static
#
# process id : 19727
# thread id : 19727
# parent process id: 19726
# parent thread id: 19726
# exec count : 0
#
...
Looking at the other file:
% cat qq.19727.19727.1
...
# monitoring mode : per-thread
# command line : /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 -quiet ...
#
# process id : 19727
# thread id : 19727
# parent process id: 19726
# parent thread id: 19726
# exec count : 1
...
7.4 Selectively monitoring processes
When using the --follow-exec option it is possible to specify a regular expression pattern to filter out
some part of execution. This is best explain with an example. If we use our gcc example once again and
suppose we only care about the results for the assembler, "as", then we can use the following:
% pfmon --follow-all --follow-exec="^as" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
5,496,093 IA64_INST_RETIRED as (11065,11065,11061)
The "^" before as is to ensure that we match the pattern "as" only at the beginning of the command line.
Pfmon only measures and reports execution of the "as process". Of course, here one has to be careful by what is meant
by "as process". As we saw in the previous section, the "as process" starts its life as a "gcc process" that eventually
execs the assembler, as. In this particular example, pfmon did not have monitoring enabled at the time the "gcc process"
is created, therefore monitoring ONLY starts at the exec.
If you want to include more than one command and cannot easily express this with a regular expression, you can use
the opposite command to --follow-exec which is --follow-exec-exclude. Commands that match are not monitored.
In our gcc example, suppose we want to measure everything except gcc:
% pfmon --follow-all --follow-exec-exclude="^gcc" --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
48,404,951 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19747,19747,19746)
5,434,335 IA64_INST_RETIRED as (19748,19748,19746)
410,621,894 IA64_INST_RETIRED /usr/bin/ld (19750,19750,19749)
570,446 IA64_INST_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19749,19749,19746)
These options may seem frivolous at first because a simple grep of the full output would yield the same results but
they become particularly useful when pfmon is used for sampling. Also by not monitoring when not necessary we limit
the overhead.
7.5 Aggregating results from multiple processes/threads
By default, pfmon produces output for every actively monitored task followed. You may want to only
get the aggregate results. In this case, you simply need to use the --aggregate-results and pfmon
will add up all the counts per event for all tasks.
Let us take our compilation example and use the LOADS_RETIRED event to gain some more stability in the counts:
% pfmon --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
4482486 LOADS_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/cc1 (19772,19772,19771)
540310 LOADS_RETIRED as (19773,19773,19771)
42851738 LOADS_RETIRED /usr/bin/ld (19775,19775,19774)
66243 LOADS_RETIRED /usr/lib/gcc/ia64-linux-gnu/4.0.1/collect2 (19774,19774,19771)
122887 LOADS_RETIRED gcc (19771,19771,-1)
Now with aggregation:
% pfmon --aggregate-results --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
48063561 LOADS_RETIRED
8. Triggering monitoring at specific locations
Pfmon allows you to control where monitoring is started and stopped using code or data trigger addresses. By
default pfmon starts monitoring at the first instruction of a program and stop when the program terminates, i.e.,
the process exits. This feature exists on all platform and does not use the PMU hardware but the debug
registers. As such it introduces more overhead and should be use carefully, i.e. the section of the
execution where monitoring is active must be long enough compared to the overhead.
This ability to control where monitoring starts/stops is only available in per-thread mode.
8.1 Basic code triggers
You can start or stop monitoring when execution reaches a certain instruction in your program.
The instruction is identified by its address. There is no particular checks made on the address,
pfmon does not try to validate the address inside the process user level address space. However
pfmon may reject addresses referring to kernel code. This feature cannot be used to trigger monitoring
for kernel level execution. This limitation is due to the fact that triggers are implemented using
the debug registers and it is illegal to set a breakpoint in the kernel.
The number of code triggers depends on the processor. If you have 4 triggers, then, you can have two
start triggers and two stop triggers, but they do not need to be evenly balanced, i.e., one start and
three stops is perfectly legal. You specify the start triggers using --trigger-code-start followed
by a comma separated list of addresses or symbol names. Do not use space between the elements of the
list. Similarly stops are expressed using --trigger-code-stop as a comma-separated list of addresses
or symbols.
Supposing we set the start trigger at the beginning of a function, and if this function calls
others, monitoring will stay active until we come across a stop trigger or the process terminates.
By default start and stop triggers are "executed" only once.
Let us look at a simple example with a stupid noploop program:
% nm noploop | fgrep "T main"
4000000000000450 T main
Count the number of load retired instruction for the entire execution:
% pfmon --no-cmd-out -eloads_retired noploop 100000
1714 LOADS_RETIRED
Count the number of load retired instruction starting at main():
% pfmon --trigger-code-start=0x4000000000000450 --no-cmd-out -eloads_retired noploop 100000
538 LOADS_RETIRED
Monitoring starts at the first instruction in main() and stop when noploop executes its last user level
instruction (we only monitor at the user level here). From the two runs, we can figure out the number
of loads executed by the startup code: 1714-538 = 1176. We can verify this by simply stopping when we
reach main():
% pfmon --trigger-code-stop=0x4000000000000450 --no-cmd-out -eloads_retired noploop 100000
1176 LOADS_RETIRED
As you can see here, you do not need to match a start with a stop.
Instead of using addresses, you can use symbols names given that the symbol table is still
available in the binary. So our last example can be expressed as follows:
% pfmon --trigger-code-stop=main --no-cmd-out -eloads_retired noploop 100000
1176 LOADS_RETIRED
8.2 Data triggers
Data triggers work very much like code triggers but they apply only to data accesses, i.e,
load, store, synchronization instructions. By default the trigger operates on read or
write access. The start triggers are specified with the --trigger-data-start option
and the stop triggers with the --trigger-data-stop option. Both are comma-separated lists
of data addresses or symbols. The trigger is activated when the byte pointed to by the address
is accessed as part of any size of access.
When using symbols, the names refer to global variables in the main program when linked
dynamically or for the entirety of the program when linked statically, i.e., no dynamic
symbol lookup is performed.
It is possible to refine the type of access that we active the triggers. By default
they are activated on read OR write. You can restrict this with the following options
if supported by underlying processors:
- --trigger-data-ro : activate on read access only
- --trigger-data-wo : activate on write access only
8.3 Repeating triggers
In certain cases it is useful to repeat the start/stop trigger sequence each time
the program comes across the trigger points. In the case of a code trigger, that is how
you could accumulate a measurement specific to a function, for instance. By default pfmon
only uses each trigger once. You can repeat the code triggers using --trigger-code-repeat
and ALL code start/stop trigger will be repeated. Similarly using --trigger-data-repeat
will keep the data trigger installed throughout the execution.
8.4 Inheritance of triggers
By default, triggers are not inherited across fork,vfork,pthread_create. Triggers
are NEVER inherited across exec(). By inheritance, we mean are not reinstalled into
the newly created task. Trigger inheritance is possible using the --trigger-code-follow
and --trigger-data-follow options.
8.5 Dynamic code stop triggers
Many times, code triggers are used to measure only one function and the ones it calls.
However many functions can be called from different places and it is often hard to
locate the return point where the stop trigger should be placed. Pfmon can dynamically
figure out the return point of a function. When the code start and stop trigger are
identical, pfmon interprets the options as a request to monitor when execution comes
into a function until it leaves the function, i.e, return from the function. Pfmon
does this using a best-effort approach that is not 100% guaranteed. Pfmon does not
check that you place the start trigger at the beginning of a function. To determine
the location where to place the stop trigger, pfmon uses the value stored in the
return pointer at the time execution reaches the start trigger. Hence, if
the start trigger is placed at the first instruction of the function, then
the return pointers contains the place where the function will eventually return to.
Note that this location is not necessarily just after where the call instruction, especially
for leaf procedures. This feature is not 100% guaranteed and as such must be
used by people who understand their code very well.
Here is a simple example using our noploop program:
% pfmon --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
0 LOADS_RETIRED
To figure out what is going on, you can turn on verbose mode:
% pfmon --verb --no-cmd-out --trigger-code-start=noploop --trigger-code-stop=noploop -e loads_retired noploop 100000000
loaded 14150 text symbols from /proc/kallsyms
loaded 3008 data symbols from /proc/kallsyms
table_size=1440 effective=945 min_addr=0x4000000000000140
loaded 945 text symbols from ELF file /home/eranian/nueh/noploop
table_size=1440 effective=408 min_addr=0x40000000000a5b60
loaded 408 data symbols from ELF file /home/eranian/nueh/noploop
measuring at user privilege level ONLY
1 event sets defined
long sampling periods(val/mask/seed): 0/0x0/0
short sampling periods(val/mask/seed): 0/0x0/0
start code trigger @0x4000000000000390
stop code trigger @0x4000000000000390
[pmc5=0x00cd28 thres=0 es=0xcd plm=8 umask=0x0 pm=0 ism=0x0 oi=1]
LOADS_RETIRED
[pmc15=0xffffffff ibrp0_pmc8=1 ibrp1_pmc9=1 ibrp2_pmc8=1 ibrp3_pmc9=1]
pmd setup for event set0:
[pmd5 set=0 ival=0x0 long_rate=0x0 short_rate=0x0 mask=0x0 seed=0
randomize=n]
exec-pattern=*
[21108] started task: /home/eranian/nueh/noploop 100000000
follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
[21108] monitoring /home/eranian/nueh/noploop 100000000 ...
results are on terminal
requested switch timeout 0s 0ms 0ns
effective switch timeout 0s 0ms 0ns
[21108] installed start code breakpoint at 0x4000000000000390
[21108] monitoring not activated
measurements started at Wed Jan 4 05:23:43 2006
[21108] reached start code breakpoint @0x4000000000000390
[21108] install br.ret breakpoint @0x40000000000004e0
[21108] activating monitoring at trigger start
[21108] clearing code breakpoint @0x4000000000000390
[21108] reached stop code breakpoint @0x40000000000004e0
[21108] stopping monitoring at trigger stop
[21108] clearing code breakpoint @0x40000000000004e0
[21108] task exited
set0 runs=1 duration=133782861
0 LOADS_RETIRED
[21108] detached
created tasks : 1
maximum tasks : 1
maximum active tasks : 1
measurements completed at Wed Jan 4 05:23:43 2006
The key line in the trace is the one that shows the dynamically extracted stop trigger:
[21108] install br.ret breakpoint @0x40000000000004e0
When the start trigger is reached at 0x4000000000000390, i.e., the beginning of noploop(),
pfmon queries the kernel to get the return pointer, and it gets 0x40000000000004e0 which
is the address in main() where noploop() will return to.
9. Dealing with symbols
Whenever an option takes an address (code or data) as argument, it is
possible to directly use a symbol name rather than an address.
For instance, this is true for the --trigger-code-*, --trigger-data-*
options. Other examples include the Itanium2 options, --drange, --irange
options.
9.1 Symbol table sources
Pfmon can extract the symbol table from an ELF image, /proc/*/maps.
For kernel symbols, it can extract them from the kernel ELF image, System.map or /proc/kallsyms
(default).
9.2 Symbol lookups
In per-thread mode, the symbols are automatically extracted from the
command being run. When the command does not have the symbol table it
is possible to specify an alternate ELF file to look for the symbol, using
the --symbol-file command. This other ELF file must be the non-stripped version
of the command being run, of course!
There are a few restrictions concerning symbols in per-thread mode. Pfmon cannot
extract symbol information coming from dynamically linked libraries or modules.
To work around this problem, the program must be statically linked and should
not explicitly use dl_open(). Another workaround is to attach to the program with --attach-task.
This limitation should be lifted in future version of pfmon.
If the symbol table has been stripped, pfmon will not find any symbol and you
should revert to using a numerical address.
In case the option requires a code address, pfmon will only look for matching
code symbols. Conversely, if the option requires a data address, pfmon will only
look for matching data symbols.
For the kernel, pfmon automatically loads the content of /proc/kallsyms, unless
the user specified the --sysmap-file option which supersedes it. The option
must point to a valid System.map file that matches the host kernel.
It should be noted that /proc/kallsyms does not provide the size of symbols. Thus,
pfmon approximates by assuming that symbols do not overlap and that there is no gap. Some ELF symbol
tables may also present this kind of defects.
9.3 Symbols and the --follow-* options
In per-thread mode, symbol lookups are done before the command is actually
launched therefore they pertain only to the initial program. Hence, it is not possible to
refer to the name of a symbol from a program that will eventually be executed. For instance,
it is not possible to name a symbol from the assembler "as" when monitoring a compilation
with GNU C (gcc).
9.4 Symbol resolution when sampling
It is possible to resolve all addresses present in samples when using the
--resolve-addresses option.
With this option, pfmon tries to resolve addresses it finds in each sample.
This includes the instruction pointer (IP) but possibly other addresses contained in model-specific
PMDs. For instance, pfmon resolves the address contained in PMD17 for D-EAR
on Itanium2 processors. Resolution occurs for kernel level symbols and
user applications if the symbol table is found. When no matching symbol
exists, the raw address is printed instead. The location of the symbol,
either kernel or application (app) is suffixed to the symbol name.
In system-wide, only kernel symbols are resolved.
9.5 Symbol resolution examples
Here are a few examples on Itanium or Itanium2 and the model specific --checkpoint-func option:
Count the number of time main() is called in the noploop program:
% file noploop
noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
% pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop 10000
3 IA64_INST_RETIRED
Here the symbol information for main() is directly extracted from noploop
itself. You must divide by 3 (because 3 instructions were executed in the first bundle)
to get the actual result.
Count the number of times main() is called in the noploop-stripped program:
% file noploop-stripped
noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped
% pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop-=stripped 10000
symbol not found: main
Let us provide the non-stripped version for the symbol table:
% pfmon --no-cmd-out --symbol-file=noploop --checkpoint-func=main -e ia64_inst_retired noploop-stripped 1000
3 IA64_INST_RETIRED
Here noploop and noploop-stripped are the same programs except that the latter does not have
the symbol table anymore.
Count the number of times sys_getpid() is called during the execution of the noploop program:
% pfmon -k --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000
0 IA64_INST_RETIRED
The -k is required to make sure we monitor at the kernel level where sys_getpid() is located. The
results shows that noploop never calls getpid(). We can easily verify that this is correct using
the strace program:
% strace noploop 1000
execve("./noploop", ["noploop", "100000"], [/* 24 vars */]) = 0
uname({sys="Linux", node="tuff.hpl.hp.com", ...}) = 0
brk(0) = 0x6000000000010000
brk(0x6000000000034000) = 0x6000000000034000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000000000
write(1, "143663 cycles, 1.44 cycles/loop\n", 32143663 cycles, 1.44 cycles/loop) = 32
munmap(0x2000000000000000, 65536) = 0
exit(0) = ?
But if we use a program that does getpid() in a loop:
% pfmon --no-cmd-out -k --checkpoint-func=sys_getpid -e ia64_inst_retired getpid 1000
3003 IA64_INST_RETIRED
This one does 1000+1 (3003/3) calls, obviously a 1000 from the loop plus an extra one somewhere else.
10. Sampling with pfmon
Pfmon has support for sampling on any events or combination of events. Samples are collected
into a buffer which can then be written to a file or simply on the screen. Sampling is supported
for both system-wide mode and per-thread mode including when following fork, vfork, and pthread_create.
Pfmon can sample is many different ways depending on the underlying CPU or what is being measured.
For a given sampling measurement, pfmon can present the information differently. For instance it
can be presented in clear text with decoding of certain timings, or samples can be dumped as a table
with the values uninterpreted. What controls the output is called a pfmon sampling module and we
detail what they do in the next sections, but first we need to introduce what sampling with
pfmon means.
10.1 Principles
Sampling is a technique to record certain information at some interval. The information is recorded
into a sample which is stored into a sampling buffer. The interval can be time-based or more likely
event-based and the length can be fixed or variable. A typical example is the case where you want
to record where you are (IP) each time you have executed one million instructions.
Each sample is composed of two parts, a fixed size header which contains general information about
the sample and a variable size body which consists of a set of 64-bit values each one corresponding
to a PMD register representing the other events being monitored. All samples record the same set
of PMDs, this set is determined by pfmon based on what is requested by the user.
The sampling buffer is controlled by the kernel but its size is configurable. Depending on
the sampling module, the default size of the buffer can vary. The common module
size is 2048 entries. Each sampling module may offer options to change that size.
Sampling works as follows:
- the user specifies what needs to be recorded
- the user specifies the sampling period and optional randomization parameters
- at the end of a period, a sample is recorded into the buffer by the kernel
- a new sampling period is reloaded and execution/monitoring resumes. we go back to step 3
- if the sampling buffer becomes full, pfmon is notified
- pfmon processes the buffer, i.e., prints and/or saves the buffer depending on the sampling module
- pfmon then notifies the kernel that it is done
- the kernel reload new sampling periods and execution/monitoring resumes. we go back to 3
Pfmon and the kernel perfmon subsystem allow monitoring sessions with multiple sampling periods.
In case where more than one sampling periods expire at the same time, there is one sample per
expired period recorded, thereby multiple samples recorded, subject to the space left in the buffer.
No partial sample is ever recorded.
10.2 Sampling periods
Pfmon (and the perfmon interface) uses two sampling periods instead of just one. The first one is called
short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used
in step 4, this is when the sampling buffer is not full after writing a sample. The
long-smpl-period is used in step 8 when the reload occurs after the buffer became full.
But why do we need 2 periods?
As you might imagine there is some overhead when recording a sample. This overhead is
increased even more when pfmon needs to get involved to drain the buffer. This operation
can take some time and will inevitably introduce some "noise" in the measurements in the form
of TLB and/or cache pollution. To try and hide this "noise", it is sometimes beneficial to
adjust the sampling period, i.e., make it bigger to ensure that the next sample will not
record an event that is the consequence of the overhead generated by the monitoring but rather
a normal event for the program/system being monitored. It is expected that the
long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to
having only one sampling period. Note that the long-smpl-period is only used for
the first sample recorded after the buffer is marked as empty again (step 7). Pfmon
also use the long period first the initial period, i.e., for the first sample of the
measurement.
The long sampling period is specified via the --long-smpl-periods options and similarly
--short-smpl-period is used for the short sampling period.
% pfmon --long-smpl-periods=5000 --short-smpl-period=2000 -e l3_misses ....
In the command line above, we take a sample every 2000 L3 misses at the user level.
When the buffer becomes full, pfmon processes the data and restarts monitoring.
The next sample will come after 5000 L3 misses, subsequent samples will be 2000 misses
apart.
Pfmon supports multiple sampling periods. Each time one period elapses a sample
is recorded. Hence with multiple sampling periods you can, in effect, create an
OR condition: after N occurrences of event A OR M occurrences of event B record
a sample. The number of sampling periods is limited by the number of counters
supported by the underlying PMU. On Itanium and Itanium2 the limit
is 4 counters, thus 4 sampling periods.
When multiple sampling periods are used, they must be specified as a comma-separated
list (no space). They apply in the same order as the events are specified with the -e option:
% pfmon --long-smpl-periods=4000,3000 --short-smpl-periods=2000,1000 -el2_misses,l3_misses ...
Here, the long and short periods and l2_misses are respectively 4000 and 2000. For l3_misses
they are 300 and 1000 respectively.
At a minimum, You must specify either one of the two types of sampling periods. Pfmon
automatically adjusts the other one to be equal in that case. Hence the following:
% pfmon --long-smpl-periods=4000 -el2_misses ...
implies that the short period for l2_misses is 4000. You can see this with the --verbose
option:
% pfmon --verbose --long-smpl-periods=4000 -el2_misses ...
loaded 10806 symbols from /proc/kallsyms
loaded 1353 symbols from ELF file noploop
measuring at user privilege level ONLY
using detailed-itanium2 sampling module
long sampling periods(val/mask/seed): 2000/0x0/0
short sampling periods(val/mask/seed): 2000/0x0/0
...
Sampling is not supported when multiple event sets are specified.
10.3 Sampling modules
There are many ways in which the samples can be saved or printed on the
screen. Pfmon offers a choice of sampling module depending on the underlying
CPU and what is being measured.
To find out the list of builtin modules, simply use the -I option:
% pfmon -I
detected host CPUs: 2-way 1500MHz/6.0MB Itanium 2 (Madison, rev 5)
detected PMU model: Itanium2
max counters/set: 4
supported PMU models: [Itanium] [Itanium2] [dual-core Itanium 2] [generic IA-64]
supported sampling modules: [inst-hist] [detailed] [compact] [raw] [dear-hist]
pfmlib version: 3.2
kernel perfmon version: 2.7
The first module listed, here [inst-hist], is selected by default.
Because some modules are dependent on the underlying PMU model, the choices
may vary from one processor to another. The example below is for
an Intel Core 2 Duo processor:
% pfmon -I
detected host CPUs: 4-way 2660MHz/4.0MB -- Intel(R) Xeon(R) CPU 5150 @ 2.66GHz (stepping 6)
detected PMU model: Intel Core
max counters/set: 5
supported PMU models: [AMD64] [Pentium 4] [Intel Core] [Intel architectural PMU]
supported sampling modules: [inst-hist] [detailed] [compact] [raw] [pebs]
pfmlib version: 3.2
kernel perfmon version: 2.7
A sampling module is called when the sampling buffer becomes full and needs to
be drained (dumped). Each module is free to process the data in any way it
wants. Some of them can be specialized in ONLY a certain type of measurements.
The following formats are available for all processors:
| [inst-hist] | (default) : prints an instruction pointer based histogram. This module
works for all CPU models. It is possible to specify multiple sampling
periods, samples counts for each events are printed. |
| [detailed] | prints detailed information about each sample. The header
and body are decoded. Each PMD is detailed. No histrogram
is generated. |
| [compact] | prints the header and body with one line per sample and just
raw values. No interpretation is attempted. This module works
for all CPU models. |
| [raw] | the buffer is dumped as is into a binary file. This module works
for all CPU models. |
By default, pfmon auto-detects the host CPU and picks the first matching sampling
module. On all PMU models, the inst-hist module is chosen by default. This can be changed using the
the --smpl-module option:
% pfmon --no-cmd-output --smpl-module=compact --long-smpl-periods=1000000 noploop 10000000
0 11345 11345 3 0x400280 0x0001b4b738f07b06 17 0 -1000000
1 11345 11345 3 0x400280 0x0001b4b738f64a91 17 0 -1000000
2 11345 11345 3 0x400280 0x0001b4b738fc0ae9 17 0 -1000000
3 11345 11345 3 0x400280 0x0001b4b73901cad5 17 0 -1000000
4 11345 11345 3 0x400280 0x0001b4b739079e20 17 0 -1000000
5 11345 11345 3 0x400280 0x0001b4b7390d5eae 17 0 -1000000
...
You can find out more about each module by using the -S or --smpl-module-info option:
% pfmon -S dear-hist
name : dear-hist
description : Data EAR-based cache/tlb misses histograms
PMU models : [itanium] [itanium2] [dual-core Itanium 2]
options :
--smpl-inst-view Show instruction address based histogram (default).
--smpl-data-view Show data address based histogram.
--smpl-level-view Show cache/tlb level based histogram.
--smpl-sort-byvalue Sort samples by their value.
--smpl-sort-bylevel Sort samples by cache/tlb hit level.
--smpl-save-raw Save samples in binary format for offline processing.
The name of the module along with a short description is printed. Then, the list of supported
PMU models. Finally, module-specific options are listed.
Each module can have module-specific options. Those options become known to pfmon ONLY once
the module is known. Thus, you can specify module options ONLY after the module is specified with --smpl-module.
this restriction does not apply to the default module (inst-hist):
% pfmon -ecpu_cycles --smpl-module=dear-hist --smpl-data-view ....
You can list all the modules with all their options simply by typing:
% pfmon --help
....
options for "inst-hist" sampling format:
none
options for "detailed" sampling format:
none
options for "compact" sampling format:
none
options for "raw" sampling format:
none
options for "dear-hist" sampling format:
--smpl-inst-view Show instruction address based histogram (default).
--smpl-data-view Show data address based histogram.
--smpl-level-view Show cache/tlb level based histogram.
--smpl-sort-byvalue Sort samples by their value.
--smpl-sort-bylevel Sort samples by cache/tlb hit level.
--smpl-save-raw Save samples in binary format for offline processing.
10.4 Basic sampling examples
Suppose you want to record a sample every 1,000,000 cycles. In each sample, you are interested
in knowing where the program is, i.e., record the instruction pointer (IP-based sampling). The cycle event is the
default event used by pfmon, so all you have to do is specify the sampling period(s):
% pfmon --long-smpl-period=1000000 -- foo
# event00
# counts %self %cum code addr
11470 69.00% 69.00% 0x00000000004005e9
4527 27.23% 96.23% 0x00000000004005f1
534 3.21% 99.44% 0x00000000004005e0
68 0.41% 99.85% 0x00000000004005f4
12 0.07% 99.92% 0x00000000004005e6
6 0.04% 99.96% 0x00000000004005ed
5 0.03% 99.99% 0x00000000004005fc
1 0.01% 99.99% 0xffffffff8021e49a
1 0.01% 100.00% 0x00000000004005c8
The example shows shows the output you get with the inst-hist sampling module. The samples are
sorted by the number of hits, i.e., the number of times the same instruction address was encountered. So here,
the address 0x00000000004005e9 had 11,470 samples. This is where a lot of the time was actually spent. The second
column shows the importance of the number of samples for a particular address relative to all the samples collected.
Here the 11,470 samples represent 69% of all samples collected during the run. The third column print the cumulative
percentage all the way to 100%. The fourth column is the code address.
We can sample on multiple events at the same time, by specifying multiple events and sampling periods:
% pfmon --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- foo
# event00 event01
# counts %self %cum counts %self %cum code addr
12220 75.43% 75.43% 83451 44.87% 44.87% 0x00000000004005e9
3447 21.28% 96.71% 85539 45.99% 90.85% 0x00000000004005f1
468 2.89% 99.60% 15283 8.22% 99.07% 0x00000000004005e0
31 0.19% 99.79% 1134 0.61% 99.68% 0x00000000004005f4
10 0.06% 99.85% 175 0.09% 99.77% 0x00000000004005e6
10 0.06% 99.91% 91 0.05% 99.82% 0x00000000004005ed
5 0.03% 99.94% 196 0.11% 99.93% 0x00000000004005fc
5 0.03% 99.98% 4 0.00% 99.93% 0xffffffff80235325
2 0.01% 99.99% 6 0.00% 99.93% 0x00000000004005c8
...
Here we are sampling on elapsed cycles and data TLB misses. A sample is recorded every 1,000,000
cycles OR every 20,000 data TLB misses. There is now a second group of 3 columns for the second event. Columns are
oredered based on event order. The example shows a correlation between the instruction where we spend a lot of time
and the data TLB misses. Samples are always sorted based on the counts of the FIRST event
10.5 Sampling and symbol resolution example
It is possible to get instruction address resolution using the --resolve-addresses option:
% pfmon --resolv --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- foo
# event00 event01
# counts %self %cum counts %self %cum code addr symbol
6976 68.41% 68.41% 4684 49.96% 49.96% 0x00000000004005e9 mat_add_num+0x29<mcol>
2754 27.01% 95.41% 4220 45.01% 94.97% 0x00000000004005f1 mat_add_num+0x31<mcol>
376 3.69% 99.10% 301 3.21% 98.18% 0x00000000004005e0 mat_add_num+0x20<mcol>
46 0.45% 99.55% 61 0.65% 98.83% 0x00000000004005ed mat_add_num+0x2d<mcol>
34 0.33% 99.88% 86 0.92% 99.74% 0x00000000004005f4 mat_add_num+0x34<mcol>
7 0.07% 99.95% 2 0.02% 99.77% 0x00000000004005e6 mat_add_num+0x26<mcol>
2 0.02% 99.97% 0 0.00% 99.77% 0x000000000040076b main+0x10b<mcol>
2 0.02% 99.99% 11 0.12% 99.88% 0x00000000004005fc mat_add_num+0x3c<mcol>
1 0.01% 100.00% 0 0.00% 99.88% 0xffffffff8021e49a do_page_fault+0x292<kernel>
0 0.00% 100.00% 8 0.09% 99.97% 0x0000000000400600 mat_add_num+0x40<mcol>
0 0.00% 100.00% 2 0.02% 99.99% 0x00000000004005c8 mat_add_num+0x8<mcol>
0 0.00% 100.00% 1 0.01% 100.00% 0x0000000000400603 mat_add_num+0x43<mcol>
The symbolic name of the function is printed followed by an offset expressed in bytes. It is followed by the
name of the ELF file where the function is located. The full path is printed for libraries. When the symbol is not found, the
raw address is repeated in the symbol column.
10.6 Controlling sampling output
By default, the profile is printed for each instruction address. Using the --smpl-per-function, it is possible
to collapse the output by function:
% pfmon --resolv --smpl-per-func --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
# event00 event01
# counts %self %cum counts %self %cum code addr symbol
10273 99.98% 99.98% 9309 100.00% 100.00% 0x00000000004005c0 mat_add_num<mcol>
1 0.01% 99.99% 0 0.00% 100.00% 0xffffffff802352d2 __do_softirq<kernel>
1 0.01% 100.00% 0 0.00% 100.00% 0xffffffff8021e208 do_page_fault<kernel>
It is also possible to limit the number of samples printed on the screen using the following
options:
- --smpl-show-top=n : print at most n lines of samples
- --smpl-cum-threshold=n: stop printing samples when the first event cumulative percentage reaches n%
Both options can be used simultaneously, printing stop when anyone limit is reached. Let us
look at a few examples. First let's try to limit the number of lines to the top 5:
% pfmon --resolv --smpl-show-top=5 --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
# event00 event01
# counts %self %cum counts %self %cum code addr symbol
6973 67.81% 67.81% 4581 49.16% 49.16% 0x00000000004005e9 mat_add_num+0x29<mcol>
2852 27.74% 95.55% 4245 45.55% 94.71% 0x00000000004005f1 mat_add_num+0x31<mcol>
346 3.36% 98.91% 298 3.20% 97.91% 0x00000000004005e0 mat_add_num+0x20<mcol>
59 0.57% 99.48% 95 1.02% 98.93% 0x00000000004005ed mat_add_num+0x2d<mcol>
37 0.36% 99.84% 89 0.96% 99.88% 0x00000000004005f4 mat_add_num+0x34<mcol>
Second, let's try limiting output when we reach 96% of samples for the first event:
% pfmon --resolv --smpl-cum-thresh=96 --long-smpl-periods=1000000,20000 -eunhalted_core_cycles,mem_load_retired:dtlb_miss -- mcol
# event00 event01
# counts %self %cum counts %self %cum code addr symbol
6983 68.31% 68.31% 4502 48.51% 48.51% 0x00000000004005e9 mat_add_num+0x29<mcol>
2755 26.95% 95.27% 4330 46.65% 95.16% 0x00000000004005f1 mat_add_num+0x31<mcol>
10.7 Writing samples to a file
By default, all sampling modules print to the controlling terminal. This may be overwhelming
at times. Therefore, it is strongly recommended that samples be redirected to a file using the
--smpl-outfile option. This option is different from the --outfile option. The latter is where
the final counts are saved, whereas the former is where samples are saved. By default,
no counts are printed in sampling mode.
The filename specified for the --smpl-outfile is just the basename. A suffix is added
by pfmon. The suffix depends on the type of monitoring session. For a system-wide session, you get
one file per cpu and the .cpuXX suffix is added with XX representing the CPU number, starting
at zero. If aggregation is used then nothing is appended. In per-thread mode, the filename is also the
basename. In case the follow-* option is used, you get one file per monitored thread.
10.8 Blocking on overflow notifications
Whenever the sampling buffer becomes full and pfmon is notified, you have the option of either
letting the monitored program continue or block it. In both cases, monitoring is off during the
processing of the sampling buffer. By default, pfmon lets the program continue its execution. It is
possible to block the program using the --overflow-block option. Blocking the program ensures pfmon
sees the entire execution. Keeping the program running ensures that the caches and TLB are kept
somewhat warm, i.e., with some state belonging to the running process, especially on SMP systems.
The option is supported in all types of per-thread sessions including the ones which are following
execution through fork(), vfork(), and pthread_create(). This options is not supported in system-wide
mode.
IMPORTANT:: this option may have some side effects for programs using signals. Signals
may cause the blocking period to be terminated early, yet monitoring would not resume. Thus it is possible to get much fewer
samples than expected. Use this option carefully.
10.9 Sampling and the --follow-* options
Sampling is supported across fork(),vfork(), and pthread_create(). It is also supported when
sampling across exec().
By default, all sampling modules write to the controlling terminal, hence it may be difficult
to distinguish which samples come from which task and output may be interleaved. However, pfmon
does serialize terminal output. This means that samples output is batched per thread.
The best way to handle multi process/thread sampling sessions is to redirect the output to a file
using the --smpl-outfile option. The filename passed to the option serves as a basename very much
like what is done for the --outfile option that we described above.
Let us look at our compilation example:
% pfmon --follow-all --no-cmd-out --smpl-outfile=gcc-smpl --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
% ls -l gcc-smpl.*
2174981 -rw-r--r-- 1 eranian users 0 Nov 4 17:31 gcc-smpl.24410.24410
2175304 -rw-r--r-- 1 eranian users 0 Nov 4 17:31 gcc-smpl.24411.24411
2175313 -rw-r--r-- 1 eranian users 1046 Nov 4 17:31 gcc-smpl.24412.24412
2175323 -rw-r--r-- 1 eranian users 380 Nov 4 17:31 gcc-smpl.24413.24413
2175327 -rw-r--r-- 1 eranian users 0 Nov 4 17:31 gcc-smpl.24414.24414
2175337 -rw-r--r-- 1 eranian users 14634 Nov 4 17:31 gcc-smpl.24415.24415
Just like for --outfile, the process id and the thread id are appended to the provided filename.
The thread id differs from the process id only when following across threads. For processes, the two numbers are identical:
gcc-smpl.24410.24410
^ ^ ^
| | |
| | thread id
| process id
base filename
You can get detailed information included in each file using the --with-header option,
just like for --outfile.
The --exec-split-results and --follow-exec-* options have the same effects as on --outfile.
The exec count is appended to the filename:
% pfmon -exec-split --follow-all --no-cmd-out --smpl-outfile=gcc-smpl \
--long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
% ls -l gcc-smpl.*
2174981 -rw-r--r-- 1 eranian users 1683 Nov 4 17:38 gcc-smpl.25727.25727.1
2175304 -rw-r--r-- 1 eranian users 1594 Nov 4 17:38 gcc-smpl.25728.25728.0
2175312 -rw-r--r-- 1 eranian users 1990 Nov 4 17:38 gcc-smpl.25728.25728.1
2175317 -rw-r--r-- 1 eranian users 1594 Nov 4 17:38 gcc-smpl.25729.25729.0
2175323 -rw-r--r-- 1 eranian users 2717 Nov 4 17:38 gcc-smpl.25729.25729.1
2175332 -rw-r--r-- 1 eranian users 1594 Nov 4 17:38 gcc-smpl.25730.25730.0
2175333 -rw-r--r-- 1 eranian users 1982 Nov 4 17:38 gcc-smpl.25730.25730.1
2175336 -rw-r--r-- 1 eranian users 1594 Nov 4 17:38 gcc-smpl.25731.25731.0
2175337 -rw-r--r-- 1 eranian users 1964 Nov 4 17:38 gcc-smpl.25731.25731.1
2175341 -rw-r--r-- 1 eranian users 1964 Nov 4 17:38 gcc-smpl.25732.25732.0
2175342 -rw-r--r-- 1 eranian users 16568 Nov 4 17:38 gcc-smpl.25732.25732.1
If we look at the header of the largest file, we find out that this is the linker, which
is exec'ed from the collect2 process:
# command line : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/...
#
# process id : 25732
# thread id : 25732
# parent process id: 25731
# parent thread id: 25731
# exec count : 1
10.10 Sampling in system wide mode
Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the
controlling terminal. When sampling on more than one CPU at a time, samples for each CPU are printed separately. When
sampling results are redirected into a file with the --smpl-outfile option, you get one file per CPU. If
the file is called 'myresults', then 'myresults.cpu0' contains the samples captured on CPU0,
'myresults.cpu1' the ones from CPU1, and so on:
% pfmon --system-wide --smpl-outfile=sysout --long-smpl-periods=1000000 --session-timeout=3
<session to end in 3 seconds>
% ls -l sysout*
sysout.cpu0 sysout.cpu1
% ls -l sysout.cpu*
-rw-r--r-- 1 eranian users 0 Nov 4 12:22 sysout.cpu0
-rw-r--r-- 1 eranian users 0 Nov 4 12:22 sysout.cpu1
The --aggregate-results option changes the way samples are saved to files. When this option is used,
samples are merged into a single file. In our example, they would go into 'myresults':
10.11 Randomization of sampling periods
Pfmon supports randomization of sampling periods. This is useful to minimize the risk of getting
biased samples. This is especially true when sampling on events that happen very frequently such as taken branches.
The user must supply a bitmask and a seed value using the --smpl-periods-random option. The seed value is
ignored with perfmon v2.2 or higher. The same mask applies to both the long and short period for each event.
10.12 Miscellaneous sampling options
When sampling, the counts printed at the end of the session are not very useful, especially for
the counters used as sampling periods. By default, pfmon does not print them at the end of a
sampling session. You can change this by using the --smpl-print-counts option.
For certain measurements, it is interesting to put the sampling buffer in "saturation" mode.
This means that monitoring stops as soon as the buffer fill up. Pfmon is not notified at
all. In the end, we collect the first N samples, where N is determine by the size of
the sampling buffer. The processes and session continues until normal termination, but
no monitoring is active. You can turn on this mode using the --saturate-smpl-buffer.
By default, pfmon includes the PMD register values for all but the overflowed
PMD in a sample. For instance, if you do the following:
% pfmon -ecpu_cycles,ia64_inst_retired --long-smpl-periods=15000 ls
Assuming cpu_cycles is counted by PMD4 and ia64_inst_retired by PMD5, in each
sample created by the overflow of PMD4, the value of PMD5 is saved in the
body. In some configurations, it may be interesting to include other PMD
register values in the body of a sample, e.g. some PMD may be directly be
related to an event and thus cannot be specified on the command line. For
these configurations, it is possible to use the --extra-smpl-pmds option:
% pfmon -ecpu_cycles,ia64_inst_retired --extra-smpl-pmds=0,2-3 --long-smpl-periods=15000 ls
In this example, each sample will include the value of PMD5 (representing
ia64_inst_retired) and also the values of PMD0, PMD2, PMD3. PMD register
values are always stored in increasing index order. Note that the PMC
registers, if any, corresponding to those extra PMD registers are not
touched. As such they inherit their default values.
11. Getting support
The perfmon project (includes the kernel subsystem and related tools) has
a web site here.
The web site for earlier versions of pfmon is here.
There exists a mailing dedicated to perfmon related discussions. Go to
the perfmon web site for more details.
Last updated 10/18/2007
|