perfmon2: pfmon Itanium 2 processor documentation
Pfmon Itanium 2 processor documentation

This page describes the Itanium 2 processor-specific features of pfmon. It covers the McKinley and Madison processors. Support for the Dual-Core Itanium 2 (Montecito) is described on a separate page.

Table of contents for Itanium 2 processor
  1. Introduction
  2. Command line options
  3. Event thresholds
  4. Opcode matching
    1. Options
    2. Events
    3. Basic example
    4. Constraints
    5. Using a logical name for opcodes
  5. Event address registers (EARs)
  6. Branch Trace buffer (BTB)
  7. IA-32 monitoring
    1. Introduction
    2. The --ia32 and --ia64 options
    3. Per event instruction set tuning
    4. Some examples
    5. Limitations
  8. Address range restrictions
    1. Introduction
    2. Itanium 2 processor limitations
    3. Privilege level mask
    4. Inverting code ranges
    5. Examples
    6. The checkpoint-func option
  9. The --insecure and --dont-start options
  10. Interrupt-triggered execution
  11. The dear-inst sampling module
  12. References

1. Introduction

Pfmon provides access to ALL the Itanium2 PMU specific features. This includes:

  • 4 counters
  • Event Address Registers (Data & Code & ALAT)
  • Opcode matching
  • Address range restrictions (Data & Code) with fine mode and inverse code range
  • Branch Trace Buffer (BTB)
  • Event thresholds
  • IA-32 execution monitoring
  • rum/sum control at the user level
  • exclude/include interrupt triggered execution

2. Command line options

The Itanium2 processor specific options of pfmon are as follows:

--event-thresholds=thr1,thr2,...  set event thresholds (no space)
--opc-match8=[mifb]:match:mask    set opcode match for pmc8
--opc-match9=[mifb]:match:mask    set opcode match for pmc9
--btb-tm-tk                       capture taken IA-64 branches only
--btb-tm-ntk                      capture not taken IA-64 branches only
--btb-ptm-correct                 capture branch if target predicted correctly
--btb-ptm-incorrect               capture branch if target is mispredicted
--btb-ppm-correct                 capture branch if path is predicted correctly
--btb-ppm-incorrect               capture branch if path is mispredicted
--btb-brt-iprel                   capture IP-relative branches only
--btb-brt-ret                     capture return branches only
--btb-brt-ind                     capture non-return indirect branches only
--irange=start-end                specify an instruction address range constraint
--drange=start-end                specify a data address range constraint
--checkpoint-func=addr            a bundle address to use as checkpoint
--ia32                            monitor IA-32 execution only
--ia64                            monitor IA-64 execution only
--insn-sets=set1,set2,...         set per event instruction set (setX=[ia32|ia64|both])
--inverse-irange                  inverse instruction range restriction
--insecure                        allow rum/sum/read(pmd) at user level
--excl-intr                       exclude interrupt triggered execution
--incl-intr                       include only interrupt triggered execution

3. Event thresholds

Pfmon supports event thresholds, which refine how certain events are counted. If an event has a threshold set to n, the PMU does not count occurrences of that event unless it happens more than n times in a cycle. With a threshold of zero, which is the default, ALL occurrences are recorded. With a threshold of 3, the counter is incremented by one only in cycles where the event happens more than 3 times. Not all events accept the same threshold values. You can determine the maximum increment per cycle for each event using the event info (-i) option of pfmon:

   % pfmon -i nops_retired
   Name   : NOPS_RETIRED
   VCode  : 0x50
   Code   : 0x50
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 6  (Threshold [0-5])
   Qual   : [Instruction Address Range] [OpCode Match] 
   Group  : None
   Set    : None
   Desc   : Retired NOP Instructions

The information includes the maximum increment for the event. Here 6 means that the CPU can execute up to 6 nops per cycle, which corresponds to the dual-bundle maximum issue window of the processor. This combination is possible when using the right templates to fill all the execution units. Next to it you see the allowed values for the threshold, which, not surprisingly, go from 0 to the maximum increment minus 1.

Now if you want to count the number of times 6 nops are executed in a single cycle, you can do:

   % pfmon --event-thresholds=5 -e nops_retired ls /dev/null
   101 NOPS_RETIRED

You can specify a threshold for every event you use. Thresholds MUST be specified in the same order as the events.
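
The threshold rule described above can be pictured with a small software model. This is only a sketch of the semantics as stated in this section, not of the hardware or of pfmon itself; the function name and the per-cycle trace are made up for illustration.

```python
def counted_value(per_cycle_occurrences, threshold=0):
    """Model of a counter with an event threshold (illustrative only).

    threshold == 0: every occurrence is counted.
    threshold == n > 0: the counter advances by one in each cycle where
    the event happened more than n times.
    """
    if threshold == 0:
        return sum(per_cycle_occurrences)
    return sum(1 for n in per_cycle_occurrences if n > threshold)


# Hypothetical trace: number of nops retired in six consecutive cycles.
trace = [6, 2, 0, 6, 5, 3]
print(counted_value(trace))               # all occurrences
print(counted_value(trace, threshold=5))  # cycles with 6 nops
```

With NOPS_RETIRED (maximum increment 6), a threshold of 5 therefore counts the cycles in which 6 nops were executed.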


4. Opcode matching
4.1 Options

The opcode matcher feature constrains what is monitored based on the instruction opcode, an opcode pattern, or a functional unit.

Pfmon has two options to support this feature:

  • --opc-match8: set the template/match/mask value for PMC8 (first opcode matcher)
  • --opc-match9: set the template/match/mask value for PMC9 (second opcode matcher)

Each option takes an argument which is divided into three fields:

  1. mifb: the template match (M, I, F, B). This field is optional. To skip it, just provide the ':' separator
  2. match: 27-bit value, where bits [0-12] correspond to bits [0-12] of the instruction to match, and bits [13-26] correspond to bits [27-40] of the instruction to match. This field is mandatory
  3. mask: 27-bit bitmask, where bits [0-12] correspond to bits [0-12] of the instruction to match, and bits [13-26] correspond to bits [27-40] of the instruction to match. Each bit set in the mask is ignored during matching. This field is mandatory

4.2 Events

These options constrain what is included in the measurement but they do not set what is to be measured, i.e., which event. Often the user just wants to count the number of occurrences of a certain instruction or instruction pattern. For this, you need to combine PMC8/PMC9 with an event. To count the number of machine instructions constrained by:

  • PMC8, you need to use IA64_TAGGED_INST_RETIRED_IBRP0_PMC8 or IA64_TAGGED_INST_RETIRED_IBRP2_PMC8
  • PMC9, you need to use IA64_TAGGED_INST_RETIRED_IBRP1_PMC9 or IA64_TAGGED_INST_RETIRED_IBRP3_PMC9

4.3 Basic example

For instance, if you want to count the number of br.cloop executed in a program using PMC8, you can do:

   % pfmon --opc-match8=:0x2000140:0x7ffe3f -e IA64_TAGGED_INST_RETIRED_IBRP0_PMC8 ls /dev/null
   /dev/null
                       2134 IA64_TAGGED_INST_RETIRED_IBRP0_PMC8

The mifb, i.e., template, designation is optional. When not specified, it is equivalent to saying 'mifb', i.e., all slot types. The pattern 0x2000140 corresponds to the match field. The pattern 0x7ffe3f corresponds to the mask field. Any bit set to 1 in the mask field means the corresponding match bit is ignored.

The Itanium 2 opcode matchers are not 41 bits wide. As a consequence, not all instructions can be matched exactly.

The two opcode matchers are not symmetrical in what they can constrain. Please refer to the Itanium 2 documentation for further information.
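
To make the match/mask rule concrete, here is a minimal software model of the comparison, assuming only the bit layout given in section 4.1 (matcher bits [0-12] map to instruction bits [0-12], matcher bits [13-26] map to instruction bits [27-40]). The function names are illustrative, the template (mifb) check is left out, and the model says nothing about how the hardware implements the comparison.

```python
MATCHER_BITS = 0x7FFFFFF  # 27 bits


def matcher_view(insn41):
    """Fold a 41-bit instruction into the 27 bits the matcher can see."""
    low = insn41 & 0x1FFF            # instruction bits 0-12
    high = (insn41 >> 27) & 0x3FFF   # instruction bits 27-40
    return low | (high << 13)


def opcode_matches(insn41, match, mask):
    """True if the instruction agrees with match on every non-masked bit."""
    care = ~mask & MATCHER_BITS      # bits set in mask are ignored
    return (matcher_view(insn41) & care) == (match & care)


# Synthetic 41-bit pattern with 4 in bits 37-40 and 5 in bits 6-8,
# which are the only bits the example mask 0x7ffe3f leaves unmasked.
insn = (4 << 37) | (5 << 6)
print(opcode_matches(insn, 0x2000140, 0x7FFE3F))
```

Instruction bits 13 to 26 never reach the matcher in this model, which illustrates why some instructions cannot be matched exactly.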

4.4 Constraints

Not all events can be constrained with the opcode matchers. Pfmon will reject any invalid combination. You can figure out if an event supports the opcode matcher feature using the event info option of pfmon:

   % pfmon -i cpu_cycles
   Name   : CPU_CYCLES
   VCode  : 0x12
   Code   : 0x12
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 1  (Threshold 0)
   Qual   : None
   Group  : None
   Set    : None
   Desc   : CPU Cycles

Here you see on the Qual line that CPU_CYCLES does not support any constraint at all. But if we look at NOPS_RETIRED:

   %  pfmon -i nops_retired
   Name   : NOPS_RETIRED
   VCode  : 0x50
   Code   : 0x50
   PMD/PMC: [ 4 5 6 7 ]
   Umask  : 0000
   EAR    : No (N/A)
   BTB    : No
   MaxIncr: 6  (Threshold [0-5])
   Qual   : [Instruction Address Range] [OpCode Match] 
   Group  : None
   Set    : None
   Desc   : Retired NOP Instructions

You see that this event supports opcode matching: 'OpCode Match'

By default, pfmon rejects any attempt to use an opcode matcher with an event that does not support it. For most events, not supporting opcode matching means that the constraint is simply ignored, i.e., the event counts for all instructions. To enable experts to bypass this check, pfmon provides the --no-qual-check option.

4.5 Using a logical name for opcodes

This feature of pfmon, which relied on the .pfmon.conf file, has been abandoned in this version.


5. Event address registers (EARs)

The Event Address Registers provide a way to capture where cache, TLB, and ALAT misses occur. For each captured miss, you get the instruction address, the data address (if relevant), the latency of the miss (if relevant), and the TLB level at which the miss was resolved (if relevant).

Let us first look at cache misses. EARs DO NOT CAPTURE:

  • L1 cache accesses that do not miss.
  • Store accesses.

You can filter the misses you are interested in based on the miss latency. For instance, you can say that you want misses that take 16 cycles or more to resolve. The Itanium 2 PMU supports a fixed set of latencies going from 4 to 4096 cycles. Not all latencies are possible; they are usually powers of two.

The Itanium 2 PMU uses two events to indicate the type of cache misses: code or data. L1I_EAR_CACHE is used for instruction cache misses and DATA_EAR_CACHE is used for data cache misses. Similarly, DATA_EAR_ALAT is used for the ALAT. At the hardware level, the latency filter is programmed in one of the fields of the PMC controlling the monitor. However, to make it easier to use, the library on which pfmon is built encapsulates the latency with the event by creating 'virtual events'. If you list the events using pfmon -l and a regular expression of '_ear_', you get:

   % pfmon -l_ear_
   DATA_EAR_ALAT
   DATA_EAR_CACHE_LAT1024
   DATA_EAR_CACHE_LAT128
   DATA_EAR_CACHE_LAT16
   DATA_EAR_CACHE_LAT2048
   DATA_EAR_CACHE_LAT256
   DATA_EAR_CACHE_LAT32
   DATA_EAR_CACHE_LAT4
   DATA_EAR_CACHE_LAT4096
   DATA_EAR_CACHE_LAT512
   DATA_EAR_CACHE_LAT64
   DATA_EAR_CACHE_LAT8
   DATA_EAR_EVENTS
   DATA_EAR_TLB_ALL
   DATA_EAR_TLB_FAULT
   DATA_EAR_TLB_L2DTLB
   DATA_EAR_TLB_L2DTLB_OR_FAULT
   DATA_EAR_TLB_L2DTLB_OR_VHPT
   DATA_EAR_TLB_VHPT
   DATA_EAR_TLB_VHPT_OR_FAULT
   L1I_EAR_CACHE_LAT0
   L1I_EAR_CACHE_LAT1024
   L1I_EAR_CACHE_LAT128
   L1I_EAR_CACHE_LAT16
   L1I_EAR_CACHE_LAT256
   L1I_EAR_CACHE_LAT32
   L1I_EAR_CACHE_LAT4
   L1I_EAR_CACHE_LAT4096
   L1I_EAR_CACHE_LAT8
   L1I_EAR_CACHE_RAB
   L1I_EAR_EVENTS
   L1I_EAR_TLB_ALL
   L1I_EAR_TLB_FAULT
   L1I_EAR_TLB_L2TLB
   L1I_EAR_TLB_L2TLB_OR_FAULT
   L1I_EAR_TLB_L2TLB_OR_VHPT
   L1I_EAR_TLB_VHPT
   L1I_EAR_TLB_VHPT_OR_FAULT

You see the events for both TLB and caches and the ALAT. For instance, DATA_EAR_CACHE_LAT64 is the event used to capture data cache misses with a latency of 64 cycles OR more. Similarly, the DATA_EAR_TLB_VHPT is used to capture TLB misses that were resolved by the hardware walker (VHPT). The Data EAR events are all sub-events of DATA_EAR_EVENTS. Similarly the Instruction EAR events are all sub-events of L1I_EAR_EVENTS.

You can get detailed information about EAR events using the event info (-i) option of pfmon:

   % pfmon -i DATA_EAR_CACHE_LAT64
   Name   : DATA_EAR_CACHE_LAT64
   VCode  : 0x405c8
   Code   : 0xc8
   counter: [ 4 5 6 7 ]
   Umask  : 0100
   EAR    : Data (Cache Mode)
   BTB    : No
   MaxIncr: 1  (Threshold 0)
   Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range]
   Group  : None
   Set    : None
   Desc   : Data EAR Cache -- >= 64 Cycles

The EARs are mostly used for sampling, therefore you typically associate a sampling period with them. You configure a sampling period with EAR just like you would do with regular events.

Let us take a simple example. Suppose you want to capture the data cache misses that take 4 cycles or more. The sampling period is set to 2000, which is quite small, but is just used to show the sampling output:

   % pfmon --smpl-module=detailed-ia64 --long-smpl-periods=2000 -e DATA_EAR_CACHE_LAT4 -- ls -l /dev/null
   crw-rw-rw-  1 root root 1, 3 Aug 19 09:02 /dev/null
   entry 0 PID:4123 TID:4123 CPU:0 STAMP:0x6555120cf2a IIP:0x200000000000d651 OVFL:4 LAST_VAL:2000 SET:0
   PMD2  : miss address=0x600000000000d348
   PMD3  : valid=Y, latency=7 cycles, latency overflow=N
   PMD17 : valid=Y, instr address=0x200000000000d5f1

Here again, we get sampling entries with the usual header. However, the information in the body of each sample is quite different from what we saw earlier. With the detailed output format for IA-64, pfmon decodes the meaning of each PMD which contains EAR information. For instance, with EAR and data cache misses, PMD3 contains the latency of the miss. In entry 0, the miss took 7 cycles to resolve. The data that was being accessed was at address 0x600000000000d348 (PMD2) and the instruction which generated the access was at 0x200000000000d5f1 (PMD17). The instruction slot information is encoded in the address field (low 2 bits) as 0, 1, or 2.
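
As a small illustration of the slot encoding, the following sketch splits a reported instruction address into its bundle address and slot number. It assumes only that an IA-64 bundle is 16 bytes and that the slot is carried in the low 2 bits, as stated above; the function name is illustrative.

```python
def split_instruction_address(addr):
    """Return (bundle_address, slot) for an address with the slot in bits 0-1."""
    bundle = addr & ~0xF   # bundles are 16-byte aligned
    slot = addr & 0x3      # slot number: 0, 1, or 2
    return bundle, slot


bundle, slot = split_instruction_address(0x200000000000D5F1)
print(hex(bundle), slot)
```

For the sample above, the instruction that caused the miss sits in slot 1 of the bundle at 0x200000000000d5f0.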

If we look at the TLB instead, we get samples that look as follows:

   % pfmon --long-smpl-periods=20 -e DATA_EAR_TLB_VHPT -- ls -l /dev/null
   entry 0 PID:4135 TID:4135 CPU:0 STAMP:0x6aeac1940d7 IIP:0x2000000000023e11 OVFL:4 LAST_VAL:20 SET:0
   PMD2  : miss address=0x20000000003dc090
   PMD3  : valid=Y, TLB:VHPT
   PMD17 : valid=Y, instr address=0x2000000000023e00

Note that this time the interpretation of PMD3 has changed. In TLB mode, you specify the level at which you want to capture the misses. Here we wanted TLB requests that missed in L1 and hit in the VHPT, and that is what is reflected by PMD3. There is no latency information on TLB misses. PMD17 contains the address of the instruction that caused the TLB miss, and PMD2 is the address of the data that was being accessed.

Cache and TLB misses can also be captured for instructions. Pfmon operates in the same manner for instructions. The difference is in the information that is captured.

For instance, if we want to capture the instruction TLB misses that hit in the VHPT you can do as follows:

   % pfmon --long-smpl-periods=20 -e L1I_EAR_TLB_VHPT -- ls -l /dev/null
   entry 0 PID:4449 TID:4449 CPU:0 STAMP:0x8fda49affee IIP:0x2000000000028730 OVFL:4 LAST_VAL:20 SET:0
   PMD0  : valid=Y, cache line 0x2000000000028720, TLB:VHPT

This time, the set of PMDs used to capture the information is different, allowing both data and instruction EAR to operate in parallel. In our example, PMD0 contains the address of the cache line that caused the TLB miss (which was resolved by the VHPT).

For instruction cache misses, you can do:

   % pfmon --long-smpl-periods=5000 -e L1I_EAR_CACHE_LAT8 -- ls -l /dev/null
   crw-rw-rw-  1 root root 1, 3 Aug 19 09:02 /dev/null
   entry 0 PID:4481 TID:4481 CPU:0 STAMP:0x92744d61e5f IIP:0x20000000000233e0 OVFL:4 LAST_VAL:500 SET:0
   PMD0  : valid=Y, cache line 0x2000000000023580
   PMD1  : latency=15 cycles, latency overflow=N

This time both PMD0 and PMD1 contain relevant information. PMD0 contains the address of the cache line that caused the miss and PMD1 the latency to resolve it.

6. Branch Trace buffer (BTB)

The BTB is used to capture branch events. Depending on its configuration, it can record the source and target of each branch instruction. Branches can be filtered based on how they were predicted by the hardware, whether they were taken or not taken, and so on. Each qualified branch is recorded into the branch buffer and usually takes two entries (a pair): one for the source (the branch instruction itself) and one for the target of the branch. The hardware buffer has 8 entries, meaning that it can hold up to 4 branch events. The buffer is managed as a ring buffer: once it is full, the oldest entries get overwritten. The PMD16 register maintains the index, i.e., where to write next. It also contains a flag indicating whether or not the buffer wrapped around.
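
Reading such a ring buffer in chronological order can be sketched as follows. The sketch takes the write index and the wrap flag as plain arguments; how these two values are laid out inside PMD16 is not modeled here, and the function and parameter names are hypothetical.

```python
def oldest_first(entries, next_index, wrapped):
    """Order ring-buffer entries from oldest to newest (illustrative only).

    entries:    the 8 slots in register order (PMD8 first)
    next_index: slot that would be written next
    wrapped:    True once the buffer has been filled at least once
    """
    if not wrapped:
        # Only the slots before the write index have been filled.
        return entries[:next_index]
    # After a wrap, the slot about to be overwritten is the oldest one.
    return entries[next_index:] + entries[:next_index]


slots = ["PMD%d" % n for n in range(8, 16)]
print(oldest_first(slots, next_index=3, wrapped=True))
```

This is why the oldest recorded branch is not necessarily found in PMD8.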

You can count how many branches are captured using the BRANCH_EVENT event. You MUST use this event if you want to sample with the BTB. Because the BTB can hold 4 branches, sampling with the BTB means that at the end of each sampling period, up to the last 4 branches are recorded.

By default, pfmon will capture ALL branches (taken, not taken, predicted correctly or mispredicted). Let us take a look at a simple example:

   % pfmon --long-smpl-periods=5000 -e branch_event -- ls -l /dev/null
   crw-rw-rw-  1 root root 1, 3 Aug 19 09:02 /dev/null
   ....
   entry 19 PID:25321 TID:25321 CPU:0 STAMP:0x696ff7355b49 IIP:0x200000000016a7b2 OVFL:4 LAST_VAL:5000 SET:0
        PMD8  : 0x200000000016a539 b=1 mp=0 bru=0 b1=0 valid=y
               source addr=0x200000000016a532
               taken=y prediction=success
        PMD9  : 0x200000000016a6d2 b=0 mp=1 bru=0 b1=0 valid=y
               target addr=0x200000000016a6d0
        PMD10 : 0x200000000016a6f9 b=1 mp=0 bru=0 b1=0 valid=y
               source addr=0x200000000016a6f2
               taken=y prediction=success
        PMD11 : 0x200000000016a539 b=1 mp=0 bru=0 b1=0 valid=y
               source addr=0x200000000016a532
               taken=y prediction=success
        PMD12 : 0x200000000016a6d2 b=0 mp=1 bru=0 b1=0 valid=y
               target addr=0x200000000016a6d0
        PMD13 : 0x200000000016a6ff b=1 mp=1 bru=1 b1=0 valid=y
               source addr=0x200000000016a6f0
               taken=n prediction=FE failure
        PMD14 : 0x200000000016a739 b=1 mp=0 bru=0 b1=1 valid=y
               source addr=0x200000000016a742
               taken=y prediction=success
        PMD15 : 0x200000000016a7b2 b=0 mp=1 bru=0 b1=0 valid=y
               target addr=0x200000000016a7b0
   ....

This time, each entry contains as many as 8 PMDs. Pfmon always prints the branches in the order in which they occurred. Because the BTB is a cyclic buffer, it may be that the oldest branch is not in PMD8.

If we look at the example, PMD8 is the oldest branch in the buffer. It reports a branch source address at 0x200000000016a532, i.e., 0x200000000016a530 bundle slot 2. The branch was taken and predicted correctly by the processor. The second entry, PMD9, reports a branch target, where the branch in PMD8 went to, at address 0x200000000016a6d0. It is possible to have two consecutive branch source entries. This is a special case of the BTB. Refer to the PMU documentation for more details.

It is possible to vary the kind of branches that are recorded using the following options:
--btb-brt-iprel      capture IP-relative branches only
--btb-brt-ret        capture return branches only
--btb-brt-ind        capture non-return indirect branches only
--btb-tm-tk          capture taken IA-64 branches only
--btb-tm-ntk         capture not taken IA-64 branches only
--btb-ptm-correct    capture branch if target predicted correctly
--btb-ptm-incorrect  capture branch if target is mispredicted
--btb-ppm-correct    capture branch if path is predicted correctly
--btb-ppm-incorrect  capture branch if path is mispredicted

You can combine various --btb-* options; however, be aware that they are ANDed by the PMU. That means that if you specify --btb-ppm-incorrect and --btb-ptm-incorrect, pfmon captures branches which have mispredicted paths AND mispredicted targets at the same time.


7. IA-32 monitoring
7.1 Introduction

By default, pfmon captures events for both IA-32 and IA-64 programs. Not all events are functional in IA-32 mode. The following features are not available when monitoring in IA-32 mode ONLY:

  • The Branch Trace Buffer (BRANCH_EVENT)
  • Code range restriction (--irange, --checkpoint-func)
  • Data range restriction (--drange)

However those features are accepted when monitoring for both IA-64 and IA-32 (default). The results will ONLY represent what was generated by the IA-64 execution.

7.2 The --ia32 and --ia64 options

Using the --ia32 option, the user restricts monitoring to execution occurring while psr.is = 1, i.e., to IA-32 code. Using the --ia64 option restricts monitoring to IA-64 code only, i.e., psr.is = 0. Note that these options apply to ALL specified events.

7.3 Per event instruction set tuning

Pfmon also provides a way to fine-tune the instruction set on a per event basis using the --insn-sets option. The order in which the events are listed determines which event each instruction set option applies to. The first event gets the first instruction set option specified and so on. You do not need to specify an instruction set option for every event. Any event for which no instruction set is specified uses whatever the "global" option, i.e., --ia64 or --ia32, is set to. Note that by default, pfmon monitors both IA-64 and IA-32 at the same time. You can skip certain events, for instance:

   % pfmon --insn-sets=,ia64 -e l2_misses,l2_misses hello

This will have the first l2_misses event use the default mode, i.e. IA-64 & IA32, while the second l2_misses will be configured for IA-64 only. Similarly, the following command:

   % pfmon --insn-sets=ia32 -e l2_misses,l2_misses hello

will set the first l2_misses event for IA-32 only and the second for both IA-64 and IA-32.
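
The positional rule can be summarized with a short sketch. It only models the behavior described in this section (empty or missing positions fall back to the global setting); it is not pfmon's parser, and the function name and its default argument are illustrative.

```python
VALID_SETS = ("ia32", "ia64", "both")


def assign_insn_sets(events, insn_sets="", default="both"):
    """Pair each event with its instruction set, by position (illustrative)."""
    given = insn_sets.split(",") if insn_sets else []
    if len(given) > len(events):
        raise ValueError("more instruction sets than events")
    result = []
    for i, event in enumerate(events):
        # An empty or missing position falls back to the global setting.
        choice = given[i] if i < len(given) and given[i] else default
        if choice not in VALID_SETS:
            raise ValueError("unknown instruction set: " + choice)
        result.append((event, choice))
    return result


print(assign_insn_sets(["l2_misses", "l2_misses"], ",ia64"))
print(assign_insn_sets(["l2_misses", "l2_misses"], "ia32"))
```

The two calls mirror the two command lines above: the first leaves the first event on the default and sets the second to IA-64, the second sets only the first event to IA-32.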

7.4 Some examples

Let us look at a simple example with two hello programs: one is an IA-64 binary (hello) and the other is the same program compiled as an IA-32 binary (hello.x86):

   % file hello
   hello: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
   % pfmon --insn-sets=ia32,ia64 -e l2_misses,l2_misses hello
   Hello world
                          0 L2_MISSES
                        578 L2_MISSES

Here we measure the same event twice, but the first instance is configured to monitor IA-32 execution whereas the second monitors IA-64. When running an IA-64 binary, the IA-32 count is 0. Now let us see what happens with an IA-32 binary:

   % file hello.x86
   hello.x86: ELF 32-bit LSB executable, Intel 80386, version 1, statically linked, not stripped
   % pfmon --insn-sets=ia32,ia64 -e l2_misses,l2_misses hello.x86
   Hello world
                        184 L2_MISSES
                          0 L2_MISSES

Now the first counter reports a non zero value.

7.5 Limitations

Linux/ia64 does not currently support processes where both instruction sets are mixed. However, the dual mode (IA-32, IA-64) is interesting when running system-wide monitoring where all execution is captured. The Linux/ia64 kernel execution ALWAYS happens in IA-64 mode, therefore using --ia32 to monitor kernel level execution has no effect.

Similarly, some events are only relevant in one mode. For instance, IA32_INST_RETIRED only counts IA-32 instructions. Conversely, IA64_INST_RETIRED will return 0 on an IA-32 program.


8. Address range restrictions

This feature does not use the generic trigger options (--trigger-*) offered by pfmon. However, both cannot be used at the same time because they are implemented using the same underlying hardware resource.

8.1 Introduction

Pfmon allows the monitoring to be constrained to a certain range of data or code addresses and provides the following set of options:

--irange=start-end|code_symbol           specify a code address range
--drange=start-end|data_symbol           specify a data address range
--checkpoint-func=code_addr|code_symbol  specify a checkpoint address
--inverse-irange                         inverse a code range

The third option is a refinement of the first option as we will see shortly.

A range is defined by its boundaries, a start and an end address. For code ranges, monitoring is active when the processor executes instructions between the two limits of the range. For data ranges, monitoring is active for all data accesses between the two limits of the range. In the case of code ranges, it is important to realize that if you are in a function included in the range and that function calls another one which is outside, then the callee function is not monitored. That is a major difference with the generic code triggers, where monitoring is activated when you execute the bundle at the starting point and continues as long as execution does not cross the end point.

The range can be specified in hexadecimal or decimal. Alternatively, the range can be specified using symbols from the program.

Pfmon currently supports only one range per type at a time, i.e., you cannot specify two instruction ranges. When a range is specified using numerical values, pfmon does not try to see if the range represents a valid part of the address space of the process. It simply does a sanity check on the bounds. It is possible to specify code or data ranges inside the kernel. When symbols are used, pfmon checks that the symbol corresponds to data for --drange and to code for --irange and --checkpoint-func. For a code range, pfmon verifies that the bounds are bundle-aligned.

The range can be delimited by two symbols, but pfmon also supports using a single symbol. In this case, it will use the size of the symbol which is encoded in the symbol table. Note that this can be an approximation in certain situations.

NOTE: in case the symbol size is missing from the symbol table be it from the ELF archive or from a /proc interface for the kernel, pfmon uses an approximation by calculating the distance with the next symbol. This is a rather gross approximation and users are advised to check the actual range using the verbose option of pfmon (-v).

8.2 Itanium 2 processor limitations

The Itanium 2 PMU imposes some restrictions on the alignment of the ranges due to the way they are implemented, i.e., using the debug registers. To compensate, there is a fine mode where a code range is specified with a start and an end address instead of a mask. However, a fine mode range is limited to 4KB and cannot cross a 4KB page boundary. Depending on the event, it may not be possible to use multiple debug register pairs to gain accuracy when the fine mode cannot be used. Even with multiple pairs, it is possible that the programmed range will be slightly larger than what was asked for. You can determine which mode was used for each range, and also by how much the debug registers will 'bleed' from the specified range, by using the --verbose option of pfmon:

   % pfmon --verbose --irange=0x1000-0x1590 -e ia64_inst_retired /bin/ls /dev/null
   ...
   irange is [0x1000-0x1590)=1424 bytes
   ...
   [0x1000-0x1590): 2 register pair(s), fine_mode
   start offset: -0x0 end_offset: +0x10
   brp0:  db0: 0x0000000000001000 db1: plm=0x8 mask=0x00000000fffff000
   brp2:  db4: 0x0000000000001590 db5: plm=0x8 mask=0x00000000fffff000
   ...

As you can see here, it was possible to use the fine mode because the size of the range was below the 4KB limit and the range was not crossing a 4KB page boundary. Fine mode has a granularity of two bundles. As such, the range must be aligned on a 2-bundle boundary, otherwise there is a 1-bundle bleed, as shown in the example above for the actual end offset.
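
The fine mode rules stated above (at most 4KB, no 4KB page crossing, 2-bundle granularity) can be checked with a few lines of arithmetic. This is a sketch under those stated rules only, assuming 16-byte bundles and an exclusive end address as in the verbose output; it is not how pfmon computes its offsets, and the names are illustrative.

```python
PAGE_SIZE = 4096
TWO_BUNDLES = 32  # one IA-64 bundle is 16 bytes


def fine_mode_plan(start, end):
    """Return (eligible, start_bleed, end_bleed) for the range [start, end)."""
    size = end - start
    same_page = (start // PAGE_SIZE) == ((end - 1) // PAGE_SIZE)
    eligible = 0 < size <= PAGE_SIZE and same_page
    start_bleed = start % TWO_BUNDLES   # bytes covered before the start
    end_bleed = (-end) % TWO_BUNDLES    # bytes covered past the end
    return eligible, start_bleed, end_bleed


eligible, start_bleed, end_bleed = fine_mode_plan(0x1000, 0x1590)
print(eligible, hex(start_bleed), hex(end_bleed))
```

For the range 0x1000-0x1590 this gives a start offset of 0 and an end offset of 0x10, i.e., one bundle, in line with the verbose output shown above.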

Just like for the opcode matcher, not all events support address range restrictions. You can use the event info option (-i) to verify. When an event does not support range restriction, it typically means that the range is ignored. Expert users may use the --no-qual-check option to bypass the checks done by pfmon.

The --drange option works just like the --irange option. In fact, both can be combined as they rely on distinct sets of debug registers.

IMPORTANT: The program being monitored by pfmon MUST NOT be using the debug registers.

8.3 Privilege level mask

The range restriction also uses a privilege level mask. It has the same role as the one for events. Pfmon uses the default global privilege level to set up the range restrictions. For instance, in the following example, the debug registers are programmed with plm=0x8, i.e., user level only:

   % pfmon --irange=main --verbose -eloads_retired,nops_retired,loads_retired noploop 1000000000
   ...
   [0x40000000000004c0-0x4000000000000690): 2 register pair(s), fine_mode
   start offset: -0x0 end_offset: +0x10
   brp0:  db0: 0x40000000000004c0 db1: plm=0x8 mask=0x00fffffffffff000
   brp2:  db4: 0x4000000000000690 db5: plm=0x8 mask=0x00fffffffffff000
   ...

But when privilege level masks are set per event, there can be confusion as the range is systematically applied to all events. Therefore pfmon disallows the use of the --priv-levels option when a range is provided and vice-versa.

8.4 Inverting code ranges

It is possible to invert the code range using the --inverse-irange option. Inverting the code range means that the PMU will count the events only when they occur outside the specified range.

Let us use our noploop example to demonstrate what happens. First, let us measure the total number of nop instructions retired.

   % pfmon -enops_retired noploop 1000000000
   1000005380 NOPS_RETIRED

Now if we focus on the core loop function:

   % pfmon --irange=noploop -enops_retired noploop 1000000000
   1000000002 NOPS_RETIRED

Now if we invert the range, i.e., count all the nops outside of noploop():

   % pfmon --inverse --irange=noploop -enops_retired ~/nueh/noploop 1000000000
   5381 NOPS_RETIRED

We can verify that 1000000002+5381 is about the same as 1000005380.

8.5 Some examples

Let us look at some more examples which use symbols directly.

First, suppose we have a program which contains a data array called b and we want to know the number of loads from the array:

   % pfmon -eloads_retired -v --drange=b my_program
   ...
   symbol b (data): [0x6000000000001800-0x6000000000003800)=8192 bytes
   drange is [0x6000000000001800-0x6000000000003800)=8192 bytes
   ...
   [0x6000000000001800-0x6000000000003800): 3 register pair(s)
   start offset: -0x0 end_offset: +0x0
   brp0:  db0: 0x6000000000002000 db1: plm=0x8 mask=0x00fffffffffff000 end=0x6000000000002fff
   brp1:  db2: 0x6000000000001800 db3: plm=0x8 mask=0x00fffffffffff800 end=0x6000000000001fff
   brp2:  db4: 0x6000000000003000 db5: plm=0x8 mask=0x00fffffffffff800 end=0x60000000000037ff
   ...
   409600000 LOADS_RETIRED

Here pfmon was able to extract the size of b directly from the symbol table. The array is aligned properly for its size, therefore both start and end offsets are 0.
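
One way to see why this range needs three register pairs is to split it into naturally aligned power-of-two blocks, since each pair covers one address/mask region. The greedy decomposition below is a general technique and only an assumption about what happens here, not pfmon's actual algorithm; the function names and the 56-bit mask width (read off the verbose output) are illustrative.

```python
def aligned_blocks(start, end):
    """Split [start, end) into naturally aligned power-of-two blocks."""
    blocks = []
    while start < end:
        # Largest power of two that divides the current start address.
        align = start & -start if start else 1 << 64
        # Largest power of two that still fits in what is left.
        fit = 1 << ((end - start).bit_length() - 1)
        size = min(align, fit)
        blocks.append((start, size))
        start += size
    return blocks


def block_mask(size, width=56):
    """Address mask covering one block (width is an assumption)."""
    return ((1 << width) - 1) & ~(size - 1)


for base, size in aligned_blocks(0x6000000000001800, 0x6000000000003800):
    print(hex(base), hex(size), "mask=0x%016x" % block_mask(size))
```

For the array above, this yields three blocks of 2KB, 4KB and 2KB, whose bases and masks agree with the brp1, brp0 and brp2 lines of the verbose output.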

Now suppose we want to know the number of loads from B which were executed in function doit(). We can combine --irange with --drange for LOADS_RETIRED:

   % pfmon --verb --irange=doit --drange=B -e loads_retired my_test_program
   ...
   symbol doit (code): [0x4000000000003000-0x40000000000030d0)=208 bytes
   irange is [0x4000000000003000-0x40000000000030d0)=208 bytes
   symbol B (data): [0x600000000001c000-0x6000000000024000)=32768 bytes
   drange is [0x600000000001c000-0x6000000000024000)=32768 bytes
   ...
   [0x4000000000003000-0x40000000000030d0): 2 register pair(s), fine_mode
   start offset: -0x0 end_offset: +0x10
   brp0:  db0: 0x4000000000003000 db1: plm=0x8 mask=0x00fffffffffff000
   brp2:  db4: 0x40000000000030d0 db5: plm=0x8 mask=0x00fffffffffff000
   ...
   [0x600000000001c000-0x6000000000024000): 2 register pair(s)
   start offset: -0x0 end_offset: +0x0
   brp0:  db0: 0x6000000000020000 db1: plm=0x8 mask=0x00ffffffffffc000 end=0x6000000000023fff
   brp1:  db2: 0x600000000001c000 db3: plm=0x8 mask=0x00ffffffffffc000 end=0x600000000001ffff
   ...
                     100000000 LOADS_RETIRED

Here, pfmon extracted the size of function doit() from the symbol table and it is small enough to qualify for the fine mode. Due to the fine mode, we still have an end offset of 1 bundle. The run shows that all of the loads are coming from doit().

8.6 The checkpoint-func option

The --checkpoint-func option is a variation of the --irange option; as such, it cannot be used in conjunction with --irange. It allows a user to specify a bundle address and can be used to verify that execution crosses a certain point (bundle). When the bundle is the first of a function, you can check how many times the function is called. You need to combine the constraint with the IA64_INST_RETIRED event. The result then needs to be divided by three to get the number of calls. Note that pfmon does not require the bundle to be the first of a function; in fact, it can be any bundle. There is no equivalent of this option for data.

With this option, you can easily determine the number of times a particular system call is invoked. For instance, to count the number of times the system call sys_open() is invoked:

   % pfmon --verb -k --checkpoint-func=sys_open -e ia64_inst_retired ls /dev/null
    loaded 15082 text symbols from /proc/kallsyms
    loaded 3151 data symbols from /proc/kallsyms
    using approximation for size of symbol sys_open
    symbol sys_open (code): [0xa00000010011f7c0-0xa00000010011f840)=128 bytes
    checkpoint function at 0xa00000010011f7c0
    [0xa00000010011f7c0-0xa00000010011f7d0): 1 register pair(s)
    start offset: -0x0 end_offset: +0x0
    brp0:  db0: 0xa00000010011f7c0 db1: plm=0x1 mask=0x00fffffffffffff0 end=0xa00000010011f7cf
    ...
   /dev/null
   108 IA64_INST_RETIRED

Here we specified -k to monitor at the kernel level, given that sys_open() is a kernel function. The count is 108, which indicates that the function was called 36 times (36=108/3). The result is ALWAYS a multiple of 3 because there are 3 instructions per bundle (predicated-off instructions are counted here).
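The conversion from the IA64_INST_RETIRED count to a number of calls is a simple division. A minimal sketch follows. The function name `calls_from_count` is hypothetical and not part of pfmon.

```python
# Sketch: turn an IA64_INST_RETIRED count measured with --checkpoint-func
# into a number of times the checkpoint bundle was executed.

INSTRUCTIONS_PER_BUNDLE = 3  # an IA-64 bundle holds 3 instruction slots


def calls_from_count(inst_retired):
    """Return how many times the checkpoint bundle was crossed."""
    if inst_retired % INSTRUCTIONS_PER_BUNDLE != 0:
        # Not expected for a single-bundle checkpoint measured this way.
        raise ValueError("count is not a multiple of 3")
    return inst_retired // INSTRUCTIONS_PER_BUNDLE


print(calls_from_count(108))
```

With the count of 108 from the run above, the sketch prints 36.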

The use of any other event is possible here if that event supports the instruction address range restriction (see pfmon -i). But to count the number of times the function is invoked you MUST use IA64_INST_RETIRED.

At this point only one checkpoint per session is supported.


9. The --insecure and --dont-start options

The Itanium architecture allows user applications to start/stop monitoring with a single assembly instruction: rum and sum respectively. The perfmon2 interface allows this level of control for self-monitoring threads by default. Upon special request, it is also possible to enable it for non self-monitoring threads which is what pfmon does.

To request this special mode, you can use the --insecure option. This option does not work in system-wide mode. With this option, if the monitored thread executes a rum on psr.up, monitoring stops, and if it executes a sum on psr.up, monitoring starts. This is in addition to the start/stop issued by pfmon. In fact, it can conflict with it. As such, this option is reserved for experts.

The --dont-start option tells pfmon to skip the activation of monitoring. In per-thread mode, it means the thread runs with monitoring disabled. This is not useful unless the option is combined with --insecure. The combination of the two options makes it possible to plant rum/sum instructions around sections of code to monitor and have only those sections monitored.


10. Interrupt-triggered execution

In system-wide mode, it is possible to either include or exclude interrupt-triggered execution in the kernel from active monitoring by using the --excl-intr or --intr-only options. These two options have no effect when executing at the user level; they only influence monitoring for events programmed to measure at the kernel level. These options are supported per-set.


11. The dear-hist sampling module

The dear-hist module produces a histogram of cache and TLB misses when combined with an EAR event and sampling. The sampling setup is the same as with the default sampling format, but the processing of the samples is different resulting in a histogram. The module supports several options:

--smpl-show-func      collapse instruction samples to function level
--smpl-show-top=n     show only the top n samples/functions. Useful when you have lots of samples with low counts
--smpl-inst-view      show by loads (instructions) that caused the miss (default)
--smpl-data-view      show by data addresses which caused a miss
--smpl-level-view     show by level where miss was resolved

The module does not work if you are not using an EAR event as listed here.

To use the module, you need to explicitly request it as follows:

   $ pfmon --resolv --smpl-module=dear-hist -edata_ear_cache_lat4 --long-smpl-periods=10000 --smpl-periods-random=0xff:5 foo
   # total_samples 9790
   # instruction addr view
   # sorted by count
   # showing per distinct value
   # L2   :  5 cycles load latency
   # L3   : 14 cycles load latency
   # %L2  : percentage of L1 misses that hit L2
   # %L3  : percentage of L1 misses that hit L3
   # %RAM : percentage of L1 misses that hit memory
   #count   %self    %cum     %L2     %L3    %RAM   instruction addr
       1657  16.93%  16.93%  98.49%   1.51%   0.00% 0x4000000000001ed0 f1+0x10
       1602  16.36%  33.29%   0.00% 100.00%   0.00% 0x4000000000001ed1 f1+0x11
       1296  13.24%  46.53%   0.00% 100.00%   0.00% 0x4000000000001ff1 f1+0x131
       1243  12.70%  59.22%   0.00% 100.00%   0.00% 0x4000000000002001 f1+0x141
       1226  12.52%  71.75% 100.00%   0.00%   0.00% 0x4000000000002000 f1+0x140
       1217  12.43%  84.18% 100.00%   0.00%   0.00% 0x4000000000001ff0 f1+0x130
        403   4.12%  88.29%   3.23%  54.84%  41.94% 0x4000000000002340 verify+0x100
        112   1.14%  89.44%   0.00%  87.50%  12.50% 0x4000000000001611 f2+0x291
        111   1.13%  90.57%   6.31%  66.67%  27.03% 0x4000000000001660 f2+0x2e0
        109   1.11%  91.69%   6.42%  75.23%  18.35% 0x4000000000001630 f2+0x2b0
        107   1.09%  92.78%   5.61%  82.24%  12.15% 0x40000000000015e1 f2+0x261
        106   1.08%  93.86%   5.66%  71.70%  22.64% 0x4000000000001641 f2+0x2c1
        103   1.05%  94.91%   6.80%  66.02%  27.18% 0x4000000000001671 f2+0x2f1
         94   0.96%  95.87%  97.87%   1.06%   1.06% 0x40000000000015a0 f2+0x220
         92   0.94%  96.81%   6.52%  67.39%  26.09% 0x4000000000001690 f2+0x310
         57   0.58%  97.40%   0.00%  94.74%   5.26% 0x4000000000001591 f2+0x211
         56   0.57%  97.97%   0.00% 100.00%   0.00% 0x40000000000015d1 f2+0x251
         45   0.46%  98.43%  44.44%  53.33%   2.22% 0x4000000000003a71 memcpy+0x411
         38   0.39%  98.82%  86.84%  10.53%   2.63% 0x4000000000001590 f2+0x210
         37   0.38%  99.19%  59.46%  40.54%   0.00% 0x4000000000003aa1 memcpy+0x441
         37   0.38%  99.57%   0.00% 100.00%   0.00% 0x4000000000002311 verify+0xd1
         ...

The example above shows the default view, which is instruction-oriented. The L2, L3, RAM columns attempt to sort cache misses based on their latencies. This is an approximation because some cache lines may already have been in flight at the time of the miss. For the first line of the histogram, at address f1+0x10, 98% of the misses from this load were resolved in the L2 cache.
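The %self and %cum columns can be recomputed from the raw counts and the total number of samples. The sketch below assumes that %self is the share of one row in total_samples and that %cum is the running sum over the sorted rows. The function name `self_and_cum` is hypothetical and does not come from the dear-hist module.

```python
# Sketch: recompute the %self and %cum columns of a dear-hist histogram.
# Assumes rows are already sorted by decreasing count, as in the output.

def self_and_cum(counts, total_samples):
    """Return (count, %self, %cum) tuples with two-decimal percentages."""
    rows = []
    running = 0
    for count in counts:
        running += count
        pct_self = 100.0 * count / total_samples
        pct_cum = 100.0 * running / total_samples
        rows.append((count, f"{pct_self:.2f}%", f"{pct_cum:.2f}%"))
    return rows


# First three counts of the instruction view, with total_samples 9790.
for row in self_and_cum([1657, 1602, 1296], 9790):
    print(row)
```

For these three rows the sketch reproduces the percentages printed in the histogram.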

We can collapse per function, instead of per-instruction, and we can shorten the histogram to the top 10 functions using the following command line:

   $ pfmon --resolv --smpl-module=dear-hist --smpl-show-top=10 --smpl-show-func -edata_ear_cache_lat4 \
      --long-smpl-periods=10000 --smpl-periods-random=0xff:5 foo
   # total_samples 9794
   # function addr view
   # sorted by count
   # showing per function histogram
   # L2   :  5 cycles load latency
   # L3   : 14 cycles load latency
   # %L2  : percentage of L1 misses that hit L2
   # %L3  : percentage of L1 misses that hit L3
   # %RAM : percentage of L1 misses that hit memory
   #count   %self    %cum     %L2     %L3    %RAM      function addr
     8244  84.17%  84.17%  49.81%  50.19%   0.00% 0x4000000000001ec0 f1
     1000  10.21%  94.38%  19.10%  66.50%  14.40% 0x4000000000001380 f2
      466   4.76%  99.14%   2.15%  62.02%  35.84% 0x4000000000002240 verify
       84   0.86% 100.00%  65.48%  33.33%   1.19% 0x4000000000003660 memcpy

We can look on the data side for the same run, using the --smpl-data-view option:

   $ pfmon --smpl-module=dear-hist --smpl-show-top=10 --smpl-data-view -edata_ear_cache_lat4 \
     --long-smpl-periods=10000 --smpl-periods-random=0xff:5 foo
   # total_samples 9790
   # data addr view
   # sorted by count
   # showing per function histogram
   # L2   :  5 cycles load latency
   # L3   : 14 cycles load latency
   # %L2  : percentage of L1 misses that hit L2
   # %L3  : percentage of L1 misses that hit L3
   # %RAM : percentage of L1 misses that hit memory
   # #count   %self    %cum     %L2     %L3    %RAM          data addr
       1704  17.41%  17.41%   0.00%  99.94%   0.06% 0x60000ffffe4c7620
       1669  17.05%  34.45%  97.84%   2.16%   0.00% 0x60000ffffe4c7628
       1258  12.85%  47.30%   0.00% 100.00%   0.00% 0x6000000000005698
       1212  12.38%  59.68% 100.00%   0.00%   0.00% 0x6000000000005690
       1201  12.27%  71.95% 100.00%   0.00%   0.00% 0x60000000000056a0
       1198  12.24%  84.19%   0.00% 100.00%   0.00% 0x60000000000056a8
          2   0.02%  84.21%   0.00% 100.00%   0.00% 0x6000000010476cb8
          2   0.02%  84.23%   0.00% 100.00%   0.00% 0x60000000103aff10
          1   0.01%  84.24%   0.00% 100.00%   0.00% 0x600007ffffffe0f8
          1   0.01%  84.25%   0.00% 100.00%   0.00% 0x60000000107c6790

The level view shows the same data but based on cache levels:

   $ pfmon --smpl-module=dear-hist --smpl-show-top=10 --smpl-show-func --smpl-level-view \
     -edata_ear_cache_lat4 --long-smpl-periods=10000 --smpl-periods-random=0xff:5 foo
   # total_samples 9793
   # level view
   # sorted by count
   # showing per function histogram
   # L2   :  5 cycles load latency
   # L3   : 14 cycles load latency
   # #count   %self    %cum lat(cycles) lat(ns)
       4355  44.47%  44.47%           5       4
       4176  42.64%  87.11%          11       8
        732   7.47%  94.59%          14      10
        146   1.49%  96.08%           7       5
         64   0.65%  96.73%          17      12
         47   0.48%  97.21%          18      13
         45   0.46%  97.67%          16      11
         42   0.43%  98.10%          15      11
         34   0.35%  98.45%           9       6
         15   0.15%  98.60%          19      14

This histogram shows that for this program about 44% of the misses have a latency of 5 cycles, i.e., they hit in the L2. Another 43% have a latency of 11 cycles, and about 7% have a latency of 14 cycles, i.e., they hit in the L3.
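To relate the level view to the %L2, %L3 and %RAM columns of the other views, one can bucket the latencies against the two values printed in the header. The sketch below assumes that a miss counts as an L2 hit when its latency is at most 5 cycles, as an L3 hit when it is at most 14 cycles, and as a memory access otherwise. The exact rule applied by the module is not stated in this document, so these thresholds and the names `classify` and `bucket_counts` are assumptions for illustration only.

```python
# Sketch: bucket load latencies into L2 / L3 / RAM.
# ASSUMPTION: inclusive thresholds taken from the "L2: 5 cycles" and
# "L3: 14 cycles" header lines; the module's real rule may differ.

L2_LATENCY = 5
L3_LATENCY = 14


def classify(latency_cycles):
    """Return the assumed level that resolved a miss of this latency."""
    if latency_cycles <= L2_LATENCY:
        return "L2"
    if latency_cycles <= L3_LATENCY:
        return "L3"
    return "RAM"


def bucket_counts(rows):
    """Sum sample counts per assumed level for (count, latency) rows."""
    totals = {"L2": 0, "L3": 0, "RAM": 0}
    for count, latency in rows:
        totals[classify(latency)] += count
    return totals


# (count, latency in cycles) rows copied from the level view above.
level_view = [(4355, 5), (4176, 11), (732, 14), (146, 7), (64, 17),
              (47, 18), (45, 16), (42, 15), (34, 9), (15, 19)]

print(bucket_counts(level_view))
```

Only the top 10 rows of the level view are used here, so the totals do not add up to total_samples.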

Similar information can be obtained for TLB misses, for both data and code. The sampling event determines the type of the output.


12. References

The Itanium2 PMU is described in detail in the micro-architecture manual entitled 'Intel Itanium 2 Processor Reference Manual for Software Development and Optimization'.

Additional information can be found in the IA-64 architecture manuals.

All the documents are available from Intel Developer's web site.