CrayPAT
Recipes for Profiling with CrayPAT 4.2
Simple Profiling
Application Instrumentation with pat_build
- No source code or makefile modification required
- Automatic instrumentation at group (function) level
- Groups: mpi, io, heap, math SW, user functions …
- Performs link-time instrumentation
- Requires object files
- Instruments optimized code
- Generates stand-alone instrumented program
- Preserves original binary
- Supports sample-based and event-based instrumentation
Use the following steps to profile your code:
1. Remove all object files and any user libraries that you want profiled. (A make clean will usually do this.)
2. Issue the command module load xt-craypat.
3. Compile your code with the options you usually use.
4. Rebuild your code, probably via make.
   - You must rebuild to ensure that the proper symbols are put into the code for profiling.
5. Run the pat_build command to build an instrumented executable. The command has the form pat_build [-w -u -g <group>] -O apa a.out [a.out+pat].
   - Tracing is currently the only option.
   - -u: Traces all user functions.
   - -g: Possible groups are mpi, io, and heap (blas and lapack are coming).
   - The original .o files must still be present.
6. Run the instrumented executable, for example aprun -n 160 a.out+pat.
7. Run pat_report on the data file, for example pat_report <datafile>.xf.
   - We highly recommend running pat_report -f ap2 <datafile>.xf to create a compressed .ap2 file that can be used as input to pat_report or Cray Apprentice2.
   - The .ap2 file is portable; it can be moved to any other machine with Cray Apprentice2 and have reports run there. That cannot be done with .xf files.
8. See man pat_report for details.
See “Performance Measurement and Visualization on the Cray XT” (pdf) for a more complete description of how to use CrayPAT.
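The recipe above can be strung together in a small shell script. The following is only a dry-run sketch: the helper names run and profile_recipe, the DRY_RUN variable, the choice of the mpi group, and the data file name mydata.xf are assumptions made for illustration and are not part of CrayPAT. By default it prints each command instead of executing it, so it can be read and tried on a machine without the Cray tools.

```shell
#!/bin/sh
# Dry-run sketch of the profiling recipe. Every command is only printed by
# default; set DRY_RUN=0 on a system where CrayPAT is installed to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

profile_recipe() {
    exe="$1"     # executable name, e.g. a.out
    npes="$2"    # number of processes passed to aprun
    xf="$3"      # data file written by the instrumented run (name assumed here)

    run make clean                                      # step 1: remove old object files
    run module load xt-craypat                          # step 2: load the CrayPAT module
    run make                                            # steps 3-4: rebuild with the usual options
    run pat_build -w -u -g mpi "$exe" "${exe}+pat"      # step 5: build the instrumented executable
    run aprun -n "$npes" "${exe}+pat"                   # step 6: run it
    run pat_report "$xf"                                # step 7: text report
    run pat_report -f ap2 "$xf"                         # step 7 (recommended): portable .ap2 file
}

profile_recipe a.out 160 mydata.xf
```

In a real run the name of the .xf file is only known after step 6, so the last two commands would normally be issued by hand once the data file has appeared.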
Simple Hardware Performance Counter Data
1. If you don’t have an instrumented code, complete steps 1–5 as above. If you have already done steps 1–5, go on to the next step.
2. Set the PAT_RT_HWPC environment variable to a value from 1 to 9. The first four groups are:
   - 1: FP, LS, L1 Misses & TLB Misses
   - 2: L1 & L2 Data Accesses and Misses
   - 3: L1 Accesses, Misses, and Bandwidth
   - 4: Floating Point Mix
3. Run the instrumented code again.
4. Run pat_report.
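As a rough illustration of the counter recipe, the sketch below sets PAT_RT_HWPC and reruns an already instrumented executable. The function name hwpc_run, the run helper, and the DRY_RUN variable are hypothetical; only PAT_RT_HWPC, its 1 to 9 range, and the aprun command form come from the text above. Commands are printed rather than executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run sketch: select a hardware counter group, then rerun the
# instrumented executable. Set DRY_RUN=0 to really launch the job.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

hwpc_run() {
    group="$1"   # counter group, 1 to 9
    exe="$2"     # instrumented executable, e.g. a.out+pat
    npes="$3"    # number of processes passed to aprun

    # Reject anything that is not a single digit from 1 to 9.
    case "$group" in
        [1-9]) ;;
        *) echo "PAT_RT_HWPC must be between 1 and 9, got: $group" >&2
           return 1 ;;
    esac

    PAT_RT_HWPC="$group"
    export PAT_RT_HWPC
    run aprun -n "$npes" "$exe"
}

# Group 1: FP, LS, L1 Misses & TLB Misses
hwpc_run 1 a.out+pat 160
```

After the run finishes, pat_report is applied to the new data file in the same way as in the simple profiling recipe.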
Overview of CrayPAT Tools
Profiling
Profiling with the Cray tools requires multiple steps. Unlike on the X1E, you must recompile your code. First, to use the Cray profiling tools, load the craypat module, for example with module load craypat. Then recompile your code with ftn or cc (the Cray wrappers) to link in the appropriate Cray performance tools and libraries. If you compile directly with pgf95 or pgcc, the Cray performance libraries are not linked in automatically. Furthermore, if you use Fortran modules, you no longer have to compile and link your code with -Mprof=func to get a proper profile. The -Mprof=func option should not be used anymore.
Two important man pages to check are pat and pat_build. Note that the Fortran application programming interface (API) is similar to the C API. All Fortran routines accept an additional argument for the status of the call (which in C is provided as the return value).
pat_build
Builds an instrumented version of an executable code.
> pat_build [options] <executable> <instrumented executable>
Supports
- Fortran, C, C++
- MPI, SHMEM
Performance measurements
- Trace based
- User functions
- API for fine-grain instrumentation
- Predefined function groups (mpi, shmem, io, etc.)
- Source code mapping
- Call stack
- Line numbers
pat_run
User interface to simplify CrayPAT usage. Runs an instrumented executable and generates a report, all in one step.
The following executes a.out+pat and produces a report measuring the number of floating point operations, calculating the mflop rate, and determining the average number of results produced per vector operation for the traced functions.
pat_run -O flops,mflops,vl yod -sz 1 a.out+pat
The following produces a load-balance report showing average versus maximum time per processor (based on wall-clock time) for an MPI program:
pat_run -O balance yod -sz 4 a.out+pat
The “-O” option is a comma-separated list of keywords that specify the following:
- How to record it: trace
- Show callers: callers, calltree
- Show source/line number: source, line
- Show load balance: balance[.$data][.$by]
   - $data can be samples or time (default), cycles, etc.
   - $by can be pe (default), thread, or ssp
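To make the keyword syntax concrete, the sketch below assembles a -O argument from individual keywords and prints the resulting pat_run command line. The helper join_keywords and the variable opts are hypothetical, and the particular combination of keywords is chosen only to show the comma-separated form and the balance[.$data][.$by] pattern; whether pat_run accepts this exact combination has not been checked here.

```shell
#!/bin/sh
# Join the arguments with commas to form a single -O keyword list.
join_keywords() {
    old_ifs="$IFS"
    IFS=','
    joined="$*"      # "$*" joins the arguments with the first character of IFS
    IFS="$old_ifs"
    echo "$joined"
}

# balance.time.pe spells out the defaults: $data = time, $by = pe.
opts="$(join_keywords trace callers source balance.time.pe)"

# Only print the command line; nothing is executed.
echo "pat_run -O $opts yod -sz 4 a.out+pat"
```

Writing the defaults out explicitly, as in balance.time.pe, is optional according to the bracketed form above; plain balance should request the same report.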
Examples
To get a basic profile run, use the following:
pat_run -b [pe,]function:source,line [-s percent=relative] yod -sz 1 <instrumented executable>
In the output file, look for a table like the following:
100.0% | 100.0% | 965 |Total
|-------------------------------------
| 88.2% | 88.2% | 851 |kron_matmull@module_kron_
||------------------------------------
|| 40.4% | 40.4% | 344 |line.307
|| 37.0% | 77.4% | 315 |line.297
To get a calltree run, use the following:
pat_run -b [pe,]function:source,calltree [-s percent=relative] yod -sz 1 <instrumented executable>
pat_report
You can directly run an instrumented executable with yod, which will produce a performance-data file (ending in .xf). This file can then be processed into a human-readable text profile using the pat_report command.
Experiment Types
There is only one type of performance experiment that you can run: trace.
See the pat man page for more information.
Run-Time Library
Use the PAT run-time library to get statistics on a specific region of code.
Example
program test_module_kron
  use pat_api
  integer ierr
  …
  ! Begin region of interest
  ! (# and name must be unique to each region)
  call PAT_region_begin ( 1, 'kron_matmul_kernel', ierr )
  call kron_matmulL(…)
  ! End region of interest
  call PAT_region_end ( 1, ierr )
end program
Compile
ftn *.f -o test.exe
Relink
pat_build -w test.exe test.exe+pat
Run and produce a report
pat_run -g normal [-b function,ssp=HIDE] yod -sz 1 test.exe+pat
Apprentice2 Visualizer
Apprentice2 is targeted to help identify and correct
- Excessive communication
- Network contention
- Load imbalance
- Excessive serialization
Supports
- Call graph profile
- Communication statistics
- Timeline view (PAT_RT_SUMMARY must be set to 0 before running the instrumented code.)
   - Communication
   - I/O
- Activity view
- Pairwise communication statistics
- Text reports
- Source code mapping
Apprentice2 (invoked with app2) takes as input an XML file. The input file is generated as follows:
module load apprentice2
pat_report -c records -f ap2 <perf.file>.xf > <perf.file>.ap2
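Putting the pieces of this section together, a trace run intended for the timeline view might look like the sketch below. The function name ap2_prepare, the run helper, the DRY_RUN variable, and the file name run1.xf are assumptions for illustration, and passing the .ap2 file to app2 as a command-line argument is also assumed rather than taken from the text. Commands are printed, not executed, unless DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run sketch: collect full trace records and convert them for Apprentice2.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

ap2_prepare() {
    exe="$1"               # instrumented executable, e.g. a.out+pat
    xf="$2"                # .xf data file produced by the run (name assumed)
    ap2="${xf%.xf}.ap2"    # same base name with the .ap2 suffix

    # The timeline view needs PAT_RT_SUMMARY=0 set before the run.
    PAT_RT_SUMMARY=0
    export PAT_RT_SUMMARY

    run yod -sz 1 "$exe"
    run module load apprentice2

    # The redirection is handled here because run cannot pass it through.
    echo "+ pat_report -c records -f ap2 $xf > $ap2"
    if [ "$DRY_RUN" = "0" ]; then
        pat_report -c records -f ap2 "$xf" > "$ap2"
    fi

    run app2 "$ap2"        # assumption: app2 takes the file as an argument
}

ap2_prepare a.out+pat run1.xf
```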
Visualization is possible with both profiles and trace files, but Apprentice2 has less functionality with profiles. The following features are supported for profiles (run-time summaries):
- Call graph view
- Function statistics overview
- Function report
- Processing element (PE) breakdown
- General information
Hardware Performance Counters
pat_hwpc
pat_hwpc collects hardware performance counter information for an application. No instrumentation is required. Usage is as follows:
pat_hwpc [options] yod <executable>
pat_hwpc accepts various hardware counter groups and produces a report with raw counts and derived metrics for the whole execution. The hardware counters are summed across all threads in each process. See the pat_hwpc man page.
