Artifact Description: Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis
Jan Laukemann, University of Erlangen-Nürnberg, jan.laukemann@fau.de
The creation of performance models is an essential part of optimizing scientific software. To run static performance analyses on code snippets, it is crucial to obtain an accurate in-core execution time, which depends heavily on the processor's micro-architecture.
Our previously developed tool Open Source Architecture Code Analyzer (OSACA) is a static performance analyzer for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) micro-architectures and a simple, full-throughput prediction. We substantially extended its functionality with the detection of dependencies within and across assembly loop iterations to identify the critical path and loop-carried dependencies. This enables a much improved runtime prediction for steady-state execution. Furthermore, we enhanced the throughput prediction and added support for ARM-based micro-architectures, which turns OSACA into a versatile cross-architecture modeling tool. While its throughput and loop-carried dependency analyses give a lower-bound runtime prediction, its critical path analysis can serve as an upper bound for the execution time.
We evaluate the quality of the analysis for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements and the analysis results from the related tools Intel Architecture Code Analyzer (IACA) and LLVM Machine Code Analyzer (LLVM-MCA). The comparison shows that OSACA is to date the most capable and versatile in-core runtime prediction tool available.
- Compilation: icc, ifort, gcc, gfortran, armclang, armflang
- Binary: x86, ARM aarch64
- Hardware:
- Marvell ThunderX2
- Intel Cascade Lake
- AMD Zen
Check out https://github.com/RRZE-HPC/OSACA
We ran on an AMD EPYC 7451 (Zen architecture) at 2.3 GHz (fixed, turbo disabled), an Intel Xeon Gold 6248 (Cascade Lake SP architecture) at 2.5 GHz (fixed, turbo disabled), and an ARM-based Marvell ThunderX2 9980 at 2.2 GHz (natively fixed). The results should be reproducible on any Zen1, Skylake SP, or ThunderX2 processor. Fixing the frequency and disabling turbo mode is vital for reproducing the experiments.
- Python >= 3.5, with the OSACA package (v0.3.2.dev5) installed
- likwid (for fixing the frequency)
- IACA v3.0
- LLVM-MCA 9
None necessary; everything is part of the code.
On Ubuntu 18.04 install OSACA and likwid with:
apt install python3 python3-pip likwid libgraphviz-dev
pip3 install osaca==0.3.2.dev5 pygraphviz kerncraft
IACA and LLVM-MCA are available from the vendors' websites.
To reproduce the assembly kernels used for the analysis, use the following commands.
Download this repository including all scripts and benchmark codes:
git clone https://github.com/RRZE-HPC/OSACA-Artifact-Appendix
cd OSACA-Artifact-Appendix/
Fixing the frequency and disabling turbo mode is essential for verifying our results. Fix the frequency and disable turbo mode on the CPU (here for 2.3 GHz, or whichever frequency your CPU runs stably at):
likwid-setFrequencies -t 0 -f 2.3
Generate performance results (must be done on AMD Zen, Intel Cascade Lake SP and Marvell ThunderX2 machines) with
./run_benchmarks.sh [ARCH]
The parameter ARCH can be either CSX, ZEN1 or TX2.
Note that for this we expect the commands icc, ifort, gcc, gfortran, armclang, and armflang to be part of the environment.
Make sure to have the assembly output created (e.g., by running run_benchmarks.sh first) and the kernel marked.
At the time of writing, the ARM byte markers must be inserted by hand.
The OSACA markers for both x86 and ARM can be inserted as comment lines containing
// OSACA-BEGIN
directly before the loop and
// OSACA-END
directly after the loop (note that the comment symbol may differ between ISAs).
For adding x86 byte markers, one may use:
osaca --insert-marker --arch ARCH file.s
These byte markers are recognized by both OSACA and IACA; therefore, we recommend using this command on Intel CSX. Additionally, one must add the LLVM-MCA markers in the following format:
# LLVM-MCA-BEGIN
...
# LLVM-MCA-END
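Putting both marker styles together, a marked x86 kernel might look like the sketch below (the loop body, label, and registers are made-up placeholders, not one of the actual benchmark kernels; on x86, `#` starts a comment):

```asm
# OSACA-BEGIN
# LLVM-MCA-BEGIN
.L4:                                    # hypothetical loop label
        vmovupd (%rsi,%rax,8), %ymm0    # load
        vaddpd  %ymm1, %ymm0, %ymm0     # add
        vmovupd %ymm0, (%rdi,%rax,8)    # store
        addq    $4, %rax
        cmpq    %rdx, %rax
        jne     .L4
# LLVM-MCA-END
# OSACA-END
```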
All marked assembly files can also be found in the kernel-specific directories in thesis_analysis_reports/.
Since the -mcpu=thunderx2t99 flag is only known to the ARM compiler, the prediction generation for LLVM-MCA on Marvell ThunderX2 must be done separately from all other runs. Run
./run_predictions.sh ISA
The parameter ISA can be either aarch64 (for running LLVM-MCA on TX2) or x86 (for running the rest).
Note that for this we expect the commands llvm-mca (for all runs) and osaca, iaca, and gcc (for the x86 run) to be part of the environment.
The evaluation script expects a fixed frequency of 2.5 GHz on CSX, 2.3 GHz on Zen and 2.2 GHz on TX2.
If the measurements were obtained with any different clock frequency, one must edit the FREQ variable in the script.
It further expects, for each kernel, the unrolling factor stated in the thesis. If the generated assembly kernels differ in their unrolling factor (e.g., due to a different compiler version), please adjust unrolling_factor_dict.py in the scripts directory accordingly.
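The exact contents of scripts/unrolling_factor_dict.py come from the repository and the thesis; as a rough, hypothetical sketch of its expected shape (the kernel names and factors below are made up), it maps each kernel/architecture combination to the unroll factor observed in the generated assembly so that per-loop cycle counts can be normalized to per-iteration numbers:

```python
# Hypothetical sketch of scripts/unrolling_factor_dict.py -- the real
# kernel names and factors must be taken from the repository and thesis.
unrolling_factor = {
    # (kernel, architecture): unroll factor in the compiler-generated assembly
    ("triad", "CSX"): 4,   # made-up example entry
    ("triad", "TX2"): 2,   # made-up example entry
}

def cycles_per_iteration(cycles_per_loop, kernel, arch):
    """Normalize a per-assembly-loop cycle count by the unroll factor."""
    return cycles_per_loop / unrolling_factor[(kernel, arch)]
```

If a compiler unrolls, say, four iterations into one assembly loop, 8 cycles per loop correspond to 2 cycles per original iteration; keeping the dictionary in sync with the actual assembly is what the adjustment above is about.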
Run the evaluation with
./run_evaluation.sh
It writes a file summary_table.csv containing the performance measurements and analysis results in the format of Table B.1 in the thesis.
Compare numbers to Table B.1.
We expect these numbers to lie within 10% of those in the paper if run on the same micro-architectures as mentioned. If your numbers are significantly faster, turbo mode or frequency scaling might be the reason. If they are slower and you are running on a laptop or desktop machine, energy-saving features may have interfered.
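The 10% criterion can also be checked mechanically. A minimal sketch (the helper function and the example values are illustrative, not taken from Table B.1) compares a measured value against the corresponding thesis value using relative deviation:

```python
def within_tolerance(measured, reference, tol=0.10):
    """True if measured deviates from reference by at most tol (relative)."""
    return abs(measured - reference) <= tol * abs(reference)

# Illustration: 2.4 cy/it measured vs. 2.5 cy/it in the thesis is a 4%
# deviation and passes; 3.0 cy/it vs. 2.5 cy/it is a 20% deviation and fails.
```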
Furthermore, all results used for obtaining the performance predictions can be found in the thesis_analysis_reports/ directory.