This is the first post in the series “Hunting Performance in Python Code”. In each post I’ll present some of the tools and profilers that exist for Python code, and how each of them helps you find bottlenecks, both in the frontend (Python scripts) and in the backend (the Python interpreter).
Series index
The links below will go live once the posts are released:
Setup
Before diving into benchmarking and profiling, we first need a proper environment. This means that both the machine and the operating system must be configured for this task.
At a glance, my machine has the following specs:
- Processor: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
- Memory: 32GB
- OS: Ubuntu 16.04 LTS
- Kernel: 4.4.0-75-generic
The goal is to have reproducible results, making sure that our measurements are not affected by background processes, the operating system configuration, or hardware performance-enhancing technologies.
Let’s start with the configuration of the machine that we use for profiling.
Hardware features
First of all, disable any hardware performance features. This means disabling Intel Turbo Boost and Hyper-Threading in the BIOS/UEFI.
As presented in the official page, Turbo Boost is “a technology that automatically allows processor cores to run faster than the rated operating frequency if they’re operating below power, current, and temperature specification limits”. On the other hand, Hyper Threading is “a technology which uses processor resources more efficiently, enabling multiple threads to run on each core”, as stated here.
These are good features that we paid for and certainly want in production. So why is it bad to have them enabled while profiling/benchmarking? Because they make results unreliable and non-reproducible, which translates into run-to-run variation. Let’s see this in a small example, called primes.py, intentionally written poorly 🙂
import time
import statistics

def primes(n):
    if n==2:
        return [2]
    elif n<2:
        return []
    s=range(3,n+1,2)
    mroot = n ** 0.5
    half=(n+1)/2-1
    i=0
    m=3
    while m <= mroot:
        if s[i]:
            j=(m*m-3)/2
            s[j]=0
            while j<half:
                s[j]=0
                j+=m
        i=i+1
        m=2*i+3
    return [2]+[x for x in s if x]

def benchmark():
    results = []
    gstart = time.time()
    for _ in xrange(5):
        start = time.time()
        count = len(primes(1000000))
        end = time.time()
        results.append(end-start)
    gend = time.time()
    mean = statistics.mean(results)
    stdev = statistics.stdev(results)
    perc = (stdev * 100)/ mean
    print "Benchmark duration: %r seconds" % (gend-gstart)
    print "Mean duration: %r seconds" % mean
    print "Standard deviation: %r (%r %%)" % (stdev, perc)

benchmark()
The code is also available on GitHub here. As a dependency (on Python 2, statistics is a PyPI backport of the standard-library module introduced in Python 3.4), you will need to run:
pip install statistics
Let’s run it in a system that has Turbo Boost and Hyper Threading enabled:
$ python primes.py
Benchmark duration: 1.0644240379333496 seconds
Mean duration: 0.2128755569458008 seconds
Standard deviation: 0.032928838418120374 (15.468585914964498 %)
Now on the same system, but with Turbo Boost and Hyper Threading disabled:
$ python primes.py
Benchmark duration: 1.2374498844146729 seconds
Mean duration: 0.12374367713928222 seconds
Standard deviation: 0.000684464852339824 (0.553131172568 %)
Observe the standard deviation in the first case: 15%. This is a HUGE value! Suppose you make an optimization that brings a 6% speedup: how will you distinguish your improvement from ordinary run-to-run variation?
In the second case, by contrast, the variation is reduced to approximately 0.6%. Your shiny new optimization will be visible!
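One way to make this comparison systematic is to treat the standard deviation as a percentage of the mean (the coefficient of variation) and only trust a speedup that is clearly larger than it. A minimal Python 3 sketch; the helper names and the rule of thumb are my own, not part of the original benchmark:

```python
import statistics

def variation_pct(timings):
    """Standard deviation as a percentage of the mean (coefficient of variation)."""
    return statistics.stdev(timings) * 100 / statistics.mean(timings)

def is_significant(speedup_pct, timings):
    """A rough rule of thumb: only trust a speedup that exceeds
    the measured run-to-run variation of the baseline."""
    return speedup_pct > variation_pct(timings)

# Timings with spread similar to the Turbo Boost/Hyper-Threading run above:
noisy = [0.25, 0.21, 0.18, 0.20, 0.22]
# Timings with spread similar to the run with both features disabled:
stable = [0.1238, 0.1237, 0.1238, 0.1236, 0.1237]

print("noisy run variation: %.1f%%" % variation_pct(noisy))    # a 6% win drowns here
print("stable run variation: %.3f%%" % variation_pct(stable))  # a 6% win is clearly visible
```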
CPU power savings
Disable any CPU power-saving features and use a fixed CPU frequency. This can be done by switching the Linux frequency scaling driver from intel_pstate to acpi-cpufreq.
The intel_pstate driver implements a scaling driver with an internal governor for Intel Core (Sandy Bridge and newer) processors. The acpi_cpufreq driver utilizes the ACPI Processor Performance States.
Let’s check it out first!
$ cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 0.97 ms.
hardware limits: 1.20 GHz - 3.60 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 1.20 GHz and 3.60 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz.
boost state support:
Supported: yes
Active: yes
You can see that the governor in use is powersave and that the CPU frequency scales between 1.20 GHz and 3.60 GHz. That is fine for a personal computer or any other day-to-day usage, but it hurts the results when running benchmarks.
What are the possible values for the governor? If we browse the documentation we see that we can use the following:
- performance: run the CPU at the maximum frequency.
- powersave: run the CPU at the minimum frequency.
- userspace: run the CPU at user-specified frequencies.
- ondemand: scales the frequency dynamically according to the current load; jumps to the highest frequency and then possibly backs off as the idle time increases.
- conservative: scales the frequency dynamically according to the current load; scales the frequency more gradually than ondemand.
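If you prefer to query the current state from Python rather than cpupower, the per-core governor is also exposed through sysfs. A small sketch, assuming the standard Linux cpufreq layout; the base path is a parameter only so the function can be pointed at a test tree:

```python
from pathlib import Path

def read_governors(base="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} by reading the cpufreq sysfs entries."""
    governors = {}
    for gov_file in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu = gov_file.parts[-3]  # e.g. "cpu0"
        governors[cpu] = gov_file.read_text().strip()
    return governors

if __name__ == "__main__":
    # On a machine without cpufreq support this simply prints nothing.
    for cpu, gov in sorted(read_governors().items()):
        print(cpu, gov)
```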
What we want to use is the performance governor and set the frequency at the maximum one supported by the CPU. Something like this:
$ cpupower frequency-info
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 10.0 us.
hardware limits: 1.20 GHz - 2.30 GHz
available frequency steps: 2.30 GHz, 2.20 GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz
available cpufreq governors: conservative, ondemand, userspace, powersave, performance
current policy: frequency should be within 2.30 GHz and 2.30 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 2.30 GHz.
cpufreq stats: 2.30 GHz:100.00%, 2.20 GHz:0.00%, 2.10 GHz:0.00%, 2.00 GHz:0.00%, 1.90 GHz:0.00%, 1.80 GHz:0.00%, 1.70 GHz:0.00%, 1.60 GHz:0.00%, 1.50 GHz:0.00%, 1.40 GHz:0.00%, 1.30 GHz:0.00%, 1.20 GHz:0.00% (174)
boost state support:
Supported: no
Active: no
Now you are using the performance governor with a fixed frequency of 2.30 GHz. This value is the maximum possible, without Turbo Boost, on a Xeon E5-2699 v3.
To set everything up, run the following commands with administrative privileges:
cpupower frequency-set -g performance
cpupower frequency-set --min 2300000 --max 2300000
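The same settings can also be scripted by writing directly to sysfs, which is where cpupower ultimately applies them. A hedged sketch (the function name is my own, and root privileges are required to write these files; the base path is a parameter only so it can be exercised against a fake tree):

```python
from pathlib import Path

def pin_frequency(governor="performance", khz=2300000,
                  base="/sys/devices/system/cpu"):
    """Set the scaling governor and pin the min/max frequency (in kHz)
    on every core by writing the cpufreq sysfs files. Requires root."""
    for cpufreq in Path(base).glob("cpu[0-9]*/cpufreq"):
        (cpufreq / "scaling_governor").write_text(governor)
        (cpufreq / "scaling_min_freq").write_text(str(khz))
        (cpufreq / "scaling_max_freq").write_text(str(khz))

# Usage (as root), equivalent to the two cpupower commands above:
#   pin_frequency("performance", 2300000)
```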
If you don’t have cpupower, install it using:
sudo apt-get install -y linux-tools-common linux-tools-$(uname -r)
The frequency scaling driver has a great impact on how a CPU is used. By default, intel_pstate scales the frequency automatically to reduce power consumption. We do not want this on our benchmarking system, so we disable the driver from GRUB. Either edit /boot/grub/grub.cfg directly (but be careful: a kernel upgrade will overwrite this file) or create a new kernel entry in /etc/grub.d/40_custom. The boot line must contain the flag intel_pstate=disable, like this:
linux /boot/vmlinuz-4.4.0-78-generic.efi.signed root=UUID=86097ec1-3fa4-4d00-97c7-3bf91787be83 ro intel_pstate=disable quiet splash $vt_handoff
ASLR (Address Space Layout Randomization)
This setting is controversial, as you can also see in Victor Stinner’s post. When I first suggested disabling ASLR for benchmarks, it was in the context of further improving the Profile Guided Optimization support that existed in CPython at that time.
What led me to that statement is the fact that, on the particular hardware presented above, disabling ASLR reduces run-to-run variation to 0.4%!
On the other hand, on my personal computer (which has an Intel Core i7-4710MQ), disabling ASLR led to the same issues presented by Victor. Testing on even smaller CPUs (like an Intel Atom) gave me even more run-to-run variation, instead of reducing it.
Since this is not a universal truth and depends greatly on the hardware/software configuration, the takeaway is to measure with ASLR enabled, measure again with it disabled, and compare the results.
On my machine, I have it disabled globally by adding the following line to /etc/sysctl.conf (apply it with sudo sysctl -p):
kernel.randomize_va_space = 0
If you want to disable it at runtime:
sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'
If you want to enable it again:
sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'
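You can observe the effect of this setting from Python itself: with ASLR enabled, the address at which libc’s malloc is mapped changes on every run, while with randomize_va_space set to 0 it stays the same between runs. A small illustrative sketch (Linux/macOS only, since it loads symbols from the current process):

```python
import ctypes

def malloc_address():
    """Return the address of libc's malloc in the current process."""
    libc = ctypes.CDLL(None)  # handle to the current process image
    return ctypes.cast(libc.malloc, ctypes.c_void_p).value

print(hex(malloc_address()))
# Run this script several times: with ASLR enabled the printed address
# differs between runs; with kernel.randomize_va_space = 0 it is constant.
```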
By Alecsandru Patrascu, alecsandru.patrascu [at] rinftech [dot] com