Proof of Concept: Implement asv for profiling+comparison #580
Conversation
@@ -0,0 +1,103 @@
{
    // The version of the config file format. Do not change, unless
The GitHub syntax highlighting is going nuts on this - is this a variation of JSON that allows comments?
I guess? I just used the pandas version as a template.
asv_bench/benchmarks/parsing.py (outdated)

class Parsing(object):
    def time_parse_iso(self):
        for n in range(1000):
Do we need to repeat it many times, or can we configure asv to do that?
I'm curious how big a role caching is going to play here and how misleading that will be.
asv does run each of the benchmarks multiple times, I think targeting a total run time of 0.1 seconds by default (configurable). I haven't mastered the semantics it uses. In some circumstances asv spawns subprocesses to avoid caching influencing the results.
Originally I didn't include these loops at all, but in the profiling it only ran once and I wanted a bigger sample.
@jbrockmendel Interesting. I wonder if it is better to just trust asv to run the task multiple times as it sees fit, or if it's better to do some averaging on top of its averaging.
My suspicion is that we'll want the bulk of our tests to be single-run tests and let asv be smart about how to run them, but since caching is a reality we'll want to deliberately induce it with some sort of loop at times.
@pablogsal Do you have any opinions on this? From our conversations in London it seemed like you have some experience with profiling (or at least timing).
asv uses the Python timeit module and some of IPython's %timeit magic, so I would not consider this reliable/reproducible for timings of less than 0.1 seconds. [1] [2]
< TLDR >
The TLDR is that you have to select what to measure: average execution time or "platonic" execution time (the fastest, and usually not very representative). When you do this you need to understand how timeit, %timeit, and other tools measure. If your individual executions are fast, there is a ton of noise that affects the measurements.
< /TLDR >
If you don't want to limit the noise (which is very complicated), then another approach is to limit/understand the effect of the noise using statistics. The distribution of the time measurements is skewed; assume it is roughly of the form P(X = x) = a * e^(-a(x - b)) for x > b and 0 otherwise (where X is a random variable and a, b are real numbers). The typical procedure to obtain stable benchmarks by limiting the noise is then to measure the execution n times, take the mean of those values, and repeat this r times.

The distribution representing the means is Y = (X1 + X2 + ... + Xn)/n. By the central limit theorem, Y is asymptotically distributed as N(mu, sigma), the normal distribution with mean mu and standard deviation sigma. This is a profound result: you don't need to worry about the skewness or the higher central moments of the distribution of errors, because the means of the execution times are always asymptotically normally distributed. Reporting the mean and the standard deviation of a statistically significant number of experiments is therefore enough, since the limiting normal distribution is fully characterized by those two numbers. Notice that this holds even if the errors all fall on one side of the true execution time.

This is answering a different question, namely "what is the most representative execution time on this particular machine", as opposed to "what is the platonic execution time for this set of instructions". IMHO users want reproducibility and some sense of stability, even under noisy and unstable conditions.
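A minimal sketch of that measure-n-times-and-average procedure, using only the standard library (the helper name, the statement being timed, and the n/r values are illustrative, not anything asv provides):

import statistics
import timeit

def timed_mean_and_std(stmt, setup, n=1000, r=30):
    # repeat() returns r totals, each the time for n back-to-back executions;
    # dividing by n turns each total into one sample mean.
    totals = timeit.Timer(stmt, setup=setup).repeat(repeat=r, number=n)
    sample_means = [t / n for t in totals]
    # Per the central limit theorem, the sample means are approximately
    # normal, so mean +/- standard deviation is a reasonable summary.
    return statistics.mean(sample_means), statistics.stdev(sample_means)

mean, std = timed_mean_and_std("parse('2017-09-12')",
                               setup="from dateutil.parser import parse")
print("%.3e s +/- %.3e s per call" % (mean, std))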
I think it also depends on how serious you want this to be. A lot of people use asv because they don't care that much about the inexactitudes, or because they don't know better. That said, asv does a lot of good work to set up the environment and produce proper metrics (you can select warmup_time, repeat, number, and sample_time among other options), which I think you can use to get very close to stable benchmarking. Another thing you probably want to do on the machine you run this on is to deactivate frequency switching on the CPU that you are using and isolate one CPU core for running the benchmarks.
If you use asv you should try to keep the benchmarking functions as short as possible, leaving the measurement and the control to asv itself.
@pablogsal I think the goal here is mostly going to be differential benchmarks. I think for now we can take the performance metric to be "does set of instructions X perform faster than set of instructions Y", and the competing interests are (likely in order of prevalence):
A.) On the average machine, does it perform better in a tight loop
B.) On the average machine, does it perform better when high-latency is needed
C.) Does it perform better in resource-constrained environments.
I imagine that what we'll be doing with asv is trying to find a model that gives approximate answers to these questions given that it will be run on only one machine. For practical purposes, I think that we can easily accept errors that are on the order of magnitude of the natural variation of the project (e.g. if some machines perform faster with instructions X and some faster with instructions Y, or 1 STD of the variation in the differential performance on various common inputs).
Given imperfect tools to do this (and given that I am not prepared to maintain a separate benchmarking tool), I'd like to use the best configuration for asv (or another standard benchmarking tool if there's a better one) that optimizes these goals and lets me make actionable decisions about algorithm choices, even if the decision is "the difference between these is not statistically significant".
So, given that, do you have thoughts about whether we should write tests like:
def time_parse_iso(self):
    for n in range(1000):
        parse('2017-09-12')

so that we can add some averaging into the process and make it effectively "longer", or should we write tests like:

def time_parse_iso(self):
    parse('2017-09-12')

and trust that asv will figure out the best way to benchmark them for us?
If the main concern is the way that asv does things, probably it's best for us to open an issue and possibly a PR there.
@pganssle IMHO you should write tests like this:

def time_parse_iso(self):
    parse('2017-09-12')

and configure warmup_time, repeat, number, and sample_time (in asv) accordingly to have enough samples and a good average.
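For instance, a benchmark class could set those knobs as asv timing attributes rather than looping by hand (the class name and the specific values here are illustrative placeholders, not recommendations):

from dateutil.parser import parse

class TimeISOParsing(object):
    # asv reads these attributes to control how the benchmark is sampled;
    # the values are examples only.
    warmup_time = 0.5   # seconds of untimed warm-up before sampling
    number = 100        # executions per sample
    repeat = 20         # samples collected per benchmark run
    sample_time = 0.25  # target wall-clock seconds per sample

    def time_parse_iso(self):
        parse('2017-09-12')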
> If the main concern is the way that asv does things, probably it's best for us to open an issue and possibly a PR there.

I opened a couple last week, very wary of ...

> and given that I am not prepared to maintain a separate benchmarking tool

The subprocess indirection asv uses makes it tough to extend the API (at least for one newbie), but I'm cautiously hopeful that some of those rough edges may get sanded down.
dateutil/parser/_parser.py (outdated)

        return self._weekdays[name.lower()]
    except KeyError:
        pass
    try:
Is this part of the PoC or something you actually want merged?
PoC to demonstrate the asv continuous comparison. It should go into a separate PR; it's a pretty nice perf improvement.
I'll revert this part in the next draft.
@jbrockmendel Looks great so far, I like it. I'm thinking we should add it as an allowed-failure job in the Travis CI, unless there's another CI provider that might be better.

@TomAugspurger any thoughts on best practices here?
I'm not sure about running asv on CI workers. Everything I've heard is that it's too noisy to be reliable.

We could add dateutil to https://github.com/tomaugspurger/asv-runner and start tracking them on the pandas benchmark machine.
@TomAugspurger I think I see what you mean with the "too much noise", but could you elaborate? I'm guessing what you mean is that the performance of CI machines depends on instantaneous load, so the benchmarks would not be amazingly reliable. If we only run in "compare to master" mode on the CI, can we just set the threshold for "benchmarks have changed significantly" high enough to account for the potential change in load? Alternatively, I wonder if we can convince it to fall back to a "best-of-three" mode or something (like, if there's a detected significant change in performance, it waits 5 minutes, then re-runs the benchmark, then waits 5 minutes and does it again). With regards to adding ...
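To make the "best-of-three" idea concrete, here is a hypothetical sketch (none of this is asv API; run_benchmark, baseline, and the threshold value are stand-ins) of only reporting a regression when it survives a few spaced-out re-runs:

import time

def confirmed_regression(run_benchmark, baseline, threshold=1.25,
                         retries=3, wait=300):
    # Re-run the benchmark a few times, waiting between attempts, and only
    # treat it as a real regression if every attempt exceeds the threshold.
    for attempt in range(retries):
        if run_benchmark() / baseline < threshold:
            return False  # at least one clean run: likely CI-load noise
        if attempt < retries - 1:
            time.sleep(wait)  # wait out a possible load spike
    return True  # slow on every attempt: probably a real regression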
@jbrockmendel Is this a WIP type thing or is it ready to be merged? Right now AppVeyor doesn't seem to be running any builds at all, so we're blocked on that for merging, but I'd like to merge this ASAP so that I (and others) can start creating benchmark suites for the other modules - I've got some plans in particular for the ...
I think it can be merged. It isn't user-facing, so worst case scenario we eventually replace it with something better.
On the various how-to-calibrate issues, I think the short-term solution is to recognize that:
asv_bench/asv.conf.json (outdated)

// skipped for the matching benchmark.
//
"regressions_first_commits": {
    ".*": "v0.20.0"
I think this is a pandas thing? Should we just drop this?
Good catch.
This doesn't seem amazingly reproducible. Using the current version of this branch, some runs report significant changes to the benchmarks and some don't. (For these runs I changed it so that each for loop runs 5000 times instead of 1000 times to see if it would be more consistent.)
Yah, I'm not thrilled with the consistency either. I've opened up a new issue over there to discuss ways of improving it. For the time being I still think this is better than nothing and we should just crank up the iterations.
@jbrockmendel I don't know if cranking up the iterations helps. I get roughly the same inconsistency with 1 iteration or 5000. This may be a matter of running this on a benchmark machine instead of on my laptop or CI, but that kinda reduces the usefulness of this if the measurements are meaningless on your local machine. The blog post that Pablo linked to has some tips (though I'm not really sure of the best way to put them into practice). There's also pytest-benchmark, but I'm not sure that would be any better (and honestly I think ...
Hmm, if upping the iterations doesn't help, that seems... weird. Anyway, I think we should merge so that we can tinker with it on independent branches (and avoid merge conflicts) while figuring out if/how it can be made useful.
@jbrockmendel I dunno, if it's not useful I don't see any point in it. I am not going to start writing benchmark tests until I know I'm measuring something meaningful, so I'll just tinker with it locally. That said, I didn't create a merge conflict - I cleaned up your history and force-pushed it to prepare for merge, then I realized how inconsistent the measurements were.
Fair enough. The question becomes what we do if there isn't a "fix" and benchmarking is just fundamentally difficult. Maybe pytest-profiling? Not so much for the comparison but for identifying hotspots using existing cases.
I think we just need to figure out what the effort/reward trade-off is here. I'm willing to spend a bit of time figuring this out because it's very useful as a general rule, but if in the end it turns out that the options are 1) write or maintain a complicated benchmarking suite or 2) provide something that is more likely to be misleading than helpful, I'd say we just give up on benchmarking until we can free-load on someone who has already done 1 (which may be ...). In the end it may be that we set up ...
Yah, this is a tough one because 1) I very much agree about not wanting to build/maintain a tool for this, but 2) several other recent threads would have been improved with robust+reliable profiling info. I'll follow up on the issue I opened with asv; otherwise not a lot of ideas. A brief poke at pytest-profiling didn't wow me.
@jbrockmendel I think we can leave this open. I'm not saying I don't want to merge this, I just want to take a bit of time to understand how, and more importantly how not, to use it.
For proof-of-concept I just adapted a handful of the existing test cases. If we go this route we'll want to adapt a broader sample. asv has its share of rough edges, but is way better than trying to roll our own solution. See https://asv.readthedocs.io/en/latest/using.html
Two types of usage. First, profiling:

This particular result is from before the commit that got rid of the min(...) calls.

Second type of usage is comparison:

Looking at these results with a critical eye, the first two are statistical noise, since the code paths are unchanged. The other three are real, so we just got a non-trivial performance improvement.