Proof of Concept: Implement asv for profiling+comparison #580
Conversation
@@ -0,0 +1,103 @@
{
    // The version of the config file format. Do not change, unless
The GitHub syntax highlighting is going nuts on this - is this a variation of JSON that allows comments?
I guess? I just used the pandas version as a template.
asv_bench/benchmarks/parsing.py (outdated)

class Parsing(object):
    def time_parse_iso(self):
        for n in range(1000):
Do we need to repeat it many times, or can we configure asv to do that?
I'm curious how big a role caching is going to play here and how misleading that will be.
asv does run each of the benchmarks multiple times, I think targeting a total run time of 0.1 seconds by default (configurable). I haven't mastered the semantics it uses. In some circumstances asv spawns subprocesses to avoid caching influencing the results.
Originally I didn't include these loops at all, but in the profiling it only ran once and I wanted a bigger sample.
@jbrockmendel Interesting. I wonder if it is better to just trust asv to run the task multiple times as it sees fit, or if it's better to do some averaging on top of its averaging.
My suspicion is that we'll want the bulk of our tests to be single-run tests and let asv be smart about how to run them, but since caching is a reality we'll want to deliberately induce it with some sort of loop at times.
@pablogsal Do you have any opinions on this? From our conversations in London it seemed like you have some experience with profiling (or at least timing).
asv uses the Python timeit module and some of IPython's %timeit magic, so I would not consider this reliable/reproducible for timings of less than 0.1 seconds. [1] [2]
< TLDR >
The TLDR is that you have to select what to measure: average execution time or "platonic" execution time (the fastest, and usually not very representative). When you do this you need to understand how timeit, %timeit, and other tools measure. If your individual executions are fast, there is a ton of noise that affects the measurements.
< /TLDR >
If you don't want to limit the noise (which is very complicated), then another approach is to limit/understand the effect of the noise using statistics. The distribution of the time measurements is skewed; assume it is roughly of the form P(X = x) = a * e^(-a(x - b)) for x > b and 0 otherwise (where X is a random variable and a, b are real numbers). The typical procedure to obtain stable benchmarks by limiting the noise is then to measure the execution n times, take the mean of those values, and repeat this r times.

The distribution representing the means is Y = (X1 + X2 + ... + Xn)/n. By the central limit theorem, Y is asymptotically distributed as N(mu, sigma), the normal distribution with mean mu and standard deviation sigma. This is a profound result: you don't need to worry about the skewness or the higher central moments of the distribution of errors, because the means of the execution times are always asymptotically normally distributed. Reporting the mean and the standard deviation of a statistically significant number of experiments is therefore enough, since the limiting normal distribution is fully characterized by those two numbers. Notice that this holds even if the errors all fall on one side of the true execution time.

This is answering a different question, namely "what is the most representative execution time on this particular machine", as opposed to "what is the platonic execution time for this set of instructions". IMHO users want reproducibility and some sense of stability, even under noisy and unstable conditions.
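A minimal sketch of that measure-n-times-and-average procedure, using only the standard library (the helper name, the statement being timed, and the n/r values are illustrative, not anything asv provides):

import statistics
import timeit

def timed_mean_and_std(stmt, setup, n=1000, r=30):
    # repeat() returns r totals, each the time for n back-to-back executions;
    # dividing by n turns each total into one sample mean.
    totals = timeit.Timer(stmt, setup=setup).repeat(repeat=r, number=n)
    sample_means = [t / n for t in totals]
    # Per the central limit theorem, the sample means are approximately
    # normal, so mean +/- standard deviation is a reasonable summary.
    return statistics.mean(sample_means), statistics.stdev(sample_means)

mean, std = timed_mean_and_std("parse('2017-09-12')",
                               setup="from dateutil.parser import parse")
print("%.3e s +/- %.3e s per call" % (mean, std))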
I think it also depends on how serious you want this to be. A lot of people use asv because they don't care that much about the inexactitudes, or because they don't know better. That said, asv does a lot of good work to set up the environment and produce proper metrics (you can select warmup_time, repeat, number, and sample_time among other options), which I think you can use to get very close to stable benchmarking. Another thing you probably want to do on the machine you run this on is to deactivate frequency switching on the CPU that you are using and isolate one CPU core for running the benchmarks.
If you use asv you should try to keep the benchmarking functions as short as possible, leaving the measurement and the control to asv itself.
@pablogsal I think the goal here is mostly going to be differential benchmarks. I think for now we can take the performance metric to be "does set of instructions X perform faster than set of instructions Y", and the competing interests are (likely in order of prevalence):
A.) On the average machine, does it perform better in a tight loop
B.) On the average machine, does it perform better when high-latency is needed
C.) Does it perform better in resource-constrained environments.
I imagine that what we'll be doing with asv is trying to find a model that gives approximate answers to these questions given that it will be run on only one machine. For practical purposes, I think that we can easily accept errors that are on the order of magnitude of the natural variation of the project (e.g. if some machines perform faster with instructions X and some faster with instructions Y, or 1 STD of the variation in the differential performance on various common inputs).
Given imperfect tools to do this (and given that I am not prepared to maintain a separate benchmarking tool), I'd like to use the best configuration for asv (or another standard benchmarking tool if there's a better one) that optimizes these goals and lets me make actionable decisions about algorithm choices, even if the decision is "the difference between these is not statistically significant".
So, given that, do you have thoughts about whether we should write tests like:
def time_parse_iso(self):
    for n in range(1000):
        parse('2017-09-12')

so that we can add some averaging into the process and make it effectively "longer", or should we write tests like:

def time_parse_iso(self):
    parse('2017-09-12')

and trust that asv will figure out the best way to benchmark them for us?
If the main concern is the way that asv does things, probably it's best for us to open an issue and possibly a PR there.
@pganssle IMHO you should write tests like this:

def time_parse_iso(self):
    parse('2017-09-12')

and configure warmup_time, repeat, number, and sample_time (in asv) accordingly to have enough samples and a good average.
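For instance, a benchmark class could set those knobs as asv timing attributes rather than looping by hand (the class name and the specific values here are illustrative placeholders, not recommendations):

from dateutil.parser import parse

class TimeISOParsing(object):
    # asv reads these attributes to control how the benchmark is sampled;
    # the values are examples only.
    warmup_time = 0.5   # seconds of untimed warm-up before sampling
    number = 100        # executions per sample
    repeat = 20         # samples collected per benchmark run
    sample_time = 0.25  # target wall-clock seconds per sample

    def time_parse_iso(self):
        parse('2017-09-12')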
> If the main concern is the way that asv does things, probably it's best for us to open an issue and possibly a PR there.

I opened a couple last week, very wary of ...

> and given that I am not prepared to maintain a separate benchmarking tool

The subprocess indirection asv uses makes it tough to extend the API (at least for one newbie), but I'm cautiously hopeful that some of those rough edges may get sanded down.
dateutil/parser/_parser.py (outdated)

        return self._weekdays[name.lower()]
    except KeyError:
        pass
    try:
Is this part of the PoC or something you actually want merged?
PoC to demonstrate the asv continuous comparison. It should go into a separate PR; it's a pretty nice perf improvement.
I'll revert this part in the next draft.
@jbrockmendel Looks great so far, I like it. I'm thinking we should add it as an allowed-failure job in the Travis CI, unless there's another CI provider that might be better.

@TomAugspurger any thoughts on best practices here?
I'm not sure about running asv on CI workers. Everything I've heard is that it's too noisy to be reliable.

We could add dateutil to https://github.com/tomaugspurger/asv-runner and start tracking them on the pandas benchmark machine.
@TomAugspurger I think I see what you mean with the "too much noise", but could you elaborate? I'm guessing what you mean is that the performance of CI machines depends on instantaneous load, so the benchmarks would not be amazingly reliable. If we only run in "compare to master" mode on the CI, can we just set the threshold for "benchmarks have changed significantly" high enough to account for the potential change in load? Alternatively, I wonder if we can convince it to fall back to a "best-of-three" mode or something (like, if there's a detected significant change in performance, it waits 5 minutes, then re-runs the benchmark, then waits 5 minutes and does it again). With regards to adding ...
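To make the "best-of-three" idea concrete, here is a hypothetical sketch (none of this is asv API; run_benchmark, baseline, and the threshold value are stand-ins) of only reporting a regression when it survives a few spaced-out re-runs:

import time

def confirmed_regression(run_benchmark, baseline, threshold=1.25,
                         retries=3, wait=300):
    # Re-run the benchmark a few times, waiting between attempts, and only
    # treat it as a real regression if every attempt exceeds the threshold.
    for attempt in range(retries):
        if run_benchmark() / baseline < threshold:
            return False  # at least one clean run: likely CI-load noise
        if attempt < retries - 1:
            time.sleep(wait)  # wait out a possible load spike
    return True  # slow on every attempt: probably a real regression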
@jbrockmendel Is this a WIP type thing or is it ready to be merged? Right now AppVeyor doesn't seem to be running any builds at all, so we're blocked on that for merging, but I'd like to merge this ASAP so that I (and others) can start creating benchmark suites for the other modules - I've got some plans in particular for the ...
I think it can be merged. It isn't user-facing, so worst case scenario we eventually replace it with something better.
On the various how-to-calibrate issues, I think the short-term solution is to recognize that:
asv_bench/asv.conf.json (outdated)

// skipped for the matching benchmark.
//
"regressions_first_commits": {
    ".*": "v0.20.0"
I think this is a pandas thing? Should we just drop this?
Good catch.
This doesn't seem amazingly reproducible. Using the current version of this branch, some runs report significant changes to the benchmarks and some don't. (For these runs I changed it so that each for loop runs 5000 times instead of 1000 times to see if it would be more consistent.)
Yah, I'm not thrilled with the consistency either. I've opened up a new issue over there to discuss ways of improving it. For the time being I still think this is better than nothing and we should just crank up the iterations.
@jbrockmendel I don't know if cranking up the iterations helps. I get roughly the same inconsistency with 1 iteration or 5000. This may be a matter of running this on a benchmark machine instead of on my laptop or CI, but that kinda reduces the usefulness of this if the measurements are meaningless on your local machine. The blog post that Pablo linked to has some tips (though I'm not really sure of the best way to put them into practice). There's also pytest-benchmark, but I'm not sure that would be any better (and honestly I think ...
Hmm, if upping the iterations doesn't help, that seems... weird. Anyway, I think we should merge so that we can tinker with it on independent branches (and avoid merge conflicts) while figuring out if/how it can be made useful.
@jbrockmendel I dunno, if it's not useful I don't see any point in it. I am not going to start writing benchmark tests until I know I'm measuring something meaningful, so I'll just tinker with it locally. That said, I didn't create a merge conflict - I cleaned up your history and force-pushed it to prepare for merge, then I realized how inconsistent the measurements were.
Fair enough. The question becomes what we do if there isn't a "fix" and benchmarking is just fundamentally difficult. Maybe pytest-profiling? Not so much for the comparison but for identifying hotspots using existing cases.
I think we just need to figure out what the effort/reward trade-off is here. I'm willing to spend a bit of time figuring this out because it's very useful as a general rule, but if in the end it turns out that the options are 1) write or maintain a complicated benchmarking suite or 2) provide something that is more likely to be misleading than helpful, I'd say we just give up on benchmarking until we can free-load on someone who has already done 1 (which may be ...). In the end it may be that we set up ...
Yah, this is a tough one because 1) I very much agree about not wanting to build/maintain a tool for this, but 2) several other recent threads would have been improved with robust+reliable profiling info. I'll follow up on the issue I opened with asv; otherwise not a lot of ideas. A brief poke at pytest-profiling didn't wow me.
@jbrockmendel I think we can leave this open. I'm not saying I don't want to merge this, I just want to take a bit of time to understand how, and more importantly how not, to use it.
For proof-of-concept I just adapted a handful of the existing test cases. If we go this route we'll want to adapt a broader sample. asv has its share of rough edges, but is way better than trying to roll our own solution. See https://asv.readthedocs.io/en/latest/using.html
Two types of usage. First, profiling:

This particular result is from before the commit that got rid of the min(...) calls.

Second type of usage is comparison:

Looking at these results with a critical eye, the first two are statistical noise, since the code paths are unchanged. The other three are real, so we just got a non-trivial performance improvement.