Optimization contributions should be justified with empirical evidence #8584
Conversation
|
Just for context, are there some specific PRs which led to this? The idea of having this paragraph seems fine, but I think the language could be friendlier - I don't think someone should be discouraged from opening a PR with an optimisation; instead they could be asked for help in gathering more numbers, or given an indication of whether, with numbers, the optimisation might then be accepted. |
|
I can think of a couple specific optimization PRs that were submitted without giving, at first, concrete evidence of performance improvements, but I would rather not point at any specific one right now because I wouldn't want the authors to feel guilty about it.
My understanding is that we don't want to accept optimization PRs that don't provide clear performance improvements in well-understood scenarios. Asking for performance numbers upfront sounds quite reasonable to me. Raising the barrier of entry for optimization PRs might be the right choice if that means less time collectively spent (for submitters and reviewers) discussing optimizations that don't make it in the end. (It feels a bit weird to write this as I just opened a stdlib PR without testsuite coverage for the new functions, which is just as bad.) |
|
I'm not really sure why it's necessary to restrict contributions. If a particular PR doesn't come with sufficient evidence, then the reviewers can always ask for such. As I have mentioned before, it is also not always possible to determine that a particular optimisation is beneficial by using numbers, as the improvement may (for example) be within noise. However multiple such optimisations may bring noticeable value, especially when considering second-order effects, such as code size and cache behaviour. |
|
In that case we could review and benchmark/measure these "multiple such optimisations" together. |
|
Except that you might not have written all of the optimisations at the time when one is ready for submission. I really don't see why this cannot be left to the discretion of the reviewers. |
|
Here is what Alain wrote:
I don't see a situation where it would be reasonable to accept an optimization, with the regression risks that it implies, without any factual evidence that it actually makes something better for someone -- which is the minimal standard being demanded in this text. Clarifying this fairly minimal and reasonable standard in CONTRIBUTING.md sounds like an excellent idea to me, and the somewhat surreal discussion we are having about it would tend to strengthen this opinion. |
|
From my point of view, based on a fair amount of experience in this area, this isn't a surreal discussion. I would have appreciated being able to reach an agreement before the pull request was approved. The discussion above and the text differ as to whether actual numbers are required. Was the intention for them to be required? |
|
I don't have a problem with requiring some evidence that the optimisation has a benefit, but I do think we should allow for indirect evidence of the benefit -- for example, concrete numbers for a measure like icache misses when hoping to improve execution time -- simply because sometimes the direct measurement is much noisier than the indirect measurement. Do people think that the wording allows for that or not? |
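(As an aside on the noise issue raised above: one common way to cope with run-to-run variance is to repeat the timing many times and report robust statistics such as the minimum and median rather than a single measurement. Below is a minimal sketch of such a harness in OCaml; the function names, workload, and iteration counts are purely illustrative and not taken from any PR under discussion.)

```ocaml
(* Hypothetical micro-benchmark harness: run a workload [runs] times,
   each sample doing [iters] iterations, then report the minimum and
   median sample, which are less sensitive to noise than a single run. *)
let time_runs ~runs ~iters f =
  let sample () =
    let t0 = Sys.time () in
    (* [Sys.opaque_identity] keeps the compiler from optimising the
       workload away entirely. *)
    for _ = 1 to iters do ignore (Sys.opaque_identity (f ())) done;
    Sys.time () -. t0
  in
  let samples = List.init runs (fun _ -> sample ()) in
  let sorted = List.sort compare samples in
  let min_t = List.hd sorted in
  let med_t = List.nth sorted (runs / 2) in
  Printf.printf "min %.6fs  median %.6fs\n" min_t med_t

let () =
  (* Example workload: summing a list; this stands in for whatever code
     path the optimisation under review is supposed to improve. *)
  let xs = List.init 10_000 (fun i -> i) in
  time_runs ~runs:11 ~iters:1_000 (fun () -> List.fold_left ( + ) 0 xs)
```

Pasting the min/median from such a harness, before and after the change, is roughly the kind of upfront evidence being discussed here.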
|
In my mind cache misses (or branch mispredictions, cycle count, etc.) fall under the scope of "something which can be measured in an objective way" in the first sentence above:
So I would say "yes they are in scope". Would you like to propose a better wording to capture your idea of "indirect evidence"? |
The current wording is probably fine. I just wanted to make sure that people were on the same page about that aspect. |
|
The ultimate goal of optimizations is usually to make things faster; reducing code size and cache misses usually goes in the right direction, so it seems fine to optimize for them, but the case for including a PR will be more easily made if the thing being measured is the thing that end users actually care about. Anyway, the goal of this PR is not to set rigid constraints in stone, but to encourage a minimal standard of empirical testing; I assume that people working on optimization look at results upfront anyway, and it's a bit unproductive when the reviewer needs to ask for them. |
|
(An interesting case is #8596. I believe it's a useful one, but reviewing such a PR takes time, and the discussion shows that the safety of the optimization is not completely straightforward. Luckily, the gains mentioned in #6148 are significant and, in my opinion, justify the time spent on the PR and the increased fragility of the code w.r.t. future evolutions. If the gains were around 1 or 2%, I'd say it would probably have been better not to do the optimization.) |
|
Let's not waste too much time on this possibly surreal discussion. @mshinwell: are you ok with the current text? If not, do you have suggestions to improve it, or are you globally against the idea of formalizing minimal expectations? |
|
I'm not really happy with the text as it stands. I'm also a bit concerned that there seem to be undertones in this discussion suggesting that people are just writing optimizations for the sake of it, without due thought being given to their applicability -- in general I would be surprised if that were the case. Here is a revised proposal:
Contributions to improve the compiler's optimization capabilities are welcome. However, due to the potential risks involved with such changes, we ask the following of contributors when submitting pull requests:
A major criterion in assessing whether to include an optimisation in the compiler is the balance between the increased complexity of the compiler code and the expected benefits of the optimisation. Contributors are asked to bear this in mind when making submissions. The use of formal methods to increase confidence is encouraged. |
No, really, I'm convinced that people actually do the benchmarks. The main idea here is to make it clear that sharing those results upfront, without the reviewer having to ask for them, improves the decision process. There is also the idea that micro-benchmarks are good at showing potential benefits, perhaps better than larger-scale existing applications (where performance-critical code might have been written in a style aware of previous limitations of the optimizer). I'm ok with your alternative proposal, except for the claimed rationale ("However, due to the potential risks involved with such changes"), which sounds overly limited. Taking up reviewer time and increasing code complexity and maintenance costs are also important criteria (they are shared with any kind of contribution, though, so perhaps this is obvious). Waiting for others to comment before switching the PR to your version. |
|
I tried to address the code complexity issue in the last paragraph. I don't think we need to explicitly mention reviewers' time; as you say, that's the same for any submission. |
|
I also like Mark's proposal. I would suggest adding a list item about testing and correctness:
|
|
I think Mark's proposal is good. The rationale ("Due to the risks ...") seems reasonable enough to me: if there weren't any risks then it would be fine to accept changes without evidence that they improve things. (Wasting reviewers' time and increasing code complexity are among the possible risks, which could be enumerated for clarity.) More generally, it certainly seems appropriate to insist on some kind of measurement and at least one improvement. If there's no way to show an improvement then "optimization" isn't really the right term. Concretely, I'd drop the sentence beginning "If the improvement provided by an optimisation is difficult to measure on its own" from Mark's proposal, and perhaps instead say explicitly that measurements showing clear benefits when combined with some other optimization/change are acceptable. |
|
FTR I also like Mark's proposal. |
|
@alainfrisch there seems to be a consensus around Mark's proposed revision. Could you adopt it, and then we merge? |
(Force-pushed 320bff0 to 28d5817.)
|
Ok, I've switched to Mark's version, including the suggestions from Gabriel (put together with Mark's final sentence) and Jeremy. I'm still unsure about:
(Code size is something which can be measured directly, I don't see why one would want to measure cache misses instead.) |
|
I think the intention is to say that measuring cache misses rather than execution time can be used to justify changes that reduce code size. As you say, you could just measure code size itself and say that is good enough -- but some would argue that code size is also an indirect measure since it is probably execution time that you actually care about. Maybe you can improve the wording to make the point clearer. |
|
Thanks for the explanation. But I suspect that in the vast majority of cases, achieving a significant reduction of cache misses (through reduced code size or other means) should lead to observable speedups, at least on micro-benchmarks (and if one cannot exhibit such benchmark, perhaps the optimization is not worth the trouble). Anyway, if everyone else likes the current version, I'm not opposed to merging as is. |
gasche left a comment:
Personally I think that the fine details of this discussion do not matter as much as the fact that it now exists, so we shouldn't be too perfectionist about it.
|
Thanks @gasche and to everyone involved in this discussion. Now let's all make OCaml faster/smaller/better! |
I suggest documenting in CONTRIBUTING.md that we expect "optimization" contributions to come with at least some empirical results to illustrate their usefulness.