Optimize some local functions#2143
Conversation
gasche
left a comment
I am pleasantly surprised by how simple the changes are. It looks like a worthy optimization if we can get it within this complexity budget.
One thing that is a bit too subtle for my taste in the current implementation is the fact that while general occurrence-counting is implemented in an ugly and efficient way (pushing and popping integer references from an environment hashtable), tailcall-occurrence counting is implemented using an elegant and inefficient iterator (if I understand correctly, let ... in let ... in let ... in ... risks a quadratic worst-case).
Using a general traversal for occurrence counting would be shorter, easier to read, and also even more quadratic. On the other hand, using an optimized implementation of tailrec-counting is actually very tricky (I first thought it would be easy), because "being in tail position" is not an absolute notion, it is relative to one parent in the AST.
I think the following algorithm would work:
- During the one-pass iteration, keep a hashtable to count arbitrary occurrences of each variable, but also a hashtable to count tail-position calls to each variable.
- Use an iterator on all subterms, which passes the tail-rec status:
`(in_tail_position:bool -> Lambda.t -> Lambda.t) -> Lambda.t -> Lambda.t`.
- When you recurse on a subterm which is in tail position, nothing special happens (you continue iterating, using the logic in the current PR).
- When you recurse on a subterm which is not in tail position, you temporarily forget about the current hashtable counting tail calls (for example by pushing it onto a stack of disabled hashtables, to be restored later) and start with a fresh empty table. As a result, only the uses of the variables that will be bound deeper in the subterm will be counted as tail-recursive -- as we need. When you are done traversing that part, you restore the old tail-counts hashtable.
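The algorithm above can be sketched as follows. This is a minimal illustration on a toy lambda type (NOT the compiler's `Lambda.t`; all names are hypothetical and variable shadowing is ignored): entering a non-tail subterm pushes the current tail-count table onto a stack of disabled tables and starts afresh, and at each `Let` we compare total and tail counts.

```ocaml
type lam =
  | Var of string
  | Call of string * lam list      (* full application of a named function *)
  | Let of string * lam * lam      (* let x = e1 in e2 *)
  | If of lam * lam * lam

(* Returns the let-bound names whose every use is a tail call. *)
let tail_only (t : lam) : string list =
  let occ = Hashtbl.create 16 in           (* all occurrences *)
  let tail = ref (Hashtbl.create 16) in    (* tail-call occurrences, current scope *)
  let disabled = Stack.create () in
  let results = ref [] in
  let bump tbl id =
    Hashtbl.replace tbl id (1 + (try Hashtbl.find tbl id with Not_found -> 0)) in
  let get tbl id = try Hashtbl.find tbl id with Not_found -> 0 in
  (* count a subterm in non-tail position: disable the current tail table *)
  let rec non_tail t =
    Stack.push !tail disabled;
    tail := Hashtbl.create 16;
    count t;
    tail := Stack.pop disabled
  (* count a subterm in tail position *)
  and count t = match t with
    | Var id -> bump occ id                 (* a plain use is never a tail call *)
    | Call (f, args) ->
        bump occ f; bump !tail f;           (* call in tail position *)
        List.iter non_tail args             (* arguments are never tails *)
    | Let (f, e1, e2) ->
        non_tail e1;
        count e2;
        if get occ f > 0 && get occ f = get !tail f then
          results := f :: !results          (* every use of f is a tail call *)
    | If (c, e1, e2) ->
        non_tail c; count e1; count e2
  in
  count t;
  List.rev !results
```

Because each non-tail boundary gets a fresh table, a call to `f` buried inside, say, a function argument is recorded only in a table that is discarded before `f`'s own `Let` is checked, which is exactly the desired behaviour.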
bytecomp/simplif.ml
Outdated
| Llet
    ( str, kind, id,
      Lfunction {lf with kind=Curried; params; body},
Why are you repeating kind=Curried and params which haven't changed, if I understand correctly?
bytecomp/lambda.ml
Outdated
let rec g lam = f (shallow_map g lam) in
g

let iter_tail f = function
This function needs a careful second review to ensure that it is not overly optimistic in which tails are tail-calls (if we get this wrong, the static-raise optimisation is unsound). For example the Lstaticcatch case was a bit subtle to me, although I believe this one is indeed correct.
You're right that one should be careful with quadratic behavior (even though the constant factor should be very small here). One possibility, which might be equivalent to your proposal, would be to traverse the term, keeping track during the descent of a "non-tailness level" (incremented each time we navigate to a subterm in non-tail position). The traversal would store the maximum level for each identifier, which makes it possible to tell, on the way up, whether a given identifier is only used in tail-call position.
Yes, I thought of this level idea as well at first, but I thought it was a bit more complex than my next proposal (by cutting the previous binders at each non-tail point, I make sure that only the identifiers of non-tailness level below the current one remain in the active table). Thinking about the two again, maybe the "level" idea is easier to explain because it relies less on the dynamics of the traversal. Either implementation would be fine. In both cases, I would welcome efforts to keep the code readable by moving this counting logic to a named auxiliary function, and/or having a comment to explain what is going on.
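For comparison, the "non-tailness level" variant can be sketched the same way, again on a hypothetical toy lambda type (not the compiler's `Lambda.t`, shadowing ignored): the level is incremented each time we descend into a non-tail subterm, and on the way up an identifier bound at level `l` is tail-only if the maximum level of its uses does not exceed `l`.

```ocaml
type lam =
  | Var of string
  | Call of string * lam list
  | Let of string * lam * lam
  | If of lam * lam * lam

let tail_only (t : lam) : string list =
  let max_level = Hashtbl.create 16 in  (* id -> deepest non-tailness level of a use *)
  let results = ref [] in
  let use id l =
    let prev = try Hashtbl.find max_level id with Not_found -> min_int in
    Hashtbl.replace max_level id (max prev l)
  in
  let rec count level t = match t with
    | Var id -> use id (level + 1)      (* a non-call use is never a tail call *)
    | Call (f, args) ->
        use f level;                    (* tail call iff level unchanged since binding *)
        List.iter (count (level + 1)) args
    | Let (f, e1, e2) ->
        count (level + 1) e1;           (* bound expression: one level deeper *)
        count level e2;                 (* body keeps the current level *)
        (match Hashtbl.find_opt max_level f with
         | Some l when l <= level -> results := f :: !results
         | _ -> ())
    | If (c, e1, e2) ->
        count (level + 1) c;
        count level e1;
        count level e2
  in
  count 0 t;
  List.rev !results
```

This version carries no mutable stack of tables; the traversal dynamics are replaced by a single integer comparison at each binder, which is arguably the easier invariant to document.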
gasche
left a comment
The new implementation makes sense to me, although I think that a comment indicating what tbl, depth and static are for could help newcomers understand what the code is doing. Why did you decide to remove the "inline single-use functions" case? Is it subsumed by another simplification?
bytecomp/lambda.mli
Outdated
   expression.
*)

val shallow_iter: tail:(lambda -> unit) -> non_tail:(lambda -> unit) -> lambda -> unit
This function is fine, but personally I think that `(tail:bool -> lambda -> unit) -> lambda -> unit` would be a bit easier to use than the current proposal. It makes it easy to ignore the information (you can completely replace the main iter function with this generalization, instead of having to provide a separate function because the API is a bit inconvenient), and I think in most cases tail and non_tail are going to share most of their logic, so marking the differences with `if tail ...` inserts makes a lot of sense.
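The two API shapes under discussion can be contrasted on a toy stand-in type (the `lambda` type and function names below are hypothetical, not the compiler's): with the flag-passing shape, the undiscriminating iterator comes for free.

```ocaml
type lambda = Leaf of int | Node of lambda * lambda  (* stand-in AST *)

(* two-callback shape, as in the PR *)
let shallow_iter ~tail ~non_tail = function
  | Leaf _ -> ()
  | Node (l, r) -> non_tail l; tail r   (* pretend only the right child is a tail *)

(* flag-passing shape, derived from it *)
let shallow_iter' (f : bool -> lambda -> unit) t =
  shallow_iter ~tail:(f true) ~non_tail:(f false) t

(* the plain iterator just ignores the flag *)
let iter_head_constructor f t = shallow_iter' (fun _ -> f) t
```

Note that `shallow_iter' (fun _ -> f)` builds a closure around `f`, which is the per-node allocation concern raised in the reply about `iter_head_constructor`.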
you can completely replace the main iter function with this generalization, instead of having to provide a separate function because the API is a bit inconvenient
The current approach allows defining the previous iter_head_constructor (which does not distinguish between the two kinds of sub-nodes) without having to allocate a closure at each node; and for the only use of the more fine-grained version (in Simplif), it works quite well to have two different callbacks.
(The commit message says "pseudi linear-time" instead of "pseudo", I think.)
@gasche: thanks for the live review 😄 The latest version is a generalization of the two cases initially described: it turns a function into a static handler as soon as all uses of the function are full applications, all of them in the same "tail scope". This covers in particular the case of a function used only once (including module-level function bindings, i.e. a function introduced in a module but not exported and used only once in the module). Nested functions are supported as well; for instance, all local function bindings are eliminated in:

```ocaml
let foo x y =
  let f x = x * x * x in
  let ret n = f (x + n * y) in
  let a =
    if x mod 2 = 0 then ret 3
    else if y mod 2 = 0 then ret 4
    else f 10
  in
  a + 1
```
(Ah, there is something wrong in the current version, I'm investigating...)
True, but this might be a bit performance sensitive (called for each non-tail lambda), so I felt that the extra 3 lines of code were worth it.
Yes, of course. If the PR is accepted, it should be squashed I think, so I won't bother rewriting the history to fix this typo.
@lpw25 and I discussed this proposal; here are some thoughts. (This was a comment on your first version.)

As background: one of the main design principles behind the next version of Flambda is to try to ensure that, even in the presence of heavy optimisation, users are able to predict and understand what the compiler is doing. The current compiler is good in this regard and we would like not to compromise that. The principle extends, in particular, to avoiding "cliff edges" where optimisations suddenly do (or do not) kick in and cause sudden performance changes. One place where "cliff edges" can occur is with optimisations that depend on the number of uses of some particular thing, such as the first optimisation proposed here: adding a second use of a local function could suddenly cause a non-negligible change in performance.

As such we think we should be a bit careful here, both in terms of the specification of the optimisation and in its implementation. In terms of the specification, it should be syntactically obvious when the optimisation is going to take effect, so we can give a clear criterion in the manual that is easily checked by the programmer. In terms of the implementation, this probably means doing the analysis prior to ... Writing high-performance code sometimes requires manual guidance to be given to the compiler.

It's worth noting that these optimisations, perhaps surprisingly, are not always a win. For example, for the tail optimisation, if the function is small it might be better just to inline it at each of the use sites (especially if the inlined versions are then variously simplified using information about the function's arguments). Another problem is interaction with inlining of the outer function: if the outer function gets larger by virtue of containing the code of the inner local function as a result of these optimisations, it might prevent the outer function from being inlined, which could be worse.

These kinds of interactions, along with the ability (if we do these optimisations in Flambda instead of at the Lambda level) to integrate their results into the inlining reports, tend to support an approach where the relevant sub-terms are marked for later analysis. We could just disable the transformation on Lambda and do it later, somehow integrating these optimisations into the usual inlining benefit calculations, or perhaps having them as a separate phase after inlining. (We haven't spent enough time to really get to grips with how this would work yet, but these are some preliminary ideas.) That said, we also suspect that applying the tail optimisation prior to Flambda (but not the used-only-once optimisation) might not compromise optimisation in Flambda 2.0, because its whole modus operandi is that of static catch handlers (continuations).

Another issue relates to closed local functions, which will be lifted out, so there won't be any closure allocation. There will still be register shuffling to match the calling convention, but on the other hand, so will there probably be under these optimisations, where the shared continuation has its own calling convention assigned by the register allocator. As such, we think it would be better to phrase both optimisations in terms of guaranteeing lack of closure allocations: "a local function syntactically used only once, or syntactically used only in tail position, is guaranteed not to allocate a closure". Then the compiler can choose whether to lift the function out or apply one of these optimisations.

Given that these optimisations are effectively stand-alone, I also recommend they should be in a separate module, rather than lumping them together into ...
I wonder how this could be achieved with non-trivial heuristics to decide whether to inline or not. Inlining can have a huge impact in terms of further simplification (and thus avoiding allocations, or even some pure calculations), but it would be crazy to inline blindly as soon as at least one such "possibly important" optimization is made possible. So there need to be heuristics and probably thresholds, and I don't see how you can achieve the smoothness and predictability you are aiming at. The current compiler already has plenty of "cliff edges". For instance, turning references into mutable variables is disabled as soon as there is one occurrence of the reference being passed to another function; or, we used to have the property that any use of a local float variable would disable its unboxing if the context expected a boxed version (luckily, this is now more "continuous").
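The reference example above can be made concrete. This is a hypothetical illustration (invented names) of the existing cliff edge: Simplif compiles a non-escaping `ref` into a mutable variable with no allocation, but a single use that passes the ref to another function disables the optimization and brings the allocation back. Both functions compute the same result.

```ocaml
let sum_to n =
  let acc = ref 0 in                  (* eligible: acc never escapes *)
  for i = 1 to n do acc := !acc + i done;
  !acc

let consume (r : int ref) = ignore r  (* any function taking the ref *)

let sum_to_escaping n =
  let acc = ref 0 in
  consume acc;                        (* one escaping occurrence: acc must stay a real ref *)
  for i = 1 to n do acc := !acc + i done;
  !acc
```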
I understand all that you are saying, but do you agree that this is really not how things are done today? The current proposal seems to be aligned with optimizations as currently implemented at the lambda level in Simplif. Also, I'm rather sensitive to compiler performance and simplicity of the compiler code base, and I'd prefer not to complicate the current implementation nor introduce more intermediate languages. This would probably exceed the "complexity budget" @gasche mentioned. More advanced approaches should be left to flambda(2), I think.
The examples you give are useful, but I think they apply equally to inlining itself (inlining an inner function can make the outer function less inlinable). I expect this to be handled by flambda(2) (which can disable the new optimization or not). flambda could also decide to re-introduce functions for all "static handlers" (or even all sub-expressions) and base its inlining decisions independently of actual function boundaries in the source code. I'd really appreciate it if the current PR could be evaluated independently of flambda(2). Considering how simple its implementation is and how straightforward it is to enable/disable, we can always revisit its relevance later.
I'm not sure I understand your point here. Yes, the benefit of the optimization would be smaller with closed functions, but would it be harmful to still apply it? One still avoids crossing function boundaries, without expanding the total size of the code. There could of course be bad cache effects in some cases, but this is the case for almost any optimization.

FWIW, I tried compiling typing/*.ml with and without the new optimization enabled (with ocamlc.opt) and could not observe any difference in compilation time.
It's subsumed by the latest version, which will turn the single call into a staticcatch just around the call (itself turned into a staticraise), and the simplify_exits pass will do the actual inlining. For instance, consider:

```ocaml
let foo i =
  let f x = x + 1 in
  let r = f (i * 2) in
  r * 2
```

We get the following versions of the lambda code: rawlambda: ...
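The actual lambda dumps are elided above; schematically (in hedged pseudo-lambda notation, not the compiler's real output), the two stages would look along these lines:

```
(* after the new pass: the single call becomes a staticraise,
   with a staticcatch inserted just around it *)
let r = (staticcatch (staticraise E (i * 2)) with E x -> x + 1) in
r * 2

(* after simplify_exits inlines the handler at the call site *)
let r = (let x = i * 2 in x + 1) in
r * 2
```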
With regard to what Mark says, one argument in favor of the proposed optimization is that, if I understand correctly, it is program-size-preserving¹. Personally I think that this optimization is rather well-behaved and it is the kind I don't find shocking to have early in the pipeline (as early as "static raises and catches" are available). Another similar optimization would be lowering try/raise into staticcatch/staticraise. This is all very similar in spirit to lowering references to mutable variables, which is already done in Simplif. Of course, all this is assuming that the PR can be implemented within a reasonably low complexity budget. My guess from preliminary reviewing is that it is doable.

¹: this is modulo Alain's last trick, which lifts a function declaration into the deepest possible subterm with the hope of increasing localization opportunities.
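The try/raise lowering mentioned in passing can be illustrated at the source level. In this hypothetical example the `Exit` raise and its handler live in the same function body, so the exception could in principle be lowered to a static jump, skipping the exception machinery entirely (the compiler does not necessarily do this today):

```ocaml
(* Local raise/handler pair: a candidate for staticraise/staticcatch lowering. *)
let safe_div a b =
  try
    if b = 0 then raise Exit
    else a / b
  with Exit -> 0
```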
The trouble is when the inlining heuristic is based on the size of the function's body (excluding inner functions), which is increased by the current proposal. But with such a heuristic, it's also the case that inlining itself could increase the size of some functions, making them ineligible for inlining. So I think this is a topic for the inliner to deal with, and hopefully flambda(2) has something clever to do here.
(FTR, a version of this was proposed in #638.)
The problem with optimisations based on the number/nature of uses is not only that they produce sudden changes, but that these changes are often "spooky action at a distance". You add a call to function A at position B in your code and an unrelated call to A far away at point C gets dramatically slower. The classic example of this is when adding some test code for a function dramatically slows down its use in production. The main way to mitigate against this is to limit the "distance" at which these changes can occur. For this reason I strongly suggest that you change this PR back to only working on local functions.

A related issue is that the number of uses can be affected by earlier optimisations. For example, you write:

```ocaml
let foo ... =
  let bar x = ... in
  bar x;
  ...
  if false then bar x
```

which allows the optimisation to trigger, but if you make a simple change to the second use:

```ocaml
let foo ... =
  let bar x = ... in
  bar x;
  ...
  let debug = false in
  if debug then bar x
```

then suddenly your program slows down. This is why I think it would be much better to base the analysis on the number of uses in the original source program rather than the number of uses at the point in the compilation pipeline where you happen to put the analysis. It seems like it wouldn't be too hard to do the analysis on the typedtree as it is converted into lambda code, which would then ensure that the optimisation was based on a simple, predictable syntactic criterion.
I'm not sure about "plenty". To my knowledge the reference optimisation is the only other usage-based optimisation we do. I'd love to base that on a reliable syntactic condition as well. Either way, already having some unpredictable optimisations is not a great argument for adding more of them.
They do, and flambda has strategies to counter them. This is an argument in favour of delaying the actual transformation until later in the compilation, so that flambda can manage this aspect of things. If you were to do the analysis earlier than ...

I have investigated the test failure with flambda. The failure comes from the first assertion. The difference between the two sides is that the right-hand side gets recognized as constant during earlier transformations. Note that the same issue appears if ...
At the level where the optimization operates, there is no obvious distinction between global and local functions. The first version already dealt with global functions. Adding complexity to a relatively straightforward optimization in order to reduce the number of cases where it applies seems a bit weird. In my opinion, what's important is not so much to ensure that the optimization does not apply "by chance", but to provide decent guarantees that some well-specified forms will be optimized. For instance, we don't need to advertise that global functions will be impacted, and if they are, the gain for them will usually be quite small (so the loss would be small as well if the optimization is broken by a second use). Honestly, I don't believe that flambda(2) will ever be able to achieve a state where the user can reasonably predict exactly which inlining will happen, purely following simple syntactic criteria. What is possible is that some specific code patterns (including with explicit attributes) can be documented to be handled efficiently, and this is already quite good. Concretely, there are two important cases where we want to guarantee that the current optimization applies:
Unfortunately, there is currently no "Writing Efficient Code with OCaml" manual that I could extend with such descriptions.
We used to have the unboxing strategy for boxed numbers (which was broken by any use in a non-unboxing context). The new heuristic is more robust in a sense, but adding one use of a variable can inform the compiler that a given variable is, say, a float, trigger its unboxing, and result in re-boxing at all accesses that require a boxed float, thus dramatically slowing down a tight loop. There are also optimizations in Simplif where (1) "static handlers" are simplified away based on the fact that they are used only once and (2) some local pure "alias" let-bindings are inlined when used only once. If we add the case of turning references into mutable variables, this pretty much covers all existing optimizations in Simplif -- i.e. they are all more or less based on tracking use sites. I think the current PR is pretty much in line with the existing approach; I understand it does not align with the goals of flambda(2), though, and I'm really not shocked if those roughly 100 lines of code are disabled when flambda(2) is in use.

@lthls Thanks for the analysis!
It's a pretty standard approach to improving predictability of a system -- we do it with type inference all the time.
Flambda doesn't rely on syntactic criteria, but all its optimisations are based on analysing the definitions of values, not their uses. This means that when your code slows down it is because you changed something about the code that is now slower, not some unrelated code. That property does not hold for this PR, which I think justifies making its behaviour as predictable as possible, especially if it can be made completely predictable with relatively little effort.
From this I gather that the optimisation does not apply to calls within other functions? If that's the case then this is indeed not too bad. Your description of it working on module-level functions sounded like any global function used only once -- even if within another module-level function -- would be inlined.
To be fair, those optimisations do not affect the amount of allocation in the program. They are not likely to have much effect on performance. They are also optimising constructs that were introduced by the compiler, not altering the behaviour of something specified by the user.
This remains to be proven, and I don't believe this is the case. To be clear, I don't intend to spend effort trying to do it.

To elaborate a bit on my position: the current implementation does two traversals of the Lambda structure, one to detect the optimization opportunities, another one to apply them. IIUC, you are suggesting to do the first pass on the Typedtree, or perhaps during the Typedtree->Lambda translation, and keep the second one as a Lambda->Lambda rewriting. But (1) analyzing the Typedtree is much more difficult, and (2) I don't see how to cleanly pass the information on "tail scope" from Typedtree to Lambda without changing the Lambda definition (e.g. by adding explicit scope identifiers). Anyway, splitting the optimization into two passes implemented on different ASTs is bound to make it harder to maintain, understand, and combine with possible other simplification passes applied at the Lambda level. For instance, one could very well imagine some other simple inlining heuristic at the lambda level, applied before the currently discussed optimization and interacting with it (by exposing more opportunities). In your view, this is bad since it reduces the "predictability" of the optimization, but I don't buy it. It seems weird to me to apply to the proposed optimization some "syntactic predictability" criteria which don't hold for most other current and planned optimizations. Moreover, your real concern seems to be specifically about the "usage" analysis, but implementing the analysis at the Typedtree level will not really change the situation: it could perhaps be more predictable, but it would be as easily broken by a "remote use". The alternative is either to completely disable the optimization, or to allow it to duplicate some code.

Now, if one wants to make the optimization more "predictable", one could add a dedicated attribute that would check that the optimization applies to a given function. Would that be enough to address part of your concern?

Yes, I think a ...

I've piggy-backed the existing ... The new ...

Having ... Any other ideas? Maybe ...
I still suspect this is easier than you think, but I think you can leave this part for this PR. If this problem comes up too much in practice then we can make it more predictable -- possibly alongside making the ref optimisation similarly predictable. As a side-note, one way to think about what worries me is to think in terms of code review. When reviewing changes to code whose performance is important you want to have a reasonable idea of what changes might affect that performance. In the general case usage-based analyses make this much harder. In this specific case I think the usages are local enough to make this not too bad. However, I would still rather have a clear specification that I can tell users so that they know exactly what they are looking for during review.
This at least allows you to prevent the problem from happening more than once to the same code, so it's definitely worth having. I think it should be a separate annotation from ... As for the particular choice, I'm not sure that ... The final issue I have with this PR is how it affects back-traces and other debug information. I think that you should do something similar to the body of the continuation to what is done to the bodies of inlined functions. This means going to each ...
How does that work for flambda? Do you expect users to predict which calls will be inlined? Even if the heuristic is not usage-based, a tiny change to a function definition, or its call site, can change the inlining decision, and any change to a performance-sensitive piece of code could thus have bad consequences. I don't know if this applies to flambda, but with Closure, even removing code can disable some inlining (by first allowing more inlining, which then makes some function's body too large to be inlined itself).
Ok. About the name: what really happens is that one tries to replace all calls to the function by gotos/jumps to its body (this works if all call sites share the same continuation). So perhaps an attribute name like
Ok, thanks, I will look at that.

Debuginfo is introduced later and is not available at the lambda level. I don't think it's currently possible to keep track of the rewriting and maintain stack traces without rather large changes to the Lambda definition. Anyway, since the continuation is shared, the stack trace would not make it possible to distinguish between multiple call sites, and I'm not sure that keeping the location of the following continuation would bring much information.
I expect people to know that changing a function they call in their hot loop might affect their application's performance. Currently, at that point they then need to re-benchmark things to check that it's all ok, but we have some prototype tooling that gives detailed output about the inlining decisions in the editor, which will help with this part too.
Ok. It's probably worth waiting for @mshinwell's upcoming changes before looking at this properly. Even ignoring the backtrace we're still going to need to do some processing here to make the scopes of variables in DWARF correct -- but obviously that requires us to actually have the code that handles scopes merged.
It's not so much about the additional information as about not confusing people when they look at the backtrace. Missing backtrace frames do surprise people and can mislead beginners.

I would like this PR to converge rather than be closed because @alainfrisch wants to cut his losses in terms of time invested. The current implementation passed my review and I believe it is reasonable. I'm proposing to push this through sooner rather than later; my goal is to have something merged that seems reasonable enough to everyone. Here are the main points of discussion. We need to either reach a consensus or make a decision.
A new optimization pass is added in Simplif to avoid the overhead of
allocating and calling local closures. Two cases are supported:
1. Local function used only once:
let f x y = ... in
...
.... f e1 e2 ...
In this case, the function body is inlined on the call site.
2. Local function only used in tail position:
let f x y = ... in
match ... with
| ... -> .... in f e1 e2
| ... -> if .... then f e3 e4 else ...
| ... -> ...
In this case, the function is turned into a "static exception":
staticcatch
match ... with
| ... -> .... in staticraise E(e1, e2)
  | ... -> if .... then staticraise E(e3, e4) else ...
| ... -> ...
with E(x, y) ->
...
Only the "tail position" optimization is kept.
d6bec1b to 98512cd
The benchmark is not very conclusive, so what shall we do? Merge in the current state, or disable it when flambda is enabled?

Given the benchmarks, I'm ok with merging as is.
Thanks!

Shouldn't the new attribute be documented in the manual?

Ah right, sorry; I will submit a further PR.

-> #2169
This is a revised version of MPR#6242.
[EDIT: modified description to match the latest version of the PR]
A new optimization pass is added in Simplif to avoid the overhead of allocating and calling local closures. Functions bound to an identifier are simplified away if all references to the identifier are full applications of the function (i.e. with the same number of arguments) and if all these calls are in the same "tail scope". In that case, a "staticcatch" handler is inserted around that scope and each call is turned into a "staticraise" jumping to that handler. A special case is a function used only once (for a full application), which obviously falls into the case above; in that case, an existing pass further simplifies the static exception by inlining the handler (i.e. the function's body) at the call site.
The case of local functions used in tail position corresponds to a very frequent use case where multiple branches (of pattern matching or conditionals) need to finish with the same computation. There is currently no very good way to express this pattern: (1) a local function allocates its closure (even when unused), calling the function has a cost, and it breaks optimizations restricted to a single function body (unboxing of boxed numbers, register allocation); (2) a `try...with` is costly (even with raise_notrace), breaks the tail context (i.e. can turn tail calls into non-tail ones) and is syntactically not very pleasant. It could be the case that some of these cases are or will be optimized with flambda (I did not check), but (1) this would not benefit the JavaScript backends (js_of_ocaml & BuckleScript) and (2) flambda is still not the default (and not very widely used, I think).
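A hypothetical example (invented names) of the frequent pattern described above: all calls to `finish` are full applications in the same tail scope, so the pass can turn `finish` into a staticcatch handler and each call into a staticraise, allocating no closure.

```ocaml
(* Several branches finish with the same computation via a local function. *)
let classify buf x =
  let finish tag v =
    Buffer.add_string buf tag;
    v * 2
  in
  match x with
  | 0 -> finish "zero" 1
  | n when n < 0 -> finish "neg" (-n)
  | n -> finish "pos" n
```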
Playing again the micro-benchmark from MPR6242, I get:
Default settings for ocamlopt (no flambda):
before: 4.0s
after: 3.3s (18% faster)
With -inline 100:
before: 4.0s
after: 1.5s (62% faster)
Another micro-benchmark:
Default settings for ocamlopt:
before: 5.9s
after: 4.2s (29% faster)
With -inline 100:
before: 5.9s
after: 2.4s (59% faster)