Conversation
|
This wouldn't be complete without a ridiculous micro-benchmark. So here we go:

```ocaml
let rec sum (acc : float) = function
  | hd :: tl -> sum (acc +. hd) tl
  | [] -> acc

let () =
  let n = ref 0. in
  for i = 1 to 10000000 do
    n := sum !n [1.; 2.; 100.; 5.; 100.; 500.]
  done;
  Printf.printf "%f\n%!" !n
```

This runs twice as fast with this PR. (Tested on amd64 with the msvc64 port.) |
|
We are actively working on full unboxing support in Flambda which should include the passing of unboxed arguments and the returning of unboxed results (indeed, multiple results, for unboxing tuples, variants, etc). "unboxed" may also include untagged integers, although we are not completely sure of the complexity/benefit ratio for those yet. I think some of the changes here are probably needed for that work, but will have to study them together with @chambart . By the way, isn't some of this already in GPR#1029? By the way, as for (iii), some work is in progress that should make Flambda classic mode significantly more attractive for people to switch to. |
|
We have effectively also been busy doing something similar for type propagation in #1029. There are just a few more cases left to handle: functions with GADTs and tupled functions. |
|
I forgot about #1029... It's indeed very close to the first two commits in this PR. I've used a different representation for storing the result type, but this is cosmetic. The algorithm to extract the types in transl_function is also a bit different (and better in #1029, I think). #1029 also makes use of the new type information at one place in simplif.ml, which is good. Cherry-picking those into this PR will be easy. It's good that most of the machinery here will be useful for the flambda work as well. My feeling is that we are still at least a few years away from flambda + flambda-cps + flambda-classic + the unboxing work being ready for prime time, and that in the meantime, including this rather simple PR could remove a serious annoyance soon. |
|
@mshinwell @chambart Is your latest work in the cps_types branch in @chambart's fork? Also, it'd be really nice to keep the community up to date about how things are going with your work, since we're all really interested. A thread on discuss.ocaml.org updated with a line or two once in a while would be much appreciated. |
|
@alainfrisch I have no objection to doing that part here for now. The main reason why I haven't already pushed for it in some form was the constraint not to risk introducing more boxing. I personally don't care that much about it, but Xavier previously expressed some reservations on that point. Doing the analysis to prevent it on clambda is quite tedious, and it is needed separately for flambda.

The plan for the flambda part is to make the unboxing decision entirely on the flambda side, both to avoid any guessing game in that analysis and to handle some more complex situations that arise with a CPS form. This is handled by adding primitives working on unboxed values (it is an annotation). The closure part uses the classic version, while flambda converts every primitive to a sequence of unbox, compute, box. It is not in a pull-requestable shape right now, but if you don't object to this in principle, I will push that part somewhere.

By the way, I have no objection to this as long as it doesn't complicate things in the long term, and that does not seem to be the case. You will just have to convince Xavier. |
asmcomp/clambda.ml
```diff
-  params : Ident.t list;
-  body : ulambda;
+  params : (Ident.t * value_kind) list;
+  body : (ulambda * value_kind);
```
Why do you put the return type here rather than in a 'return_type' field ?
This seemed coherent with the handling of parameters. The value_kind describes the type of values returned by the ulambda in body. This "forces" one to think about updating the result type if body is changed for whatever reason.
In #1029, you had to be careful for:
https://github.com/ocaml/ocaml/pull/1029/files#diff-704f66c0fa0fc9339230b39ce7d90919R464
While here, I have:
https://github.com/ocaml/ocaml/pull/1322/files#diff-704f66c0fa0fc9339230b39ce7d90919R468
("clearly", since the "body" of the result is the one from the inner function, the "value_kind" should also be taken from that function)
|
I don't remember @xavierleroy complaining about recent changes to the unboxing strategy (http://www.lexifi.com/blog/unboxed-floats-ocaml), even though they can definitely create more boxing in some cases, while also avoiding a lot of it in typical cases. I think that guaranteeing that some programming style (e.g. "only direct calls to known functions", "monomorphic code", etc.) doesn't require float boxing is more useful to programmers than trying to come up with clever but non-local and "discontinuous" heuristics that can easily break. A typical example is a for-loop that needs to Printf an accumulator every N steps for debugging purposes; it's better to box the accumulator when needed (even if this means potentially boxing the same value several times) than to always keep it in boxed form (i.e. box every new value assigned to it). The compiler cannot know which is better, since we don't do any profiling-based optimization. So, for let bindings the rule is now simple: if the bound value is known to be a float, the value is kept in unboxed form (and boxed when needed). The decision is purely local to the let binding itself. This PR implements a similar rule for function calls: if one calls a known function taking float arguments or returning a float, those floats are passed in unboxed form; this is purely local to the function call (once scoping and typing have been resolved). I believe that in the vast majority of cases this is much better than the current situation, but of course one can build a counter-example where it creates more boxing. |
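The trade-off described above can be made concrete with a small sketch (the function name is ours; the unboxing decision itself is made by the compiler and is invisible at the source level). The accumulator is known to be a float, so under the current let-binding rule it stays unboxed, and a boxed copy is made only at the occasional Printf call site:

```ocaml
(* Illustrative only: [sum_with_trace] is a hypothetical name. The
   accumulator [acc] is tracked as a float, so the compiler keeps it
   unboxed in the loop; only the Printf use site needs a boxed copy. *)
let sum_with_trace n =
  let acc = ref 0. in
  for i = 1 to n do
    acc := !acc +. float_of_int i;
    if i mod 1000 = 0 then
      (* debug output every 1000 steps forces boxing here, and only here *)
      Printf.printf "step %d: %f\n" i !acc
  done;
  !acc
```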
@xavierleroy Before I spent time polishing the proposal, can you comment on the general strategy? |
|
I see I'm cast as the Old Scrooge character in this play... Since I'm supposed to, let me ramble about two things:
So, globally, I need to be convinced: not by the proposed unboxing algorithm, which I haven't reviewed, but by its general applicability, especially in the presence of aggressive function inlining as in Flambda. |
I tried to gave you an opportunity not to play this role ("I don't remember @xavierleroy complaining")!
This can definitely happen, and it was also true for the recent changes to the strategy for let bindings. A different strategy can result in a much better allocation profile in some cases and a degraded one in others. My position is that what matters is (i) typical payloads; (ii) the ability to write efficient code, even if this requires manual tweaks to the code; (iii) simplicity of the strategy, so that programmers get a good intuition about performance. Your position seems to be that any change that can degrade some cases is bad in essence ("never"). In my opinion, restricting future improvements to those that can never produce more boxing than today constrains those improvements far too much, and I don't see why the current situation should be blessed as the reference point. Let's look at an example where the proposed strategy would create more boxing.
Possible improvements to the suggested strategy could be:
One can also think of more dynamic techniques. For instance, the calling convention could be to return both the unboxed form and some pointer, which would be either null or an actual boxed float for the result. Branches of the function that naturally compute a float in boxed form would return both the unboxed and the boxed form; branches that naturally compute a float in unboxed form would only return the unboxed one. The caller of the function can then either use the unboxed form directly, or box it if needed. The same could be done for arguments (the caller can optionally pass a boxed form, and if it doesn't, the function would memoize any boxing applied to the argument). I proposed applying such a technique when we discussed the changes for let-bindings. Such a technique guarantees that there is never more boxing than today, and never more boxing than in this PR. But it can definitely be slower, and I remember you (@xavierleroy) were strongly against such an approach.
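The dual-return convention sketched above can be modeled at the source level (the type and names below are ours; the real mechanism would live in the calling convention, not in user code):

```ocaml
(* A source-level model of the "unboxed value + optional boxed pointer"
   return convention. [dual_float] and the function names are hypothetical;
   the option plays the role of the possibly-null pointer. *)
type dual_float = { unboxed : float; boxed : float option }

(* A branch that naturally has a boxed float returns both forms. *)
let return_boxed v = { unboxed = v; boxed = Some v }

(* A branch that computes in unboxed form returns only that. *)
let return_unboxed a b = { unboxed = a +. b; boxed = None }

(* The caller uses the raw value directly, or boxes only when the
   callee did not already provide a boxed form. *)
let use_raw d = d.unboxed
let use_boxed d = match d.boxed with Some v -> v | None -> d.unboxed
```

As modeled, no value is ever boxed twice, at the price of a more expensive convention on every call, which matches the objection recalled above.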
Benefits are apparent as soon as the cost of boxing represents a good part of the execution time of the function. The size of the function (and thus its eligibility for inlining) is not directly linked to its execution time. For instance, the function can be a large pattern matching with only simple cases. It's not uncommon in numerical code to have multiple formulas to evaluate depending on the order of magnitude of the argument (hence a decision tree with rather simple formulas at the leaves). Even if we consider only functions without conditionals or pattern matching, one would need rather aggressive inlining to eliminate all function calls for which the cost of boxing represents, say, more than 10% of the function's total execution time. Such aggressive inlining necessarily results in much bigger code (which can also be bad for performance). Anyway, as a thought experiment: would you trust the inliner (current or future) to eliminate the overhead that would be added if we decided to add one allocation per function parameter on each function call? Say, how much slower would Coq become if we did that? That's the situation authors of numerical code face today (except -- I know -- that not all their arguments are floats). (But I agree that boxing is not super-slow. This is also why I'm not traumatized by the idea that the new strategy could in some cases create a bit more boxing than today.) On top of that, there is little hope of inlining arbitrary recursive functions. Even flambda configured in super-aggressive mode could not do much with, say, a function that traverses a binary tree of floats to compute the sum of its leaves (or, more realistically, one that evaluates an AST of float formulas). And as said, the flambda guys are working on something more ambitious than this PR, but (i) a lot of the machinery introduced here will be useful to them; (ii) it's not something that will be ready for widespread use in production code soon. |
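The recursive-traversal case mentioned above can be made concrete with a minimal sketch (the type and function names are ours):

```ocaml
(* Summing the leaves of a binary tree of floats. Every recursive call
   crosses a function boundary, so under the current convention each
   intermediate float result is boxed; no realistic amount of inlining
   can eliminate these calls. *)
type tree = Leaf of float | Node of tree * tree

let rec sum_leaves = function
  | Leaf x -> x
  | Node (l, r) -> sum_leaves l +. sum_leaves r
```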
|
Another small benchmark:

```ocaml
type t =
  | Acc
  | Var of int
  | Cst of float
  | Add of t * t
  | Mul of t * t

let rec eval acc env = function
  | Acc -> acc
  | Var i -> env.(i)
  | Cst x -> x
  | Add (e1, e2) -> eval acc env e1 +. eval acc env e2
  | Mul (e1, e2) -> eval acc env e1 *. eval acc env e2

let () =
  let n = ref 0. in
  let env = [| 10.; 20. |] in
  for i = 1 to 100000000 do
    n := eval !n env (Add (Acc, (Add (Add (Var 0, Mul (Var 0, Var 1)), Cst 3.))))
  done;
  Printf.printf "%f\n%!" !n
```

This one is about 25% faster on this PR compared to trunk (testing with -unsafe). |
|
Having been bitten a few times by the excessively naive approach to unboxing of lets that was recently introduced, I am quite concerned that a similarly naive approach to unboxing of function parameters will make things even worse. The change to lets caused programs which look like:

```ocaml
let g f = ()
  [@@inline never]

let f x y =
  let z = x +. y in
  for _ = 0 to 1000000 do
    g z
  done
```

to go from allocating once to allocating thousands of times. We've had this increase the allocation of some code noticeably -- in particular, turning on flambda makes a lot more code that looks like this appear (e.g. inlining

Personally, I would much rather see something like:

re-introduced for |
|
Well, if g is known to take a float, this example would benefit from this PR :)
For let bindings, this is doable: either easily but not very efficiently (compilation-time wise), or, if one wants to avoid multiple passes over the tree (and in particular avoid scanning the result of Cmmgen.transl after applying it to the body of the let binding in Cmmgen.transl_let), one would need to have

For functions, this would be more tricky, since one needs to know about the calling convention when generating the function call, and this can happen before translating the function's body. |
|
@lpw25 Your example would also be addressed by a criterion simpler to compute: do not unbox if all use sites require the boxed form. Would that address most cases for you? |
|
Btw, one could also argue that this case (for the let binding) could be addressed by some more generic CSE-like pass. |
I no longer have the actual examples to hand, but it would not surprise me if, in some of them, the body of the let required both the unboxed and boxed versions. |
|
I think where we really want to have unboxing (and currently don't) is on recursive functions, as Alain notes. This is where most computation work would happen, and yet we currently have to write these as loops if we want any chance of decent performance on boxed types. At the same time, I think it makes sense not to prematurely optimize. Any scheme that gets close to the 'ultimate analysis' (one which is dynamic) would require a lot of work, bump into the limitations of separate compilation/static analysis, make performance less predictable, and ultimately make Flambda's work more complicated. Perhaps all we want right now is the ability to switch calling conventions for basic types when requested by the user. If we could annotate a function argument with [@unbox_param] or some such thing, so long as the typechecker would carry it through to the type interface, we could already write recursive functions that didn't box/unbox on every iteration. This would give Flambda the tool it needs to do unboxing when it "inlines" recursive functions. |
I said that recursive functions are a lost cause for inlining, but avoiding boxing across function calls is useful more generally. It's not true that in typical numerical code, most computation happens in recursive functions. There are also a lot of medium-sized functions, which we don't necessarily want to inline. |
True, but if local analysis is going to potentially cause more boxing, forcing us to do more and more complex analysis, then let's wait to see what Flambda's analysis can do first and build on that. In the meantime, an annotation that allows the user to modify the calling convention would be extremely useful, both for recursive functions and larger ones, and support for this would also help Flambda. |
|
It would be straightforward to adapt this PR to only perform unboxing on arguments/results when explicitly requested through attributes, as in:

```ocaml
let f (x [@unboxed]) (y [@unboxed]) = ... [@unboxed]
```

(This would only need a change in Translcore.transl_function.) This would already allow the performance-minded numerical programmer to avoid boxing across function boundaries, at the cost of some verbosity. This could be a decent short-term solution. Another approach: one could also keep a description of the calling convention on the arrow type itself, and make arrow types with different conventions incompatible. For instance, |
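A fully worked-out version of the proposed annotation style might look as follows (the function is our own example; on a compiler without this PR the attribute has no effect beyond a possible "misplaced attribute" warning, so the code still compiles and runs with the usual boxed convention):

```ocaml
(* Hypothetical use of the proposed [@unboxed] attributes on each
   parameter and on the result expression: with this PR, direct calls to
   [dot2] would pass and return the floats unboxed. *)
let dot2 (x1 [@unboxed]) (y1 [@unboxed]) (x2 [@unboxed]) (y2 [@unboxed]) =
  ((x1 *. x2 +. y1 *. y2) [@unboxed])
```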
|
@alainfrisch there might be other benefits to having different arrows, like avoiding the caml_apply stub (if you are ready to annotate some boxing information, you are probably ready to lose currying), or making some reverse bindings easier (by fixing the ABI of the function). But that's clearly a matter for another PR. There is probably no drawback to having an annotation right now. It can be turned into a warning later if we choose to apply unboxing more aggressively. If you go for this, please make it a record containing parameter information. There is already the type and the unbox annotation, and I have a future PR to add inlining attributes on function arguments. |
So in this case, you'd use the same logic as in this PR, but only when requested? I think that makes sense. It would allow easy expansion into types other than float as well, since the cost is controlled by the user. BTW how do you prevent combinatorial blowup, say if there are 3 float unboxed arguments, do you need to generate 2^3 functions, one for each calling convention?
This is Haskell's approach. They have [unboxed primitive types](https://downloads.haskell.org/~ghc/7.2.1/docs/html/users_guide/primitives.html) which are not compatible with type variables (Int# etc.). They change the calling convention and also allow creating them inside data structures. Of course, it's all easier when you have type-based dispatch. |
No, there is one "unboxed" version, which unboxes all arguments/results that can be (or which would unbox those requested with the attributes), and one "generic" version using the standard calling convention (which could be turned into a wrapper around the unboxed version). |
|
Ok, I've added some explicit control. Internally, each function parameter and the result can be annotated with a boxing annotation. Only when the annotation is used, and some of the arguments or the result can indeed be unboxed, will the backend generate the unboxed version of the function and use it for direct calls. There is currently no check that unboxing does happen when requested. For now, the syntax only allows marking the function as a whole (all its arguments and the result are marked), through an

One could use an approach similar to warnings, i.e. allowing the mode to be controlled through more global attributes (useful for a float-intensive module whose author is aware of the new strategy). When the new attribute is not used, ocamlopt should generate the same code as before. The PR will also make it easy to pass more info on parameters from the typedtree down to lambda/clambda. |
… use them for direct calls.
For now, only an "ocaml.unbox" attribute is recognized on the whole function (it applies to all arguments and the result), but the plumbing is here to support per-argument control.
|
Rebased on trunk. First result on a non-toy benchmark, using a relatively naive implementation (but representative in style of real-world, not heavily optimized code) of a model calibration algorithm (https://github.com/OCamlPro/ocaml-benchs/tree/master/lexifi-g2pp). Tested with the msvc64 port. The gain in execution time (testing the version using automatic unboxing) is about 5% compared to trunk, and the total number of allocations is divided by 2.3. Some notes:
Being able to get a 5% speed up in such a case does not seem too bad. |
|
Another micro-benchmark:

```ocaml
let () =
  let s = ref 0. in
  let n = 100000000 in
  for j = 1 to n do
    s := !s +. Random.float 1.
  done;
  Printf.printf "%f\n%!" (!s /. float n)
```

This is a bit more than 20% faster on the branch compared to trunk. This illustrates that eliminating a few boxings (here, on the result of |
|
You should probably try those benchmarks with and without flambda, and possibly with a variety of inlining options, since inlining will decrease the effectiveness of the optimisation. |
|
Indeed, with a high enough "-inline" parameter, the allocation profile and timing become similar. The same would apply to any improvement/degradation of calling conventions (except for things that cannot be inlined, such as recursive functions). If OCaml passed all arguments on the stack instead of using registers, or if curried functions were compiled naively, would you argue that it's useless to improve that because an aggressive-enough inliner would often get rid of the overhead? Don't get me wrong: I think inlining is useful and great. But it's not a silver bullet for all performance issues related to function calls. We need both a good inliner that can be pushed to its limits for performance-critical sections, and decent calling conventions for the majority of the code base. I don't believe that setting very aggressive inlining everywhere in a project is a good solution anyway, considering the drawbacks in terms of compilation time and code size. |
I wasn't suggesting that inlining means you don't need this optimisation. I just meant that the argument for this optimisation is more persuasive if you have an idea of how much higher your example would need the inlining set before the optimisation no longer helped. (Obviously this is less interesting for micro-benchmarks -- but for actual programs I think it gives a good measure of how useful the optimisation is.) |
|
@alainfrisch Really cool work! I have friends who do machine learning and others who do complicated CPU-based rendering who could benefit from writing OCaml but basically can't because of performance issues related to boxed floats (and the lack of SIMD possibilities). They run away to Python with pytorch (for ML) or to Rust. I'm hoping that annotations like these will greatly help the very performance-oriented people, in parallel with flambda becoming better. Hope this can get merged <3 |
|
@bsansouci You want to watch Owl, which is a mix of pytorch and numpy, for this domain. It'll have much more impact than this PR. Owl could really use some help incorporating CUDA/OpenCL code so it can really compete with pytorch. |
|
I don't see this PR being merged, so closing it. |
This is a very early WIP, inspired from https://caml.inria.fr/mantis/view.php?id=5894 . For discussion only, for now.
Currently, ocamlopt will always use the boxed representation for floats (and boxed integers) across function calls. This can be a major cause of inefficiency for numerical code, and little can be done through manual optimization, except inlining everything and turning recursive functions into while loops.
This PR keeps track of an approximation of the type of function parameters and results, from the Typedtree and through lambda+clambda. When a function takes or returns a float, the compiler generates two Cmm functions: one with the usual calling convention, the other taking/returning unboxed floats. The unboxed version is used in case of a known direct call to the function (inserting boxing/unboxing at the call site, if needed).
With this approach, no boxing should be caused by crossing function boundaries when using first-order and monomorphic code (without higher-order functions). This would not avoid boxing for highly abstracted code, but at least this provides a way to write modular numerical code that doesn't allocate like crazy.
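The kind of code that benefits is plain first-order, monomorphic float code; a small sketch (the names below are ours):

```ocaml
(* Both calls below are direct calls to known functions, so with this PR
   the floats would cross both function boundaries unboxed; on trunk,
   each call boxes its float arguments and its result. *)
let norm2 x y = x *. x +. y *. y
let dist x y = sqrt (norm2 x y)
```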
Contrary to MPR#5894, no effort is made to predict whether using the unboxed calling convention will avoid any boxing or introduce more boxing. This is coherent with the current treatment of let-bindings (including references turned into mutable variables): whenever the bound value is tracked to be of type float, we use the unboxed binding scheme.
Currently, the function is compiled twice. We could avoid that by turning the generic function into a simple wrapper around the unboxed one. This would reduce code size. There would be no need to try to inline such a wrapper, since inlining could only happen when calling a known function, and in that case one would call the unboxed version directly. There would still be an extra function call, but I suspect this would not impact performance too much in practice, at least for performance-critical code (which would likely not use higher-order numerical functions anyway).
Currently this PR only deals with float, but extending the treatment to unboxed integers will be trivial.
I know that flambda guys had plans to deal with this topic at some point, but (i) I haven't heard about anything on this front recently; (ii) the work done here might be useful for flambda as well; (iii) not everyone is ready to jump to flambda; (iv) this PR is a rather low-hanging fruit.
TODO: