
Unboxed calling convention#1322

Closed
alainfrisch wants to merge 10 commits into ocaml:trunk from alainfrisch:unboxed_calling_convention

Conversation

@alainfrisch
Contributor

@alainfrisch alainfrisch commented Sep 6, 2017

This is a very early WIP, inspired from https://caml.inria.fr/mantis/view.php?id=5894 . For discussion only, for now.

Currently, ocamlopt will always use the boxed representation for floats (and boxed integers) across function calls. This can be a major cause of inefficiency for numerical code, and little can be done through manual optimization, except inlining everything and turning recursive functions into while loops.

This PR keeps track of an approximation of the type of function parameters and results, from the Typedtree and through lambda+clambda. When a function takes or returns a float, it will generate two cmm functions: one with the usual calling convention, the other taking/returning unboxed floats. The unboxed version is used in case of a known direct call to the function (inserting boxing/unboxing at the call site, if needed).

With this approach, no boxing should be caused by crossing function boundaries when using first-order and monomorphic code (without higher-order functions). This would not avoid boxing for highly abstracted code, but at least this provides a way to write modular numerical code that doesn't allocate like crazy.
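To make this concrete, here is a minimal hypothetical example (the function name and values are mine, not from the PR) of the first-order, monomorphic style this targets. On trunk, each direct call to `norm2` boxes both arguments and the result; under this PR, a known direct call would pass them unboxed:

```ocaml
(* Hypothetical illustration: a small monomorphic float function.
   On trunk, every call to [norm2] boxes x, y, and the result;
   with this PR, known direct calls would pass floats unboxed. *)
let norm2 (x : float) (y : float) = (x *. x) +. (y *. y)

let () =
  (* A known direct call: the unboxed calling convention applies. *)
  Printf.printf "%f\n" (norm2 3. 4.)
```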

Contrary to MPR#5894, no effort is made to predict whether using the unboxed calling convention will avoid any boxing or add more boxing. This is consistent with the current treatment of let-bindings (including references turned into mutable variables): whenever the bound value is known to be of type float, we use the unboxed binding scheme.

Currently, the function is compiled twice. We could avoid that by turning the generic function into a simple wrapper around the unboxed one, which would reduce code size. There would be no need to try to inline such a wrapper, since inlining could only happen when calling a known function, and in that case one would call the unboxed version directly. There would still be an extra function call, but I suspect this would not impact performance too much in practice, at least for performance-critical code (which would likely not use higher-order numerical functions anyway).

Currently this PR only deals with floats, but extending the treatment to boxed integers will be trivial.

I know that flambda guys had plans to deal with this topic at some point, but (i) I haven't heard about anything on this front recently; (ii) the work done here might be useful for flambda as well; (iii) not everyone is ready to jump to flambda; (iv) this PR is a rather low-hanging fruit.

TODO:

  • Extend to boxed integers.
  • Evaluate the cost of turning the "generic" function into a simple wrapper around the "unboxed" function (so as to contain the increase in code size).
  • More realistic benchmarks.
  • Non-regression tests.

@alainfrisch
Contributor Author

This wouldn't be complete without a ridiculous micro-benchmark. So here we go:

let rec sum (acc : float) = function
  | hd :: tl -> sum (acc +. hd) tl
  | [] -> acc

let () =
  let n = ref 0. in
  for i = 1 to 10000000 do
    n := sum !n [1.; 2.; 100.; 5.; 100.; 500.];
  done;
  Printf.printf "%f\n%!" !n

runs twice as fast with this PR. (Tested on amd64 with the msvc64 port.)

@mshinwell
Contributor

We are actively working on full unboxing support in Flambda which should include the passing of unboxed arguments and the returning of unboxed results (indeed, multiple results, for unboxing tuples, variants, etc). "unboxed" may also include untagged integers, although we are not completely sure of the complexity/benefit ratio for those yet.

I think some of the changes here are probably needed for that work, but will have to study them together with @chambart . By the way, isn't some of this already in GPR#1029?

By the way, as for (iii), some work is in progress that should make Flambda classic mode significantly more attractive for people to switch to.

@chambart
Contributor

chambart commented Sep 8, 2017

We have effectively also been busy doing similar type propagation work in #1029. There are just a few more cases left to handle: functions with GADTs and tupled functions.

@alainfrisch
Contributor Author

I forgot about #1029... It's indeed very close to the first two commits in this PR. I've used a different representation for storing the result type, but this is cosmetic. The algorithm to extract the types in transl_function is also a bit different (and better in #1029, I think). #1029 also makes use of the new type information in one place in simplif.ml, which is good. Cherry-picking those into this PR will be easy.

It's good that most of the machinery here will be useful for the flambda work as well. My feeling is that we are still at least a few years away from flambda + flambda-cps + flambda-classic + the unboxing work being ready for prime time, and that in the meantime, including this rather simple PR could remove a serious annoyance soon.

@bluddy
Contributor

bluddy commented Sep 8, 2017

@mshinwell @chambart Is your latest work in the cps_types branch in @chambart's fork?

Also, it'd be really nice to keep the community up to date about how things are going with your work, since we're all really interested. A thread on discuss.ocaml.org updated with a line or two once in a while would be much appreciated.

@chambart
Contributor

chambart commented Sep 8, 2017

@alainfrisch I have no objection to doing that part here for now. The main reason why I haven't already pushed for it in some form was the constraint not to risk introducing more boxing. I personally don't care that much about it, but Xavier has expressed some reservations about that previously. Doing the analysis to prevent it on clambda is quite tedious, and it is needed separately for flambda.

The plan for the flambda part of it is to make the unboxing decision entirely on the flambda side, both to avoid playing any guessing game in that analysis and to handle some more complex situations that arise with a cps form. This is handled by adding primitives working on unboxed values (it is an annotation). The closure part uses the classic version, while flambda converts every primitive to a sequence of unbox, compute, box. It is not in a pull-requestable shape right now, but if you don't object to this in principle, I will push that part somewhere.

By the way, I have no objection to this as long as it doesn't complicate things in the long term, and that does not seem to be the case. You will just have to convince Xavier.

params : Ident.t list;
body : ulambda;
params : (Ident.t * value_kind) list;
body : (ulambda * value_kind);
Contributor


Why do you put the return type here rather than in a 'return_type' field?

Contributor Author


This seemed consistent with the handling of parameters. The value_kind describes the type of values returned by the ulambda in body. This "forces" one to think about updating the result type if body is changed for whatever reason.

Contributor Author


In #1029, you had to be careful for:

https://github.com/ocaml/ocaml/pull/1029/files#diff-704f66c0fa0fc9339230b39ce7d90919R464

While here, I have:

https://github.com/ocaml/ocaml/pull/1322/files#diff-704f66c0fa0fc9339230b39ce7d90919R468

("clearly", since the "body" of the result is the one from the inner function, the "value_kind" should also be taken from that function)

@alainfrisch
Contributor Author

I don't remember @xavierleroy complaining about recent changes to the unboxing strategy (http://www.lexifi.com/blog/unboxed-floats-ocaml), even though they can definitely create more boxing in some cases, but also avoid many of them for typical cases.

I think that guaranteeing that some programming style (e.g. "only direct calls to known functions", "monomorphic code", etc.) doesn't require float boxing is more useful to programmers than trying to come up with clever but non-local and "discontinuous" heuristics that can easily break. A typical example is a for-loop that needs to Printf an accumulator every N steps for debugging purposes; it's better to box the accumulator when needed (even if this means potentially boxing the same value several times) than to keep it always in boxed form (i.e. box every new value for it). The compiler cannot know which is better, since we are not doing any profiling-based optimization.

So, for let-bindings the rule is now simple: if the bound value is known to be a float, the value will be kept in unboxed form (and boxed when needed). The decision is purely local to the let-binding itself. This PR implements a similar rule for function calls: if one calls a known function taking float arguments or returning a float, they will be passed in unboxed form; this is purely local to the function call (once scoping and typing have been resolved). I believe that in the vast majority of cases this is much better than the current situation, but of course one can build counter-examples where this creates more boxing.
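As a sketch of that let-binding rule (my own hypothetical example; the debug Printf plays the role of the "rare boxed use"):

```ocaml
(* [acc] is known to be a float, so under the current strategy it is
   kept unboxed across iterations; it is boxed only at the Printf
   call, i.e. only once every 1000 steps, rather than on every
   iteration. *)
let sum_with_trace (a : float array) =
  let acc = ref 0. in
  Array.iteri
    (fun i x ->
       acc := !acc +. x;
       if i mod 1000 = 0 then Printf.printf "step %d: %f\n" i !acc)
    a;
  !acc
```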

@alainfrisch
Contributor Author

You will just have to convince Xavier.

@xavierleroy Before I spent time polishing the proposal, can you comment on the general strategy?

@xavierleroy
Contributor

I see I'm cast as the Old Scrooge character in this play... Since I'm supposed to, let me ramble about two things:

  1. Unboxing is good, but reboxing because we've unboxed too much is bad and can cause ridiculous slowdowns. (I know because this is what killed my 1990's type-directed unboxing scheme, which was really good except for this little blemish.) Make sure this can never happen here.
  2. It looks like your unboxing scheme works only when calling known functions, and its benefits are apparent for small functions -- for bigger functions the overhead of boxing results and unboxing parameters is negligible. Yet, calls to known small functions are prime targets for inlining, in which case boxing and unboxing are already optimized by other means. So, is there really a benefit here? How often do we have non-inlinable function calls that benefit noticeably from unboxing of parameters and results?

So, globally I need to be convinced: not by the proposed unboxing algorithm, which I haven't reviewed, but by its general applicability, especially in the presence of aggressive function inlining like in FLambda.

@alainfrisch
Contributor Author

I see I'm cast as the Old Scrooge character in this play...

I tried to give you an opportunity not to play this role ("I don't remember @xavierleroy complaining")!

Unboxing is good, but reboxing because we've unboxed too much is bad and can cause ridiculous slowdowns. (I know because this is what killed my 1990's type-directed unboxing scheme, which was really good except for this little blemish.) Make sure this can never happen here.

This can definitely happen, and this was also true for recent changes to the strategy for let bindings.

A different strategy can result in a much better allocation profile in some cases, and a degraded one in other cases. My position is that what matters are (i) typical payloads; (ii) the ability to write efficient code even if this incurs manual tweaks to the code; (iii) simplicity of the strategy so that programmers get a good intuition about performance.

Your position seems to be that any change that can degrade some cases is bad in essence ("never"). In my opinion, restricting future improvements to those that can never produce more boxing than today constrains those future improvements way too much, and I don't see why the current situation should be blessed as the reference point.

Let's look at example where the proposed strategy would create more boxing.

  • First, this can happen when calling a known function that takes a float argument, where the function's body needs to access the boxed version of the float (to store it in a data structure, or to pass it to an unknown or polymorphic function). Two sub-cases:

    • If the caller produces the float in unboxed form (e.g. as the result of some operation, or by extracting it from an unboxed data structure): previously, the caller would box the float before passing it to the function; what we lose now is that the function might need to box its argument several times, whereas previously it was boxed at most once (by the caller). It could also happen that the function does not need its argument at all, in which case we avoid a useless boxing. Obviously, the good case is when the called function mostly needs only the unboxed form. For instance, if the function has some rare error cases that need the float in boxed form (as an argument to Printf or in an exception constructor), it would be bad to have to box the argument just to account for this rare case. But the compiler has no way to know that.

    • If the caller gets the float in boxed form (e.g. by extracting it from a boxed data structure), this is a bit worse, since we lose one boxing even if the function needs to access its argument in boxed form exactly once. (Still, other aspects might compensate for the degradation: for instance, passing in boxed form might force some (other) argument onto the stack, which also creates memory traffic.)

  • Second, extra boxing can happen when calling a known function that returns a float, if the function's body naturally creates the float in boxed form and the caller wants to get it in that form. Note that we get the same behavior today if we bind the result of the function with a let-binding (which would be unboxed).

Possible improvements to the suggested strategy could be:

  • For the first case: if we can detect that any execution of the function requires at least one access to the float in boxed form, it would make sense to disable the unboxing of the argument. The same could be applied to immutable let-binding (box, or avoid unboxing, at binding time if the body requires at least one access to the float in boxed form). However, this means analyzing the body before deciding on the unboxing strategy, which adds some complexity (and compilation-time cost).

  • Similarly, for the second case: if we detect that all branches of the function return a float which is naturally in boxed form, it would make sense to disable the unboxing of the result.

One can also think of more dynamic techniques. For instance, the calling convention could be to return both the unboxed form and a pointer, which could be either null or an actual boxed float for the result. Branches of the function that naturally compute a float in boxed form would return both the unboxed and boxed forms; branches that naturally compute a float in unboxed form would return only the unboxed one. The caller of the function can then either use the unboxed form directly, or box it if needed. The same could be done for arguments (the caller can optionally pass a boxed form, and if it doesn't, the function would memoize any boxing applied to the argument).

I proposed applying such a technique when we discussed the changes for let-bindings. Such a technique guarantees that there is never more boxing than today, and never more boxing than in this PR. But it can definitely be slower, and I remember you (@xavierleroy) were strongly against such an approach.

It looks like your unboxing scheme works only when calling known functions, and its benefits are apparent for small functions -- for bigger functions the overhead of boxing results and unboxing parameters is negligible. Yet, calls to known small functions are prime targets for inlining, in which case boxing and unboxing are already optimized by other means. So, is there really a benefit here?

Benefits are apparent as soon as the cost of boxing represents a significant part of the execution time of the function. The size of the function (and thus its eligibility for inlining) is not directly linked to its execution time. For instance, the function can be a large pattern matching with only simple cases. It's not uncommon in numerical code to have multiple formulas to evaluate depending on the order of magnitude of the argument (hence a decision tree with rather simple formulas at the leaves).

Even if we consider only functions without conditionals or pattern matching, one would need rather aggressive inlining to eliminate all function calls for which the cost of boxing represents, say, more than 10% of the function's total execution time. Such aggressive inlining necessarily results in much bigger code (which can also be bad for performance).

Anyway, as a thought experiment: would you trust the inliner (current or future) to eliminate the overhead that would be added if we decided to add one allocation per function parameter on each function call? Say, how much slower would Coq become if we did that? That's the situation authors of numerical code face today (except -- I know -- that not all their arguments are floats).

(But I agree that boxing is not super-slow. This is also why I'm not traumatized by the idea that the new strategy could in some cases create a bit more boxing than today.)

On top of that, there is little hope of inlining arbitrary recursive functions. Even flambda configured in super-aggressive mode could not do much with, say, a function that traverses a binary tree of floats to compute the sum of the leaves (or, more realistically, one that evaluates an AST of float formulas). And as said, the flambda guys are working on something more ambitious than this PR, but (i) a lot of the machinery introduced here will be useful for them; (ii) it's not something that will be ready for widespread use in production code soon.

@alainfrisch
Contributor Author

Another small benchmark:

type t =
  | Acc
  | Var of int
  | Cst of float
  | Add of t * t
  | Mul of t * t

let rec eval acc env = function
  | Acc -> acc
  | Var i -> env.(i)
  | Cst x -> x
  | Add (e1, e2) -> eval acc env e1 +. eval acc env e2
  | Mul (e1, e2) -> eval acc env e1 *. eval acc env e2

let () =
  let n = ref 0. in
  let env = [| 10.; 20. |] in
  for i = 1 to 100000000 do
    n := eval !n env (Add (Acc, (Add (Add (Var 0, Mul (Var 0, Var 1)), Cst 3.))))
  done;
  Printf.printf "%f\n%!" !n

This one is about 25% faster on this PR compared to trunk (testing with -unsafe).

@lpw25
Contributor

lpw25 commented Sep 11, 2017

Having been bitten a few times by the excessively naive approach to unboxing of lets that was recently introduced I am quite concerned that a similarly naive approach to unboxing of function parameters will make things even worse.

The change to lets caused programs which look like:

let g f = ()
[@@inline never]

let f x y =
  let z = x +. y in
  for _ = 0 to 1000000 do
    g z
  done

to go from allocating once to allocating thousands of times. We've had this increase the allocation of some code noticeably -- in particular, turning on flambda makes a lot more code that looks like this appear (e.g. inlining Array.iter) which then interacts with the naive unboxing and asymptotically increases the amount of allocation in the program.

Personally, I would much rather see something like:

For the first case: if we can detect that any execution of the function requires at least one access to the float in boxed form, it would make sense to disable the unboxing of the argument. The same could be applied to immutable let-binding (box, or avoid unboxing, at binding time if the body requires at least one access to the float in boxed form).

re-introduced for let and also used for this new optimisation.

@alainfrisch
Contributor Author

Well, if g is known to take a float, this example would benefit from this PR :)

I would much rather see something like:

For let-bindings, this is doable: either easily but not very efficiently (compilation-time wise), or, if one wants to avoid multiple passes over the tree (and in particular avoid scanning the result of Cmmgen.transl after applying it to the body of the let-binding in Cmmgen.transl_let), by having transl return a more complex result.

For functions, this would be trickier, since one needs to know the calling convention when generating a function call, and this can happen before the function's body has been translated.

@alainfrisch
Contributor Author

@lpw25 Your example would also be addressed by a criterion simpler to compute: do not unbox if all use sites require the boxed form. Would that address most cases for you?
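To illustrate the limit of this simpler criterion, here is a hypothetical variant of your example (names mine) where the body uses both forms, so the "all use sites boxed" test would still choose to unbox:

```ocaml
(* [z] has one use that needs the boxed form (passed to the
   polymorphic [g]) and one that needs the unboxed form (the float
   arithmetic), so the criterion "unbox unless ALL use sites require
   the boxed form" would keep [z] unboxed and re-box it at each call
   to [g]. *)
let g (_ : 'a) = () [@@inline never]

let f x y =
  let z = x +. y in
  if z > 0. then g z;  (* boxed use, on one path *)
  z *. 2.              (* unboxed use *)
```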

@alainfrisch
Contributor Author

Btw, one could also argue that this case (for the let binding) could be addressed by some more generic CSE-like pass.

@lpw25
Contributor

lpw25 commented Sep 11, 2017

Would that address most cases for you?

I no longer have the actual examples to hand, but it would not surprise me if, in some of them, the body of the let required both the unboxed and boxed versions.

@bluddy
Contributor

bluddy commented Sep 11, 2017

I think where we really want to have unboxing (and currently don't) is on recursive functions, as Alain notes. This is where most computation work would happen, and yet we currently have to write these as loops if we want any chance of decent performance on boxed types.

At the same time, I think it makes sense not to prematurely optimize. Any scheme that gets close to the 'ultimate analysis' (one which is dynamic) would require a lot of work, bump into the limitations of separate compilation/static analysis, make performance less predictable, and ultimately make Flambda's work more complicated.

Perhaps all we want right now is the ability to switch calling conventions for basic types when requested by the user. If we could annotate a function argument with [@unbox_param] or some such thing, so long as the typechecker would carry it through to the type interface, we could already write recursive functions that didn't box/unbox on every iteration. This would give Flambda the tool it needs to do unboxing when it "inlines" recursive functions.

@alainfrisch
Contributor Author

I think where we really want to have unboxing (and currently don't) is on recursive functions, as Alain notes.

I said that recursive functions are a lost cause for inlining, but avoiding boxing across function calls is useful more generally. It's not true that in typical numerical code, most computation happens in recursive functions. There are also a lot of medium-sized functions, which we don't necessarily want to inline.

@bluddy
Contributor

bluddy commented Sep 11, 2017

I said that recursive functions are a lost cause for inlining, but avoiding boxing across function calls is useful more generally. It's not true that in typical numerical code, most computation happens in recursive functions. There are also a lot of medium-sized functions, which we don't necessarily want to inline.

True, but if local analysis is going to potentially cause more boxing, forcing us to do more and more complex analysis, then let's wait to see what Flambda's analysis can do first and build on that. In the meantime, an annotation that allows the user to modify the calling convention would be extremely useful, both for recursive functions and larger ones, and support for this would also help Flambda.

@alainfrisch
Contributor Author

It would be straightforward to adapt this PR to only perform unboxing on arguments/results when explicitly requested through attributes, as in:

  let f (x[@unboxed]) (y[@unboxed]) = ... [@unboxed]

(This would only need a change in Translcore.transl_function.)

This would already allow the performance-minded numerical programmer to avoid boxing across function boundaries, at the cost of some verbosity. This could be a decent short-term solution.

Another approach: one could also keep a description of the calling convention on the arrow type itself, and make arrow types with different convention incompatible. For instance, float[@unboxed] -> float[@unboxed] would not be an instance of 'a -> 'a. Only explicit annotations would introduce such function types with non-generic calling conventions. The advantage is that one could then use the unboxed calling convention also for unknown function calls (for instance, a function could take an argument of type float[@unboxed] -> float[@unboxed] and apply it without having to box the argument). Of course, this would be much more invasive in the type system.
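A sketch of how such annotated arrow types might look (purely hypothetical syntax — a stock compiler parses these attributes but ignores them, possibly with an "unused attribute" warning, and gives them no calling-convention meaning):

```ocaml
(* Hypothetical: under the proposed scheme, the annotated arrow type
   would fix an unboxed calling convention, so [apply_twice] could
   call [f] on unboxed floats even though [f] is an unknown function
   here. Today the attributes have no effect. *)
let apply_twice (f : (float[@unboxed]) -> (float[@unboxed])) x =
  f (f x)

let () = Printf.printf "%f\n" (apply_twice (fun x -> x +. 1.) 0.)
```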

@chambart
Contributor

@alainfrisch there might be other benefits to having different arrow types, like avoiding the caml_apply stub (if you are ready to annotate some boxing information, you are probably ready to lose currying), or making some reverse bindings easier (by fixing the ABI of the function). But that's clearly a matter for another PR.

There is probably no drawback to having an annotation right now. It can be turned into a warning later if we choose to apply unboxing more aggressively. If you go for this, please make it a record containing per-parameter information: there is already the type and the unbox annotation, and I have a future PR that adds inlining attributes on function arguments.

@bluddy
Contributor

bluddy commented Sep 12, 2017

It would be straightforward to adapt this PR to only perform unboxing on arguments/results when explicitly requested through attributes

So in this case, you'd use the same logic as in this PR, but only when requested? I think that makes sense. It would allow easy expansion into types other than float as well, since the cost is controlled by the user.

BTW how do you prevent combinatorial blowup, say if there are 3 float unboxed arguments, do you need to generate 2^3 functions, one for each calling convention?

Another approach: one could also keep a description of the calling convention on the arrow type itself, and make arrow types with different convention incompatible.

This is Haskell's approach. They have [unboxed primitive types](https://downloads.haskell.org/~ghc/7.2.1/docs/html/users_guide/primitives.html) which are not compatible with type variables (Int# etc.). They change the calling convention and also allow creating them inside data structures. Of course, it's all easier when you have type-based dispatch.

@alainfrisch
Contributor Author

BTW how do you prevent combinatorial blowup, say if there are 3 float unboxed arguments, do you need to generate 2^3 functions, one for each calling convention?

No, there is one "unboxed" version, which unboxes all arguments/results that can be (or, with the attributes, only those requested), and one "generic" version using the standard calling convention (which could be turned into a wrapper around the unboxed version).

@alainfrisch alainfrisch force-pushed the unboxed_calling_convention branch from 5a94e5f to 1b93288 Compare September 15, 2017 12:12
@alainfrisch
Contributor Author

Ok, I've added some explicit control. Internally, each function parameter and the result can carry a boxing annotation. Only when the annotation is used, and some of the arguments or the result can indeed be unboxed, will the backend generate the unboxed version of the function and use it for direct calls. There is currently no check that unboxing indeed happens when requested.

For now, the syntax only allows marking a whole function (all its arguments and the result are marked), through an [@unboxed] attribute on the function expression ((fun x y -> ...)[@unboxed]) or an [@@unboxed] attribute on a function binding (let f x y = ... [@@unboxed]). It would be easy to also recognize attributes e.g. on the argument patterns or on the body expression.
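For illustration, a hypothetical annotated function in that style (the function and values are mine; on a stock compiler the attribute is parsed but ignored, typically with an "unused attribute" warning, and has no effect on the calling convention):

```ocaml
(* Hypothetical: [@@unboxed] on the binding would mark all parameters
   and the result for the unboxed calling convention under this PR.
   Today the attribute is inert. *)
let dot3 ax ay az bx by bz =
  (ax *. bx) +. (ay *. by) +. (az *. bz)
  [@@unboxed]

let () = Printf.printf "%f\n" (dot3 1. 2. 3. 4. 5. 6.)
```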

One could use an approach similar to warnings, i.e. allow controlling the mode through more global attributes (useful for a float-intensive module whose author is aware of the new strategy).

When the new attribute is not used, ocamlopt should generate the same code as before.

The PR will also make it easy to pass more info on parameters from the typedtree down to lambda/clambda.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Sep 25, 2017
@alainfrisch alainfrisch force-pushed the unboxed_calling_convention branch from 1b93288 to 13e8bd7 Compare November 17, 2017 08:17
@alainfrisch
Contributor Author

alainfrisch commented Nov 17, 2017

Rebased on trunk.

First results on a non-toy benchmark, using a relatively naive implementation (but representative in style of real-world, not heavily optimized code) of a model calibration algorithm (https://github.com/OCamlPro/ocaml-benchs/tree/master/lexifi-g2pp). Tested with the msvc64 port. The gain in execution time (with the version using automatic unboxing) is about 5% compared to trunk, and the total number of allocations is divided by 2.3. Some notes:

  • This benchmark is known to have a bottleneck (> 50% of total) in the calculation of exponentials. (Btw, trying with the mingw64 port is about 3 times slower, perhaps because of a much worse implementation of exp?)

  • The code does not allocate like crazy anyway. The benchmark takes about 10 seconds to run and triggers only 9 major collections and 5679 minor ones on trunk (4 major, 2433 minor ones on the PR branch).

Being able to get a 5% speed up in such a case does not seem too bad.

@alainfrisch
Contributor Author

Another micro-benchmark:

let () =
  let s = ref 0. in
  let n = 100000000 in
  for j = 1 to n do
    s := !s +. Random.float 1.
  done;
  Printf.printf "%f\n%!" (!s /. float n)

This is a bit more than 20% faster on the branch compared to trunk. This illustrates that eliminating a few boxings (here, on the result of Random.float) is not negligible even compared to non-trivial operations (Random.float draws two random 30-bit integers, does some memory reads and writes, converts two integers to floats, and performs 2 float divisions, 1 float multiplication, and 1 float addition).

@lpw25
Contributor

lpw25 commented Nov 17, 2017

You should probably try those benchmarks with and without flambda, and possibly with a variety of inlining options, since inlining will decrease the effectiveness of the optimisation.

@alainfrisch
Contributor Author

Indeed, with a high enough "-inline" parameter, the allocation profile and timings become similar. The same would apply to any improvement/degradation of the calling conventions (except for things that cannot be inlined, such as recursive functions). If OCaml passed all arguments on the stack instead of using registers, or if curried functions were compiled naively, would you argue that it's useless to improve that because an aggressive-enough inliner would often get rid of the overhead?

Don't get me wrong: I think inlining is useful and great. But it's not a silver bullet for all performance issues related to function calls. We need both a good inliner that can be pushed to its limits for performance-critical sections, and decent calling conventions for the majority of the code base. I don't believe that setting very aggressive inlining everywhere in a project is a good solution anyway, considering the drawbacks in terms of compilation time and code size.

@lpw25
Contributor

lpw25 commented Nov 17, 2017

I don't believe that setting very aggressive inlining everywhere in a project is a good solution anyway, considering the drawbacks in terms of compilation time and code size.

I wasn't suggesting that inlining means you don't need this optimisation. I just meant that the argument for this optimisation is more persuasive if you have an idea of how much higher your example would need the inlining setting before the optimisation no longer helped. (Obviously this is less interesting for micro-benchmarks -- but for actual programs I think it gives a good measure of how useful the optimisation is.)

@xavierleroy xavierleroy removed this from the consider-for-4.07 milestone Mar 2, 2018
@bsansouci

bsansouci commented Jun 18, 2018

@alainfrisch Really cool work! I have friends who do machine learning and others who do complicated CPU-based rendering who could benefit from writing OCaml but basically can't because of performance issues related to boxed floats (and the lack of SIMD possibilities). They run away to Python with pytorch (for ML) or to Rust. I'm hoping that annotations like these would greatly help the very performance-oriented people, in parallel with flambda becoming better.

Hope this can get merged <3

@bluddy
Contributor

bluddy commented Jun 19, 2018

@bsansouci You want to watch Owl, which is a mix of pytorch and numpy, for this domain. It'll have much more impact than this PR. Owl could really use some help incorporating CUDA/OpenCL code so it can really compete with pytorch.

@alainfrisch
Contributor Author

I don't see this PR being merged, so closing it.
