A self-describing representation for function closures#9619
A self-describing representation for function closures#9619xavierleroy merged 8 commits intoocaml:trunkfrom
Conversation
This commit introduces a quantity Lambda.max_arity that is the maximal number of parameters that a Lambda function can have. Uncurrying is throttled so that, for example, assuming the limit is 10, a 15-argument curried function fun x1 ... x15 -> e becomes a 10-argument function (x1...x10) that returns a 5-argument function (x11...x15). Concerning untupling, a function that takes a N-tuple of arguments, where N is above the limit, remains a function that takes a single argument that is a tuple. Currently, max_arity is set to 126 in native-code, to match the new representation of closures proposed in ocaml#9619. A signed 8-bit field is used to store the arity. 126 instead of 127 to account for the extra "environment" argument. In bytecode the limit is infinity (max_int) because there are no needs yet for a limited number of parameters.
Could you mention where the page table is still needed in no-naked-pointer mode? |
The things you make me do... I checked the remaining calls to
The remaining occurrences of |
fafdc23 to
71e4f00
Compare
mshinwell
left a comment
There was a problem hiding this comment.
Very good. I checked all of the arithmetic and couldn't find any mistakes. I left a couple of minor comments.
Regarding removal of the page table entirely, I don't think we can do this on single core OCaml, because of the presence of the compactor.
|
I believe this is ready to merge once the CI passes. Thanks. |
damiendoligez
left a comment
There was a problem hiding this comment.
I have reviewed this PR and everything looks good. Note that there is another place where we examine pointers and could avoid looking at the code pointers of closures: the compactor. But we still need the page table there anyway.
@xavierleroy you'll have to rebase to get CI to run.
|
The rebase is a pain because of the bootstrap in the middle. I'm running a round of CI precheck right now, while I figure out what to do. |
Add a "closure information" field after each code pointer in a closure. This field generalizes the "arity" field used by the native-code compiler, in that it has room both for an arity (always 0 in bytecode) and for the distance from the closure to the first environment variable in the closure block. This makes closures "self-described" and easy to scan for pointers: everything up to the first env var is out-of-heap code pointers or integers; everything after the first env var is a well-formed value. At this point a bootstrap is needed.
ocamlc's computations of offsets in closures changed to match changes in the abstract machine.
In code that builds closures, instead of the old arity field, produce a closure information field encoding arity + position of environment.
Expected outputs contain integer values for the "closure info" field of some closures. These values differ in 32 and 64 bits, since the arity is stored in top 8 bits. This test would need different expected outputs for 32- and 64-bit platforms. To keep things simple, this commit restricts the test to only run on 64-bit platforms. Since this changes the locations
Here we start reaping the benefits of the new closure representation. The fields of a closure block that contain the code pointers need not be scanned (in general) and must not be scanned (in no-naked-pointers mode). Here, conservatively, we skip them in no-naked-pointers mode only, but it would be sound to skip them unconditionally.
Atoms are zero-sized blocks allocated outside the heap. It simplifies the GC in no-naked-pointers mode if their headers have GC color Black, meaning "don't traverse". In ocamlopt, Black is already used as the color for constant blocks statically allocated outside the heap.
Now that atoms have black headers, all zero-sized blocks (atoms or ocamlopt-generated static data) have black headers and will not be traversed or changed by the major GC.
042824d to
59da229
Compare
|
CI precheck: OK. Rebase from hell: done. Waiting for Travis&Appveyor. |
|
Kicking Travis. |
|
Travis is not run, and Appveyor is pending as usual, so I'll just merge and take a well-deserved week-end. |
This commit introduces a quantity Lambda.max_arity that is the maximal number of parameters that a Lambda function can have. Uncurrying is throttled so that, for example, assuming the limit is 10, a 15-argument curried function fun x1 ... x15 -> e becomes a 10-argument function (x1...x10) that returns a 5-argument function (x11...x15). Concerning untupling, a function that takes a N-tuple of arguments, where N is above the limit, remains a function that takes a single argument that is a tuple. Currently, max_arity is set to 126 in native-code, to match the new representation of closures proposed in ocaml#9619. A signed 8-bit field is used to store the arity. 126 instead of 127 to account for the extra "environment" argument. In bytecode the limit is infinity (max_int) because there are no needs yet for a limited number of parameters.
…9620) This commit introduces a quantity Lambda.max_arity that is the maximal number of parameters that a Lambda function can have. Uncurrying is throttled so that, for example, assuming the limit is 10, a 15-argument curried function fun x1 ... x15 -> e becomes a 10-argument function (x1...x10) that returns a 5-argument function (x11...x15). Concerning untupling, a function that takes a N-tuple of arguments, where N is above the limit, remains a function that takes a single argument that is a tuple. Currently, max_arity is set to 126 in native-code, to match the new representation of closures implemented by #9619. A signed 8-bit field is used to store the arity. 126 instead of 127 to account for the extra "environment" argument. In bytecode the limit is infinity (max_int) because there are no needs yet for a limit on the number of parameters.
|
I was also caught up by the bug found in #9681, which led me to do a It turns out that the closinfo field is initialized by |
|
Re |
|
@xavierleroy Do we already have a documentation entry for new closure representation? I'm not interested in the documentation per se, actually I want to try to implement memoization/hashconsing of partially applied high-order functions. It seems that your changes are directly related to un-(partial applying) because
Last time when I tried to dig into it, I realized that the representation changes when |
For the moment, there's the description at the top of this PR #9619 and you can see examples of how to traverse closures here for example: Lines 264 to 283 in 302d735
Yes and no. Yes insofar as partial applications produce "interesting" closures, with non-trivial environments. No because the change of representation applies to all closures, not just those arising out of partial applications. |
(This PR is a resurrect of #1156 following the suggestions made in comment #1156 (review).)
The context
Today, a closure for a function or a set of mutually-recursive functions is a heap block with the following structure:
The values are the environment part of the closure: the values of the free variables. Each entrypoint is a single code pointer in bytecode, and a 2- or 3- word record in native code:
Note that the closure really consists in two consecutive parts:
The problem
Given a closure block at run-time, there is no way to determine which well-formed OCaml values it contains. More precisely, there is no way to know where the entrypoints end and where the environment starts, since the first value of the environment can be a tagged integer that looks like an infix header.
This is problematic for the GC in particular, which needs to follow OCaml values that are pointers into the OCaml heap, but must not follow code pointers. Today, we use a page table to distinguish the two kinds of pointers. But we want to get rid of the page table (the "no naked pointers" mode), for better performance and for Multicore OCaml compatibility.
The solution
The solution implemented in this PR adds to each "entrypoint" record the position of the environment, i.e. the field number for the first value of the value sequence that represents the environment.
In native code we just steal some bits from the arity field:
(#9620 ensures that ocamlopt never generates functions with arities that would not fit in 8 bits.)
In bytecode, we need to add a closure-info word to each entrypoint, making closures a bit larger:
(The '0' value for the arity field is arbitrary, as the abstract machine does not need to know the arity of a function from its closure.)
This way, given a closure value
v, the field number where the environment starts is the "start-of-environment" bits ofField(v, 1).Example: native-code, two mutually recursive functions of arities 2 and 1, two free variables:
The pull request
It is best reviewed commit by commit:
FAQ
To speed up reviewing, here are some questions and answers.
Q: What are the differences with #1156 ?
A: #1156 used a single bit in the arity words to mark the last entrypoint in the sequence of entrypoints. Consequently, jumping to the environment part required scanning the entry points one after the other. This PR turns this into a constant-time operation. (As suggested by @chambart in #1156 (review) .)
Q: What are the differences with #8984 ?
A: #8984 explores another closure representation strategy, where each function has its own heap block, containing only one entry point. This is a much bigger change in the compilers and the run-time system, with potential for bigger simplifications (getting rid of infix pointers entirely), but the various representation choices still need to be explored and benchmarked, so it will not be ready immediately. The present PR is the quick, minimal change that suffices for Multicore OCaml.
Q: What is the impact on execution time (GC not included)?
A: In native code, negligible: one extra shift instruction when applying an unknown function to 2 or more arguments. In bytecode, closures become bigger and may take a little longer to create.
Q: What is the impact on GC time?
A: Currently, nothing changes in naked-pointers mode. More generally, skipping over the entrypoints part of closures rather than scanning them might give a small speedup for the major GC.
Q: Why not steal bits in infix headers to store the "start of environment" info?
A: There is room there, but for one function we don't have an infix header. (N-1 infix headers for a N-function closure.)
Q: Was it absolutely necessary to increase the size of bytecode closures?
A: I could not find a nice solution that has only one code pointer per entry point and no closure info field. Plus, having similar closure representations in bytecode and in native-code might simplify other things in the future.
Q: Why only 8 bits for the arity?
A: So that it leaves enough bits for the "start of environment" field to represent any valid position in any heap block, even in 32 bit machines. Plus, it's not a limitation in practice.
Q: But what happens if I have a curried function of 300 arguments?
A: Right now, an assertion failure at compile-time. With #9620, your function will be compiled to a 126-argument function returning a 126-argument function returning a 48-argument function.
Q: Why the 8 most significant bits for the arity and not bits 1 to 8?
A: The arity is a signed integer (negative arity = tupled function), so a single arithmetic shift right extracts the arity with the correct sign.
Q: How far are we from a "no-naked-pointer" mode that does not need the page table at all?
A: In native code I think we're almost there. In bytecode, some more work remains (code pointers are used as naked-pointer values in many places).