Skip to content

A self-describing representation for function closures#9619

Merged
xavierleroy merged 8 commits intoocaml:trunkfrom
xavierleroy:new-closure-repr
Jun 5, 2020
Merged

A self-describing representation for function closures#9619
xavierleroy merged 8 commits intoocaml:trunkfrom
xavierleroy:new-closure-repr

Conversation

@xavierleroy
Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy commented May 29, 2020

(This PR is a resurrect of #1156 following the suggestions made in comment #1156 (review).)

The context

Today, a closure for a function or a set of mutually-recursive functions is a heap block with the following structure:

closure ::= entrypoint (infix-header entrypoint)* value*

The values are the environment part of the closure: the values of the free variables. Each entrypoint is a single code pointer in bytecode, and a 2- or 3- word record in native code:

entrypoint ::= code-pointer                         (in bytecode)

entrypoint ::= code-pointer  arity=1                (in native code)
             | code-pointer  arity<>1  code-pointer

Note that the closure really consists in two consecutive parts:

  • the entrypoints part, comprising code pointers and tagged integers (infix headers look like tagged integers)
  • the environment part, comprising well-formed OCaml values.

The problem

Given a closure block at run-time, there is no way to determine which well-formed OCaml values it contains. More precisely, there is no way to know where the entrypoints end and where the environment starts, since the first value of the environment can be a tagged integer that looks like an infix header.

This is problematic for the GC in particular, which needs to follow OCaml values that are pointers into the OCaml heap, but must not follow code pointers. Today, we use a page table to distinguish the two kinds of pointers. But we want to get rid of the page table (the "no naked pointers" mode), for better performance and for Multicore OCaml compatibility.

The solution

The solution implemented in this PR adds to each "entrypoint" record the position of the environment, i.e. the field number for the first value of the value sequence that represents the environment.

In native code we just steal some bits from the arity field:

entrypoint ::= code-pointer closure-info                (with arity = 1)
             | code-pointer closure-info code-pointer

closure-info ::=  arity (8 bits) . start-of-environment (wordsize - 9 bits) . 1

(#9620 ensures that ocamlopt never generates functions with arities that would not fit in 8 bits.)

In bytecode, we need to add a closure-info word to each entrypoint, making closures a bit larger:

entrypoint ::= code-pointer closure-info                (with arity = 0)

(The '0' value for the arity field is arbitrary, as the abstract machine does not need to know the arity of a function from its closure.)

This way, given a closure value v, the field number where the environment starts is the "start-of-environment" bits of Field(v, 1).

Example: native-code, two mutually recursive functions of arities 2 and 1, two free variables:

code1  {arity=2,start-env=6}  code1'  infix-hdr  code2  {arity=1,start_env=2}  freevar1  freevar2

The pull request

It is best reviewed commit by commit:

  • 9f59d7f changes the representation of closures in the bytecode system and 8e97e63 does the same for the native-code compiler.
  • cd2ed2b shows how the GC can use the new "start of environment" information to skip over the entrypoints parts of closures. Currently it's done only in no-naked-pointers mode, but would work in naked-pointers mode too.
  • 3d1a411 and 95bd982 further simplify the GC in no-naked-pointers mode.
  • 14c96d1 tweaks a silly test and 7fbb556 is the obligatory bootstrap when bytecode compilation changes.

FAQ

To speed up reviewing, here are some questions and answers.

Q: What are the differences with #1156 ?

A: #1156 used a single bit in the arity words to mark the last entrypoint in the sequence of entrypoints. Consequently, jumping to the environment part required scanning the entry points one after the other. This PR turns this into a constant-time operation. (As suggested by @chambart in #1156 (review) .)

Q: What are the differences with #8984 ?

A: #8984 explores another closure representation strategy, where each function has its own heap block, containing only one entry point. This is a much bigger change in the compilers and the run-time system, with potential for bigger simplifications (getting rid of infix pointers entirely), but the various representation choices still need to be explored and benchmarked, so it will not be ready immediately. The present PR is the quick, minimal change that suffices for Multicore OCaml.

Q: What is the impact on execution time (GC not included)?

A: In native code, negligible: one extra shift instruction when applying an unknown function to 2 or more arguments. In bytecode, closures become bigger and may take a little longer to create.

Q: What is the impact on GC time?

A: Currently, nothing changes in naked-pointers mode. More generally, skipping over the entrypoints part of closures rather than scanning them might give a small speedup for the major GC.

Q: Why not steal bits in infix headers to store the "start of environment" info?

A: There is room there, but for one function we don't have an infix header. (N-1 infix headers for a N-function closure.)

Q: Was it absolutely necessary to increase the size of bytecode closures?

A: I could not find a nice solution that has only one code pointer per entry point and no closure info field. Plus, having similar closure representations in bytecode and in native-code might simplify other things in the future.

Q: Why only 8 bits for the arity?

A: So that it leaves enough bits for the "start of environment" field to represent any valid position in any heap block, even in 32 bit machines. Plus, it's not a limitation in practice.

Q: But what happens if I have a curried function of 300 arguments?

A: Right now, an assertion failure at compile-time. With #9620, your function will be compiled to a 126-argument function returning a 126-argument function returning a 48-argument function.

Q: Why the 8 most significant bits for the arity and not bits 1 to 8?

A: The arity is a signed integer (negative arity = tupled function), so a single arithmetic shift right extracts the arity with the correct sign.

Q: How far are we from a "no-naked-pointer" mode that does not need the page table at all?

A: In native code I think we're almost there. In bytecode, some more work remains (code pointers are used as naked-pointer values in many places).

xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request May 29, 2020
This commit introduces a quantity Lambda.max_arity that is the maximal
number of parameters that a Lambda function can have.

Uncurrying is throttled so that, for example, assuming the limit is 10,
a 15-argument curried function fun x1 ... x15 -> e
becomes a 10-argument function (x1...x10) that returns a 5-argument
function (x11...x15).

Concerning untupling, a function that takes a N-tuple of arguments,
where N is above the limit, remains a function that takes a single
argument that is a tuple.

Currently, max_arity is set to 126 in native-code, to match the new
representation of closures proposed in ocaml#9619.  A signed 8-bit field
is used to store the arity.  126 instead of 127 to account for the
extra "environment" argument.

In bytecode the limit is infinity (max_int) because there are no needs
yet for a limited number of parameters.
@kayceesrk
Copy link
Copy Markdown
Contributor

Q: How far are we from a "no-naked-pointer" mode that does not need the page table at all?
A: In native code I think we're almost there.

Could you mention where the page table is still needed in no-naked-pointer mode?

@xavierleroy
Copy link
Copy Markdown
Contributor Author

Could you mention where the page table is still needed in no-naked-pointer mode?

The things you make me do... I checked the remaining calls to caml_page_table_lookup in the native-code runtime system compiled in no-naked-pointers mode. Here they are:

array.c
    function caml_make_vect: Is_in_value_area
    function caml_make_array: Is_in_value_area

compact.c
    function invert_pointer_at: Is_in_heap

extern.c
    function extern_rec: Is_in_value_area

finalise.c
    function generic_final_register: Is_in_heap_or_young
    + lots of assertions

globroots.c
    function classify_gc_root: Is_in_heap

hash.c
    function caml_hash: Is_in_value_area
    function hash_aux: Is_in_value_area

major_gc.c
   function mark_ephe_aux: Is_in_heap

minor_gc.c
    function caml_oldify_one: Is_in_value_area

obj.c
    function caml_obj_tag: Is_in_value_area
    function caml_lazy_follow_forward: Is_in_value_area
    function caml_obj_reachable_words: Is_in_heap_or_young

signals_nat.c
   function handle_signal: Is_in_code_area

caml/weak.h
   function caml_ephe_clean_partial: Is_in_heap_or_young

Is_in_value_area can be replaced with 1 in no-naked-pointers mode, if I'm not mistaken.

Is_in_code_area could be implemented differently, using the table of code fragments. It is not performance-critical.

The remaining occurrences of Is_in_heap and Is_in_heap_or_young need scrutiny.

Copy link
Copy Markdown
Contributor

@mshinwell mshinwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good. I checked all of the arithmetic and couldn't find any mistakes. I left a couple of minor comments.

Regarding removal of the page table entirely, I don't think we can do this on single core OCaml, because of the presence of the compactor.

@mshinwell
Copy link
Copy Markdown
Contributor

I believe this is ready to merge once the CI passes. Thanks.

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed this PR and everything looks good. Note that there is another place where we examine pointers and could avoid looking at the code pointers of closures: the compactor. But we still need the page table there anyway.

@xavierleroy you'll have to rebase to get CI to run.

@xavierleroy
Copy link
Copy Markdown
Contributor Author

The rebase is a pain because of the bootstrap in the middle. I'm running a round of CI precheck right now, while I figure out what to do.

Add a "closure information" field after each code pointer in
a closure.  This field generalizes the "arity" field used by
the native-code compiler, in that it has room both for an arity
(always 0 in bytecode) and for the distance from the closure
to the first environment variable in the closure block.

This makes closures "self-described" and easy to scan for pointers:
everything up to the first env var is out-of-heap code pointers or
integers; everything after the first env var is a well-formed value.

At this point a bootstrap is needed.
ocamlc's computations of offsets in closures changed to match changes in the
abstract machine.
In code that builds closures, instead of the old arity field,
produce a closure information field encoding arity + position of environment.
Expected outputs contain integer values for the "closure info"
field of some closures.  These values differ in 32 and 64 bits,
since the arity is stored in top 8 bits.  This test would need
different expected outputs for 32- and 64-bit platforms.

To keep things simple, this commit restricts the test
to only run on 64-bit platforms.  Since this changes the locations
Here we start reaping the benefits of the new closure representation.

The fields of a closure block that contain the code pointers need not
be scanned (in general) and must not be scanned (in no-naked-pointers mode).

Here, conservatively, we skip them in no-naked-pointers mode only,
but it would be sound to skip them unconditionally.
Atoms are zero-sized blocks allocated outside the heap.
It simplifies the GC in no-naked-pointers mode if their headers
have GC color Black, meaning "don't traverse".
In ocamlopt, Black is already used as the color for constant blocks
statically allocated outside the heap.
Now that atoms have black headers, all zero-sized blocks (atoms or
ocamlopt-generated static data) have black headers and will not
be traversed or changed by the major GC.
@xavierleroy
Copy link
Copy Markdown
Contributor Author

CI precheck: OK. Rebase from hell: done. Waiting for Travis&Appveyor.

@xavierleroy
Copy link
Copy Markdown
Contributor Author

Kicking Travis.

@xavierleroy xavierleroy closed this Jun 5, 2020
@xavierleroy xavierleroy reopened this Jun 5, 2020
@xavierleroy
Copy link
Copy Markdown
Contributor Author

Travis is not run, and Appveyor is pending as usual, so I'll just merge and take a well-deserved week-end.

@xavierleroy xavierleroy merged commit 9d4679f into ocaml:trunk Jun 5, 2020
@xavierleroy xavierleroy deleted the new-closure-repr branch June 5, 2020 16:26
xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request Jun 5, 2020
This commit introduces a quantity Lambda.max_arity that is the maximal
number of parameters that a Lambda function can have.

Uncurrying is throttled so that, for example, assuming the limit is 10,
a 15-argument curried function fun x1 ... x15 -> e
becomes a 10-argument function (x1...x10) that returns a 5-argument
function (x11...x15).

Concerning untupling, a function that takes a N-tuple of arguments,
where N is above the limit, remains a function that takes a single
argument that is a tuple.

Currently, max_arity is set to 126 in native-code, to match the new
representation of closures proposed in ocaml#9619.  A signed 8-bit field
is used to store the arity.  126 instead of 127 to account for the
extra "environment" argument.

In bytecode the limit is infinity (max_int) because there are no needs
yet for a limited number of parameters.
xavierleroy added a commit that referenced this pull request Jun 5, 2020
…9620)

This commit introduces a quantity Lambda.max_arity that is the maximal
number of parameters that a Lambda function can have.

Uncurrying is throttled so that, for example, assuming the limit is 10,
a 15-argument curried function fun x1 ... x15 -> e
becomes a 10-argument function (x1...x10) that returns a 5-argument
function (x11...x15).

Concerning untupling, a function that takes a N-tuple of arguments,
where N is above the limit, remains a function that takes a single
argument that is a tuple.

Currently, max_arity is set to 126 in native-code, to match the new
representation of closures implemented by #9619.  A signed 8-bit field
is used to store the arity.  126 instead of 127 to account for the
extra "environment" argument.

In bytecode the limit is infinity (max_int) because there are no needs
yet for a limit on the number of parameters.
@jhjourdan
Copy link
Copy Markdown
Contributor

I was also caught up by the bug found in #9681, which led me to do a grep Closure_tag in the runtime sources: apparently caml_alloc_dummy_infix still allocates a block with this tag, but without taking care of the closinfo field. Is this intended?

It turns out that the closinfo field is initialized by caml_alloc with Val_unit, which (by chance?) corresponds to a "somewhat valid" value of the closinfo field (if my understanding is good). If this is intended, then a comment would at the very welcome!

@xavierleroy
Copy link
Copy Markdown
Contributor Author

Re caml_alloc_dummy_infix: no, it's not intended, thanks for having spotted it. I also need to check caml_update_dummy carefully.

@xavierleroy
Copy link
Copy Markdown
Contributor Author

Could you mention where the page table is still needed in no-naked-pointer mode?

Once #9698 is merged, only the compactor will need the page table, more precisely this line of the compactor:

if (Ecolor (q) == 0 && Is_in_heap (q)){

@Kakadu
Copy link
Copy Markdown
Contributor

Kakadu commented Jul 29, 2020

@xavierleroy Do we already have a documentation entry for new closure representation?

I'm not interested in the documentation per se, actually I want to try to implement memoization/hashconsing of partially applied high-order functions. It seems that your changes are directly related to un-(partial applying) because

Given a closure block at run-time, there is no way to determine which well-formed OCaml values it contains. More precisely, there is no way to know where the entrypoints end and where the environment starts, since the first value of the environment can be a tagged integer that looks like an infix header.

Last time when I tried to dig into it, I realized that the representation changes when -flambda is on, and I couldn't to understand it from documentation too.

@xavierleroy
Copy link
Copy Markdown
Contributor Author

Do we already have a documentation entry for new closure representation?

For the moment, there's the description at the top of this PR #9619 and you can see examples of how to traverse closures here for example:

ocaml/runtime/hash.c

Lines 264 to 283 in 302d735

case Closure_tag: {
mlsize_t startenv;
len = Wosize_val(v);
startenv = Start_env_closinfo(Closinfo_val(v));
CAMLassert (startenv <= len);
/* Mix in the tag and size, but do not count this towards [num] */
h = caml_hash_mix_uint32(h, Whitehd_hd(Hd_val(v)));
/* Mix the code pointers, closure info fields, and infix headers */
for (i = 0; i < startenv; i++) {
h = caml_hash_mix_intnat(h, Field(v, i));
num--;
}
/* Copy environment fields into queue,
not exceeding the total size [sz] */
for (/*nothing*/; i < len; i++) {
if (wr >= sz) break;
queue[wr++] = Field(v, i);
}
break;
}

It seems that your changes are directly related to un-(partial applying)

Yes and no. Yes insofar as partial applications produce "interesting" closures, with non-trivial environments. No because the change of representation applies to all closures, not just those arising out of partial applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants