Skip to content

Don't use the page table when scanning closures#1156

Closed
mshinwell wants to merge 2 commits intoocaml:trunkfrom
mshinwell:closures_take_2
Closed

Don't use the page table when scanning closures#1156
mshinwell wants to merge 2 commits intoocaml:trunkfrom
mshinwell:closures_take_2

Conversation

@mshinwell
Copy link
Copy Markdown
Contributor

@mshinwell mshinwell commented Apr 28, 2017

This is related to #203 , which relates to an overall goal of removing the page table completely when in "no naked pointers" mode (and, in due course, for multicore). When that goal has been reached I am expecting approximately 5-10% performance improvement over the existing page table mode.

At present it is not possible to scan closures without the page table (or alternatively the code fragments table) because there is an ambiguity between environment entries and Infix_tag separators. This means that the positions of code pointers cannot be determined. Here is an example of the ambiguity. The following closure relates to only one function:

Closure_tag | Code ptr 1 | Arity = 2 | Code ptr 2 | FV1 | FV2 | FV3

and we suppose the value of the first free variable, when encoded as an OCaml integer, happens to equal Infix_tag. This looks just like the following closure which relates to two closed functions:

Closure_tag | Code ptr 1 | Arity = 2 | Code ptr 2 | Infix_tag | Code ptr | Arity = 1

In #203 an attempt was made to add GC headers before the start of functions' instructions to solve this problem. That proved to be complicated on one architecture and on x86-64 appears to have a surprising penalty -- on the cpdf libraries for example code size was increased by 5%. This is not helped that it is difficult to efficiently pack headers before the start of functions that must be 16 byte aligned, since no assembler directive exists to "align to the next 8-byte boundary that is not a 16-byte boundary".

This patch solves the problem another way, which I think is overall better (benchmarks in progress); it simply uses the bottom bit of the arity field (after having de-tagged it) to indicate whether the corresponding function is the last one in the closure block. This means that the GC can skip right over all of the code pointers and arities, landing at the start of any environment which may be present. I claim the performance impact should be minimal since the only place that appears to examine arities is the caml_apply family of functions, which now only gain two simple arithmetic instructions.

I mentioned above about the code fragments table, which might provide another solution to this problem, namely by checking every non-immediate member of a closure against it. However I think in general such tables are fragile and best avoided; indeed, there appear to be at least two different idioms for use of the code fragments table in the compiler code already. byterun/extern.c only checks the table when locating a code pointer, whereas Is_in_code_area also checks the page table. Furthermore, checking of the code fragments table is likely to be slow in the presence of a large number of dynamically-linked objects.

Copy link
Copy Markdown
Contributor

@chambart chambart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change and seems to be ok.

But I would push it a bit further. Having different native and bytecode value representation makes the GC more complex. I don't see a good reason not to addopt this change in bytecode too. Of course this would require some changes to the VM, but I don't think that it would break anything.

Also, rather than using a single bit to mark if the next value is the beginning of the environment part of the closure, you could use the most significant bits (or least ones) to store how far the environment is from the current position. This would simplify distance_to_environment and probably make it fast enough to remove the #ifdef in caml_oldify_one. Restricting functions to at most 32000 arguments and closures to 65000 functions on 32 bits systems does not feel like too bothersome.

mlsize_t caml_distance_to_closure_environment(value closure)
{
#ifndef NATIVE_CODE_AND_NO_NAKED_POINTERS
return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to be correct without NO_NAKED_POINTERS.
The ifndef should only be on NATIVE_CODE even if it is unused

CAMLassert ((pos - start) < Wosize_val(start));

arity = Long_val(pos[1]);
finished = (arity & (uintnat) 1) != 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 'documented' as an arity manipulating macro (or static inline function), probably in mlvalues.h.
#define Last_function_of_closures(x) ...


arity >>= 1;

if (arity <= 1) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly this should probably be a Next_function_of_closure macro

[alloc_closure_header 5 Debuginfo.none;
Cconst_symbol(name1 ^ "_" ^ string_of_int (num+1));
int_const (arity - num - 1);
int_const (((arity - num - 1) lsl 1) lor 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the arity_word function.

[alloc_closure_header 4 Debuginfo.none;
Cconst_symbol(name1 ^ "_" ^ string_of_int (num+1));
int_const 1; Cvar arg; Cvar clos],
int_const ((1 lsl 1) lor 1); Cvar arg; Cvar clos],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@mshinwell
Copy link
Copy Markdown
Contributor Author

Thanks for the review. I will think about changing the arity field representation as you suggest.

I believe @stedolan is in favour of harmonising the closure representations too. I will discuss that with him later this week, there may be some code already in existence for this.

Copy link
Copy Markdown
Contributor

@stedolan stedolan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea. In particular, something like this is necessary to spot code pointers when marshalling with marshalling of closures enabled.

I'm not sure that the arity field is the right place to steal a bit from, since it doesn't exist on bytecode. Consider stealing a bit from the Infix_tag word instead, where there are lots of free bits (although, then you'll need to figure out how to distinguish closures with no infix objects from closures with just one).

for (i = 0; i < sz; i++) {
value field = Field (v, i);
if (i >= first_part && Is_block (field) && Is_young (field)) {
/* This seems most unlikely to cause a stack overflow. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hughes-style difference lists will cause this to stack overflow:

type 'a dlist = 'a list -> 'a list
let singleton x = fun tail -> x :: tail
let concat a b = fun tail -> a (b tail)

A long list consists of deeply nested closures.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed. I'll fix that then.

@mshinwell
Copy link
Copy Markdown
Contributor Author

I think we could just introduce the arity word for bytecode as well.

@xavierleroy
Copy link
Copy Markdown
Contributor

Generally speaking: I am very much in favour of changing the in-memory representation of closures to simplify block scanning or other operations of the runtime system. However, I'd like to see some design discussions and some exploration of the design space before discussing pull requests that implement a particular design.

@stedolan
Copy link
Copy Markdown
Contributor

stedolan commented May 2, 2017

I think we could just introduce the arity word for bytecode as well.

That's a better plan.

Incidentally, the link is wrong in the description - you meant #203, not #208.

@xavierleroy
Copy link
Copy Markdown
Contributor

Continuing to think aloud: as an example of "(re-) exploring the design space", now might possibly be a good time to give up on infix pointers, Infix_tag, and the associated complexity, and go back to a pre-SML/NJ world where closures for mutually-recursive functions are implemented using cycles between multiple blocks. Just a thought, mind you. But one that we may want to entertain together.

@mshinwell
Copy link
Copy Markdown
Contributor Author

Here are some options that come to mind:

  1. Put GC headers before function entry points. I've experimented with this; I think the code size penalty (5% on cpdf x86-64 for example) seemed too large, and performance implications were a bit uncertain. It's also potentially problematic on architectures with complicated ABIs. I don't think this is worth pursuing, even though as mentioned above, it is in theory possible to reduce the code size penalty.
  2. Do something like what is in this pull request, by adjusting the arity field so that the start of the environment can be identified. This is a minor change that seems neutral for performance and code size. (The only place decoding the arity fields is the caml_apply functions; the overhead of the arithmetic will be negligible and there isn't any noticeable overall code size increase since these functions are never inlined.)
  3. Abandon the Infix_tag business altogether. I believe @stedolan is in favour of this. I'm a bit wary of going down this path, personally, as it doesn't seem that unlikely to be found useful for other things in the future. As an implementation remark, Spacetime's in-memory representation currently makes use of this tag.

Even if we decide on something like option 3, I'd like to argue for doing something like option 2 in the short term, simply because this issue has been dragging on for a long time and it would be nice to bring it to a resolution. We are sufficiently pressed for manpower that I don't reckon implementing option 3 at the present time can be justified.

There is a separate issue pertaining to the harmonisation of the closure representation between bytecode and native. My current thoughts on this are to simply adopt the native representation in bytecode.

@stedolan
Copy link
Copy Markdown
Contributor

I am indeed in favour of getting rid of Infix_tag - surprisingly, most of the really odd segfaults during development of the multicore GC were not weird concurrency bugs but failure to specially handle Infix_tag. For instance, Wosize_val(v) tells you how long a block is (but for Infix_tag), and Color_val(v) will tell you whether it has been marked (but for Infix_tag). In the minor GC, oldify_one returns an object with a valid header (but for Infix_tag).

More speculatively, I'd like to eventually be able to mix bytecode and native code in the same process, especially to be able to load native libraries in a bytecode toplevel. Making the closure representations agree would be a good step in this direction.

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch looks OK in general, except for the stack overflow, but it looks like you're going to refactor anyway.

match rem with
| [] -> true
| _::_ -> false
in
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could write this as

let last_one = rem = [] in

| [] -> true
| _::_ -> false
in
let arity_word = (f2.arity lsl 1) lor (if last_one then 1 else 0) in
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, or make last_one an integer and avoid double-testing?

CAMLassert(Tag_val(closure) == Closure_tag);
CAMLassert(Wosize_val(closure) >= 2);

start = (value*) closure;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be &Field(closure, 0)

start = (value*) closure;
pos = start;

while (!finished) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could refactor this loop to avoid testing finished twice, either as a for loop, or as a while(1) loop with if (finished) break; in the right place. I'd find the latter more readable, but it's probably a matter of taste.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Oct 11, 2017
@damiendoligez damiendoligez removed this from the consider-for-4.07 milestone Jun 5, 2018
@kayceesrk
Copy link
Copy Markdown
Contributor

I am in favour of this PR. This should be sufficient to unblock marshalling on multicore.

In the long term, it would be useful to remove Infix_tag as it has proved to be a constant source of bugs during the implementation of the multicore GC. But since that ford was crossed, I find less motivation to remove Infix_tag in the short term.

@mshinwell
Copy link
Copy Markdown
Contributor Author

Right, so, it's been more than a year since this has been stalled pending any further design discussions. As far as I know no-one has had the time to try implementing closure environments in a different form during this period. My preferred way forward is to see option 2 above, which corresponds to this patch, approved in principle. I can then address @damiendoligez 's comments above. We can always consider changing the representation in more major ways at a later date.

@kayceesrk
Copy link
Copy Markdown
Contributor

Any updates on this PR. Could we aim to get this in for 4.08?

@XVilka
Copy link
Copy Markdown
Contributor

XVilka commented May 10, 2019

@kayceesrk as practice shown - no.

@mshinwell
Copy link
Copy Markdown
Contributor Author

We're now exploring use of a different (and self-describing) closure representation that would supercede this work.

@mshinwell mshinwell closed this Apr 20, 2020
stedolan added a commit to stedolan/ocaml that referenced this pull request Mar 21, 2023
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants