Don't use the page table when scanning closures#1156
Don't use the page table when scanning closures#1156mshinwell wants to merge 2 commits intoocaml:trunkfrom
Conversation
There was a problem hiding this comment.
I think this is a good change and seems to be ok.
But I would push it a bit further. Having different native and bytecode value representation makes the GC more complex. I don't see a good reason not to addopt this change in bytecode too. Of course this would require some changes to the VM, but I don't think that it would break anything.
Also, rather than using a single bit to mark if the next value is the beginning of the environment part of the closure, you could use the most significant bits (or least ones) to store how far the environment is from the current position. This would simplify distance_to_environment and probably make it fast enough to remove the #ifdef in caml_oldify_one. Restricting functions to at most 32000 arguments and closures to 65000 functions on 32 bits systems does not feel like too bothersome.
| mlsize_t caml_distance_to_closure_environment(value closure) | ||
| { | ||
| #ifndef NATIVE_CODE_AND_NO_NAKED_POINTERS | ||
| return 0; |
There was a problem hiding this comment.
This function seems to be correct without NO_NAKED_POINTERS.
The ifndef should only be on NATIVE_CODE even if it is unused
| CAMLassert ((pos - start) < Wosize_val(start)); | ||
|
|
||
| arity = Long_val(pos[1]); | ||
| finished = (arity & (uintnat) 1) != 0; |
There was a problem hiding this comment.
This should be 'documented' as an arity manipulating macro (or static inline function), probably in mlvalues.h.
#define Last_function_of_closures(x) ...
|
|
||
| arity >>= 1; | ||
|
|
||
| if (arity <= 1) { |
There was a problem hiding this comment.
Similarly this should probably be a Next_function_of_closure macro
| [alloc_closure_header 5 Debuginfo.none; | ||
| Cconst_symbol(name1 ^ "_" ^ string_of_int (num+1)); | ||
| int_const (arity - num - 1); | ||
| int_const (((arity - num - 1) lsl 1) lor 1); |
There was a problem hiding this comment.
This should use the arity_word function.
| [alloc_closure_header 4 Debuginfo.none; | ||
| Cconst_symbol(name1 ^ "_" ^ string_of_int (num+1)); | ||
| int_const 1; Cvar arg; Cvar clos], | ||
| int_const ((1 lsl 1) lor 1); Cvar arg; Cvar clos], |
|
Thanks for the review. I will think about changing the arity field representation as you suggest. I believe @stedolan is in favour of harmonising the closure representations too. I will discuss that with him later this week, there may be some code already in existence for this. |
stedolan
left a comment
There was a problem hiding this comment.
I think this is a good idea. In particular, something like this is necessary to spot code pointers when marshalling with marshalling of closures enabled.
I'm not sure that the arity field is the right place to steal a bit from, since it doesn't exist on bytecode. Consider stealing a bit from the Infix_tag word instead, where there are lots of free bits (although, then you'll need to figure out how to distinguish closures with no infix objects from closures with just one).
| for (i = 0; i < sz; i++) { | ||
| value field = Field (v, i); | ||
| if (i >= first_part && Is_block (field) && Is_young (field)) { | ||
| /* This seems most unlikely to cause a stack overflow. */ |
There was a problem hiding this comment.
Hughes-style difference lists will cause this to stack overflow:
type 'a dlist = 'a list -> 'a list
let singleton x = fun tail -> x :: tail
let concat a b = fun tail -> a (b tail)
A long list consists of deeply nested closures.
There was a problem hiding this comment.
Ah, indeed. I'll fix that then.
|
I think we could just introduce the arity word for bytecode as well. |
|
Generally speaking: I am very much in favour of changing the in-memory representation of closures to simplify block scanning or other operations of the runtime system. However, I'd like to see some design discussions and some exploration of the design space before discussing pull requests that implement a particular design. |
|
Continuing to think aloud: as an example of "(re-) exploring the design space", now might possibly be a good time to give up on infix pointers, Infix_tag, and the associated complexity, and go back to a pre-SML/NJ world where closures for mutually-recursive functions are implemented using cycles between multiple blocks. Just a thought, mind you. But one that we may want to entertain together. |
|
Here are some options that come to mind:
Even if we decide on something like option 3, I'd like to argue for doing something like option 2 in the short term, simply because this issue has been dragging on for a long time and it would be nice to bring it to a resolution. We are sufficiently pressed for manpower that I don't reckon implementing option 3 at the present time can be justified. There is a separate issue pertaining to the harmonisation of the closure representation between bytecode and native. My current thoughts on this are to simply adopt the native representation in bytecode. |
|
I am indeed in favour of getting rid of More speculatively, I'd like to eventually be able to mix bytecode and native code in the same process, especially to be able to load native libraries in a bytecode toplevel. Making the closure representations agree would be a good step in this direction. |
damiendoligez
left a comment
There was a problem hiding this comment.
The patch looks OK in general, except for the stack overflow, but it looks like you're going to refactor anyway.
| match rem with | ||
| | [] -> true | ||
| | _::_ -> false | ||
| in |
There was a problem hiding this comment.
You could write this as
let last_one = rem = [] in
| | [] -> true | ||
| | _::_ -> false | ||
| in | ||
| let arity_word = (f2.arity lsl 1) lor (if last_one then 1 else 0) in |
There was a problem hiding this comment.
Same here, or make last_one an integer and avoid double-testing?
| CAMLassert(Tag_val(closure) == Closure_tag); | ||
| CAMLassert(Wosize_val(closure) >= 2); | ||
|
|
||
| start = (value*) closure; |
There was a problem hiding this comment.
Should be &Field(closure, 0)
| start = (value*) closure; | ||
| pos = start; | ||
|
|
||
| while (!finished) { |
There was a problem hiding this comment.
You could refactor this loop to avoid testing finished twice, either as a for loop, or as a while(1) loop with if (finished) break; in the right place. I'd find the latter more readable, but it's probably a matter of taste.
|
I am in favour of this PR. This should be sufficient to unblock marshalling on multicore. In the long term, it would be useful to remove |
|
Right, so, it's been more than a year since this has been stalled pending any further design discussions. As far as I know no-one has had the time to try implementing closure environments in a different form during this period. My preferred way forward is to see option 2 above, which corresponds to this patch, approved in principle. I can then address @damiendoligez 's comments above. We can always consider changing the representation in more major ways at a later date. |
|
Any updates on this PR. Could we aim to get this in for 4.08? |
|
@kayceesrk as practice shown - no. |
|
We're now exploring use of a different (and self-describing) closure representation that would supercede this work. |
This is related to #203 , which relates to an overall goal of removing the page table completely when in "no naked pointers" mode (and, in due course, for multicore). When that goal has been reached I am expecting approximately 5-10% performance improvement over the existing page table mode.
At present it is not possible to scan closures without the page table (or alternatively the code fragments table) because there is an ambiguity between environment entries and
Infix_tagseparators. This means that the positions of code pointers cannot be determined. Here is an example of the ambiguity. The following closure relates to only one function:Closure_tag | Code ptr 1 | Arity = 2 | Code ptr 2 | FV1 | FV2 | FV3and we suppose the value of the first free variable, when encoded as an OCaml integer, happens to equal
Infix_tag. This looks just like the following closure which relates to two closed functions:Closure_tag | Code ptr 1 | Arity = 2 | Code ptr 2 | Infix_tag | Code ptr | Arity = 1In #203 an attempt was made to add GC headers before the start of functions' instructions to solve this problem. That proved to be complicated on one architecture and on x86-64 appears to have a surprising penalty -- on the
cpdflibraries for example code size was increased by 5%. This is not helped that it is difficult to efficiently pack headers before the start of functions that must be 16 byte aligned, since no assembler directive exists to "align to the next 8-byte boundary that is not a 16-byte boundary".This patch solves the problem another way, which I think is overall better (benchmarks in progress); it simply uses the bottom bit of the arity field (after having de-tagged it) to indicate whether the corresponding function is the last one in the closure block. This means that the GC can skip right over all of the code pointers and arities, landing at the start of any environment which may be present. I claim the performance impact should be minimal since the only place that appears to examine arities is the
caml_applyfamily of functions, which now only gain two simple arithmetic instructions.I mentioned above about the code fragments table, which might provide another solution to this problem, namely by checking every non-immediate member of a closure against it. However I think in general such tables are fragile and best avoided; indeed, there appear to be at least two different idioms for use of the code fragments table in the compiler code already.
byterun/extern.conly checks the table when locating a code pointer, whereasIs_in_code_areaalso checks the page table. Furthermore, checking of the code fragments table is likely to be slow in the presence of a large number of dynamically-linked objects.