Ensure that Gc.minor_words remains accurate after a GC.#8619
Ensure that Gc.minor_words remains accurate after a GC.#8619xavierleroy merged 21 commits intoocaml:trunkfrom
Conversation
73808c6 to
2178f1f
Compare
If an allocation fails, the decrement of young_ptr should be undone before the GC is entered. This happened correctly on bytecode but not on native code. This commit fixes it for amd64 native. This is a partial cherry-pick of commit 8ceec on multicore.
2178f1f to
f065ca6
Compare
|
Running through precheck - job #246 |
|
Precheck crashed, but I restarted it. |
xavierleroy
left a comment
There was a problem hiding this comment.
I'm concerned that this can have a negative impact on performance, see below why. Is it THAT important that Gc.minor_words is accurate?
asmcomp/amd64/emit.mlp
Outdated
| assert Config.spacetime; | ||
| spacetime_before_uninstrumented_call ~node_ptr ~index | ||
| end; | ||
| I.add (int gc.gc_size) r15; |
There was a problem hiding this comment.
Even though this add is out-of-line and in a cold part of the code, you're increasing the overall code size and making the code cache less efficient. This will have to be benchmarked.
|
The test fails (as could be expected) on ARM, ARM64, PowerPC, and System Z. |
|
I think yes, for some applications it is important to be able to accurately measure allocations. It's certainly important for the multicore GC, which needs a similar patch (without this patch, there are holes of uninitialised data in the minor heap, which confuse multicore's promotion). I don't think the code size increase is significant. I compared the code size before and after on the compiler distribution (full data here). In total, |
Thanks! Happy to fix those, but I'll wait until there's consensus on whether this should be merged. |
|
I would add that when we investigated this behaviour at Jane Street a couple of years ago, it was in response to users being confused about why the minor words measurement wasn't accurate -- so this was user-driven rather than just spotting a technicality. |
|
Thanks for the explanations and for the measurements. I buy the "necessary for Multicore OCaml" argument. The increase in code size is less than I feared. Indeed, for ocamlopt.opt the whole TEXT section increases by less than 1%. It is possible to reduce the increase to negligible (0.1% overall, 0 for In In runtime/amd64.S: (and likewise for the other 4 special entry points). 5 entry points is probably overkill. But as you noticed I'm a stickler for code size. |
|
(I think that @jhjourdan probably looked at this question when he implemented statmemprof, so I'm pinging him in case the changes here imply some required changes for his code -- possibly simplifications.) |
Good idea, thanks!
I've added entry points for 1/2/3-word allocations, since those are the sizes special-cased when optimising for size with @gasche |
|
I had a very quick look at the other architectures. ARM, ARM64 and System Z can be fixed exactly like x86-64. PowerPC needs more work. I could look into it or we could decide that PowerPC is not worth the effort. |
|
I'd like to make this bug go away entirely, if possible. I'll have a go at fixing the other archs early next week. Is there something especially hairy about allocations on PowerPC? |
Just look at
I can see plausible ways to fix the allocation pointer. But there's the extra issue that our ppc32/ppc64 test machine (a PowerMac G5) died recently, leaving only ppc64le as testable. |
|
I think I have access to a system that can do ppc64 and possibly ppc32 as well. |
Introduce one GC call point per allocation size. Each call point corrects the allocation pointer r31 before calling caml_call_gc. Tested on ppc64le (little-endian, ELF64v2 ABI). Alternate code generation for calling the GC
|
I pushed a fix for PPC directly on this branch. The generated code is quite compact, I'm pleased :-) It was tested on ppc64le, but the changes are not specific to any of the 3 ppc variants, so with luck it should work for ppc32 and ppc64 as well. |
|
I've just pushed an ARM64 patch. I don't have an ARM64 machine handy to test on right now, so I'd like to see whether this passes inria CI. I'm using another approach: instead of mutating and then fixing young_ptr, this version only mutates young_ptr on successful allocations, doing the intermediate calculations on the result register instead. I think this is simpler, and it causes no code size increase on ARM because the 3-address format means it's just as easy to subtract from young_ptr and store the result elsewhere as to mutate young_ptr. |
|
Incidentally, the arm64 allocation code contains a special sequence to handle allocations of more than 0xfff bytes. The largest possible allocation is 0x808 bytes, so this can never run. Anyone mind if I delete this code path? Or there a possibility of Max_young_wosize increasing significantly? |
|
CI precheck in progress. Concerning allocations of more than 0xffff bytes: I guess I didn't realize Max_young_wosize is much lower than 0xffff... I'm OK with removing the large allocation case and replace it with an assert. |
|
CI precheck is successful for arm64 and for ppc64le. |
|
@xavierleroy: I don't know if you missed it, or you are secretly working on it, or you checked that there is no problem, but my comment above points at a potential bug in your PowerPC patch when Spacetime is enabled. |
The full story is the Principles of Operation manual referenced here https://github.com/ocaml/ocaml/blob/trunk/asmcomp/s390x/NOTES.md . But it's a big manual. I got the "12" by looking at how integer comparisons are compiled already, especially the Line 241 in f845abe For a "less than" comparison it gives code 4, and for a "less than or equal" comparison it gives code 12. The one-line comment suggests what's going on: the branch conditional instruction uses the magic number as a mask against the (eq, lt, gt, ov) flags, branching if any of the flags with a 1 in the mask is set. So, 12 means "branch if eq or lt", and 13 means "branch if eq or lt or overflow". The "overflow" flag is important for FP comparisons, as it stands for "unordered", hence 13 means "branch if not greater than". For integer comparisons, it's not obvious to me whether the "overflow" flag is cleared or left unchanged, so I'd rather not test it. |
I think it will, because statmemprof does a precise accounting of the allocation pointer in the minor heap in order to accurately sample memory blocks according to the desired probability distribution. I am certain that the corresponding change will be minimal, though. |
|
MSVC should now be fixed - it's running through precheck |
|
ARM is still failing: |
In the fast path for Ialloc, 4 extra bytes were allocated.
|
I fixed the ARM code generator. 4 too many bytes were allocated in the fast path of Ialloc, turning the minor heap into Swiss cheese. |
|
Thanks for the fix! |
|
Running through precheck again. |
As promised during review.
As noticed by @stedolan, the maximal size for an allocation in the minor heap is well under 0x1000, hence the generated code need not make provisions for bigger allocations.
|
Precheck seems happy. However, I pushed two additional commits, as discussed during review,. 90f80dd adds a comment explaining the PowerPC code generation strategy. c6e64b2 simplifies the allocation sequence on ARM64, based on the fact that minor heap allocations are always < 0x10000 bytes in size. So, I'll run precheck again. |
xavierleroy
left a comment
There was a problem hiding this comment.
CI is happy, and I read the whole diff once again. The changes are too large already for me to swear they are correct, but this is the best we can do, so let's merge.
|
(Release pseudo-management on report: this is arguably a bugfix but it is large and touchy, so we are not putting this in 4.09.) |
If an allocation fails, the decrement of young_ptr should be undone before the GC is entered. This happened correctly on bytecode but not on native code. This commit (squash of pull request ocaml#8619) fixes it for all the platforms supported by ocamlopt. amd64: add alternate entry points caml_call_gc{1,2,3} for code size optimisation. powerpc: introduce one GC call point per allocation size per function. Each call point corrects the allocation pointer r31 before calling caml_call_gc. i386, arm, arm64, s390x: update the allocation pointer after the conditional branch to the GC, not before. arm64: simplify the code generator: Ialloc can assume that less than 0x1_0000 bytes are allocated, since the max allocation size for the minor heap is less than that. This is a partial cherry-pick of commit 8ceec on multicore.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
FIXME: temporarily, arm64 have fastcode_flag disabled.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
FIXME: temporarily, arm64 have fastcode_flag disabled.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
FIXME: temporarily, arm64 has fastcode_flag disabled.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
s390: partial revert of ocaml#8619.
avoid restarting allocation seqeunce after failure.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
s390: partial revert of ocaml#8619.
avoid restarting allocation seqeunce after failure.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
s390: partial revert of ocaml#8619.
avoid restarting allocation seqeunce after failure.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
by tail-calling caml_call_gc.
i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
these functions do not need to preserve ebx.
arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
partial revert of ocaml#8619.
power: partial revert of ocaml#8619.
avoid restarting allocation sequence after failure.
s390: partial revert of ocaml#8619.
avoid restarting allocation seqeunce after failure.
If an allocation fails, the decrement of young_ptr should be undone before the GC is entered. This happened correctly on bytecode but not on native code. This commit fixes it for amd64 native.
This PR shouldn't be merged yet - it's probably broken on other native platforms as well. This patch adds a regression test, so we can see which platforms don't count minor_words correctly by seeing which builds are broken by this patch.I think this is ready to merge now!Closes #7798.