Ensure that Gc.minor_words remains accurate after a GC #8619

Merged: xavierleroy merged 21 commits into ocaml:trunk from stedolan:young-ptr-reset on May 4, 2019

Conversation

@stedolan
Contributor

@stedolan stedolan commented Apr 16, 2019

If an allocation fails, the decrement of young_ptr should be undone before the GC is entered. This happened correctly on bytecode but not on native code. This commit fixes it for amd64 native.

This PR shouldn't be merged yet: it's probably broken on other native platforms as well. This patch adds a regression test, so we can see which platforms don't count minor_words correctly by checking which builds the test breaks. (Edit: I think this is ready to merge now!)

Closes #7798.

If an allocation fails, the decrement of young_ptr should be
undone before the GC is entered. This happened correctly on bytecode
but not on native code. This commit fixes it for amd64 native.

This is a partial cherry-pick of commit 8ceec on multicore.
@dra27
Member

dra27 commented Apr 16, 2019

Running through precheck - job #246

@xavierleroy
Contributor

Precheck crashed, but I restarted it.

Contributor

@xavierleroy xavierleroy left a comment


I'm concerned that this can have a negative impact on performance, see below why. Is it THAT important that Gc.minor_words is accurate?

    assert Config.spacetime;
    spacetime_before_uninstrumented_call ~node_ptr ~index
  end;
  I.add (int gc.gc_size) r15;
Contributor


Even though this add is out-of-line and in a cold part of the code, you're increasing the overall code size and making the code cache less efficient. This will have to be benchmarked.

@xavierleroy
Contributor

The test fails (as could be expected) on ARM, ARM64, PowerPC, and System Z.

@stedolan
Contributor Author

I think yes, for some applications it is important to be able to accurately measure allocations. It's certainly important for the multicore GC, which needs a similar patch (without this patch, there are holes of uninitialised data in the minor heap, which confuse multicore's promotion).

I don't think the code size increase is significant. I compared the code size before and after on the compiler distribution (full data here). In total, .text sections got 1.7% bigger, with the biggest outlier being asmcomp/x86_dsl.o, whose .text section got 4% bigger.

@stedolan
Contributor Author

The test fails (as could be expected) on ARM, ARM64, PowerPC, and System Z.

Thanks! Happy to fix those, but I'll wait until there's consensus on whether this should be merged.

@mshinwell
Contributor

I would add that when we investigated this behaviour at Jane Street a couple of years ago, it was in response to users being confused about why the minor words measurement wasn't accurate -- so this was user-driven rather than just spotting a technicality.

@xavierleroy
Contributor

Thanks for the explanations and for the measurements. I buy the "necessary for Multicore OCaml" argument.

The increase in code size is less than I feared. Indeed, for ocamlopt.opt the whole TEXT section increases by less than 1%.

It is possible to reduce the increase to negligible (0.1% overall, 0 for asmcomp/x86_dsl.o) by adding and using a few alternate entry points for caml_call_gc:

In emit_call_gc:

  begin match gc.gc_size with
  | 16 -> emit_call "caml_call_gc_1"
  | 24 -> emit_call "caml_call_gc_2"
  | 32 -> emit_call "caml_call_gc_3"
  | 40 -> emit_call "caml_call_gc_4"
  | 48 -> emit_call "caml_call_gc_5"
  | _  -> I.add (int gc.gc_size) r15;
          emit_call "caml_call_gc"
  end;

In runtime/amd64.S:

FUNCTION(G(caml_call_gc_1))
CFI_STARTPROC
        addq    $16, %r15
        jmp     G(caml_call_gc)
CFI_ENDPROC

(and likewise for the other 4 special entry points).

5 entry points is probably overkill. But as you noticed I'm a stickler for code size.

@gasche
Member

gasche commented Apr 17, 2019

(I think that @jhjourdan probably looked at this question when he implemented statmemprof, so I'm pinging him in case the changes here imply some required changes for his code -- possibly simplifications.)

@stedolan
Contributor Author

@xavierleroy

It is possible to reduce the increase to negligible (0.1% overall, 0 for asmcomp/x86_dsl.o) by adding and using a few alternate entry points for caml_call_gc:

Good idea, thanks!

5 entry points is probably overkill.

I've added entry points for 1/2/3-word allocations, since those are the sizes special-cased when optimising for size with -compact. (If changing 4/5-word allocations makes a significant difference, we should probably change -compact).

@gasche
This is no coincidence! I've started looking at the statmemprof code to try and get it merged (and am in contact with @jhjourdan). I don't think this particular PR should affect statmemprof, though.

@xavierleroy
Contributor

I had a very quick look at the other architectures. ARM, ARM64 and System Z can be fixed exactly like x86-64. PowerPC needs more work. I could look into it or we could decide that PowerPC is not worth the effort.

@stedolan
Contributor Author

I'd like to make this bug go away entirely, if possible. I'll have a go at fixing the other archs early next week. Is there something especially hairy about allocations on PowerPC?

@xavierleroy
Contributor

Is there something especially hairy about allocations on PowerPC?

Just look at asmcomp/powerpc/emit.mlp. The glue code that calls the GC

  • comes in 3 different versions (ppc32, ppc64, ppc64le)
  • is much bigger than usual (for ppc64 and ppc64le)
  • but is shared between all allocation sites, owing to the cool "conditional function call" instruction of PowerPC.

I can see plausible ways to fix the allocation pointer. But there's the extra issue that our ppc32/ppc64 test machine (a PowerMac G5) died recently, leaving only ppc64le as testable.

@mshinwell
Contributor

I think I have access to a system that can do ppc64 and possibly ppc32 as well.

Introduce one GC call point per allocation size.
Each call point corrects the allocation pointer r31 before calling caml_call_gc.
Tested on ppc64le (little-endian, ELF64v2 ABI).

Alternate code generation for calling the GC
@xavierleroy
Contributor

I pushed a fix for PPC directly on this branch. The generated code is quite compact, I'm pleased :-) It was tested on ppc64le, but the changes are not specific to any of the 3 ppc variants, so with luck it should work for ppc32 and ppc64 as well.

@stedolan
Contributor Author

I've just pushed an ARM64 patch. I don't have an ARM64 machine handy to test on right now, so I'd like to see whether this passes inria CI.

I'm using another approach: instead of mutating and then fixing young_ptr, this version only mutates young_ptr on successful allocations, doing the intermediate calculations on the result register instead. I think this is simpler, and it causes no code size increase on ARM because the 3-address format means it's just as easy to subtract from young_ptr and store the result elsewhere as to mutate young_ptr.

@stedolan
Contributor Author

Incidentally, the arm64 allocation code contains a special sequence to handle allocations of more than 0xfff bytes. The largest possible allocation is 0x808 bytes, so this can never run. Anyone mind if I delete this code path? Or is there a possibility of Max_young_wosize increasing significantly?

@xavierleroy
Contributor

CI precheck in progress.

Concerning allocations of more than 0xffff bytes: I guess I didn't realize Max_young_wosize is much lower than 0xffff... I'm OK with removing the large-allocation case and replacing it with an assert.

@xavierleroy
Contributor

CI precheck is successful for arm64 and for ppc64le.

@gasche
Member

gasche commented Apr 24, 2019

@xavierleroy: I don't know if you missed it, or you are secretly working on it, or you checked that there is no problem, but my comment above points at a potential bug in your PowerPC patch when Spacetime is enabled.

@xavierleroy
Contributor

xavierleroy commented Apr 24, 2019

For future reference, where should I look up s390x instructions? I got brcl 13 from here.

The full story is the Principles of Operation manual referenced in https://github.com/ocaml/ocaml/blob/trunk/asmcomp/s390x/NOTES.md, but it's a big manual.

I got the "12" by looking at how integer comparisons are already compiled, especially the branch_for_comparison function:

(* bit 0 = eq, bit 1 = lt, bit 2 = gt, bit 3 = overflow *)

For a "less than" comparison it gives code 4, and for a "less than or equal" comparison it gives code 12.

The one-line comment suggests what's going on: the branch conditional instruction uses the magic number as a mask against the (eq, lt, gt, ov) flags, branching if any of the flags with a 1 in the mask is set. So, 12 means "branch if eq or lt", and 13 means "branch if eq or lt or overflow". The "overflow" flag is important for FP comparisons, as it stands for "unordered", hence 13 means "branch if not greater than". For integer comparisons, it's not obvious to me whether the "overflow" flag is cleared or left unchanged, so I'd rather not test it.

@jhjourdan
Contributor

@gasche
This is no coincidence! I've started looking at the statmemprof code to try and get it merged (and am in contact with @jhjourdan). I don't think this particular PR should affect statmemprof, though.

I think it will, because statmemprof does a precise accounting of the allocation pointer in the minor heap in order to accurately sample memory blocks according to the desired probability distribution. I am certain that the corresponding change will be minimal, though.

@dra27 dra27 force-pushed the young-ptr-reset branch from 4393ac1 to 4f0230d on May 2, 2019
@dra27
Member

dra27 commented May 2, 2019

MSVC should now be fixed; it's running through precheck.

@dra27
Member

dra27 commented May 2, 2019

ARM is still failing:

List of failed tests:
    tests/regression/pr7798/'pr7798.ml' with 2 (native) 
    tests/lib-obj/'with_tag.ml' with 1 (native)

In the fast path for Ialloc, 4 extra bytes were allocated.
@xavierleroy
Contributor

I fixed the ARM code generator. 4 too many bytes were allocated in the fast path of Ialloc, turning the minor heap into Swiss cheese.

@stedolan
Contributor Author

stedolan commented May 3, 2019

Thanks for the fix!

@xavierleroy
Contributor

Running through precheck again.

As noticed by @stedolan, the maximal size for an allocation in the
minor heap is well under 0x1000, hence the generated code need not
make provisions for bigger allocations.
@xavierleroy
Contributor

xavierleroy commented May 3, 2019

Precheck seems happy. However, I pushed two additional commits, as discussed during review: 90f80dd adds a comment explaining the PowerPC code generation strategy, and c6e64b2 simplifies the allocation sequence on ARM64, based on the fact that minor heap allocations are always < 0x10000 bytes in size. So, I'll run precheck again.

Contributor

@xavierleroy xavierleroy left a comment


CI is happy, and I read the whole diff once again. The changes are too large already for me to swear they are correct, but this is the best we can do, so let's merge.

@xavierleroy xavierleroy merged commit c24e5b5 into ocaml:trunk May 4, 2019
@gasche
Member

gasche commented May 7, 2019

(Release pseudo-management on report: this is arguably a bugfix but it is large and touchy, so we are not putting this in 4.09.)

smuenzel pushed a commit to smuenzel/ocaml that referenced this pull request Jun 26, 2019
If an allocation fails, the decrement of young_ptr should be undone
before the GC is entered. This happened correctly on bytecode but not
on native code.

This commit (squash of pull request ocaml#8619) fixes it for all the
platforms supported by ocamlopt.

amd64: add alternate entry points caml_call_gc{1,2,3} for code size
optimisation.

powerpc: introduce one GC call point per allocation size per function.
Each call point corrects the allocation pointer r31 before calling
caml_call_gc.

i386, arm, arm64, s390x: update the allocation pointer after the
conditional branch to the GC, not before.

arm64: simplify the code generator: Ialloc can assume that less than
0x1_0000 bytes are allocated, since the max allocation size for the
minor heap is less than that.

This is a partial cherry-pick of commit 8ceec on multicore.
stedolan added a commit to stedolan/ocaml that referenced this pull request Oct 22, 2019
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
       by tail-calling caml_call_gc.

i386:  simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       these functions do not need to preserve ebx.

arm:   simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       partial revert of ocaml#8619.

arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.

FIXME: temporarily, arm64 has fastcode_flag disabled.
Successfully merging this pull request may close these issues.

minor_words in gc stats sometimes double counts allocations

6 participants