Ensure that Gc.minor_words remains accurate after a GC #8619

Merged: xavierleroy merged 21 commits into ocaml:trunk from stedolan:young-ptr-reset on May 4, 2019

Conversation

@stedolan
Contributor

@stedolan stedolan commented Apr 16, 2019

If an allocation fails, the decrement of young_ptr should be undone before the GC is entered. This happened correctly on bytecode but not on native code. This commit fixes it for amd64 native.

This PR shouldn't be merged yet: it's probably broken on other native platforms as well. This patch adds a regression test, so we can see which platforms don't count minor_words correctly by checking which builds the test breaks. (Edit: I think this is ready to merge now!)

Closes #7798.

If an allocation fails, the decrement of young_ptr should be
undone before the GC is entered. This happened correctly on bytecode
but not on native code. This commit fixes it for amd64 native.

This is a partial cherry-pick of commit 8ceec on multicore.
@dra27
Member

dra27 commented Apr 16, 2019

Running through precheck - job #246

@xavierleroy
Contributor

Precheck crashed, but I restarted it.

Contributor

@xavierleroy xavierleroy left a comment


I'm concerned that this can have a negative impact on performance, see below why. Is it THAT important that Gc.minor_words is accurate?

    assert Config.spacetime;
    spacetime_before_uninstrumented_call ~node_ptr ~index
  end;
  I.add (int gc.gc_size) r15;
Contributor


Even though this add is out-of-line and in a cold part of the code, you're increasing the overall code size and making the code cache less efficient. This will have to be benchmarked.

@xavierleroy
Contributor

The test fails (as could be expected) on ARM, ARM64, PowerPC, and System Z.

@stedolan
Contributor Author

I think yes, for some applications it is important to be able to accurately measure allocations. It's certainly important for the multicore GC, which needs a similar patch (without this patch, there are holes of uninitialised data in the minor heap, which confuse multicore's promotion).

I don't think the code size increase is significant. I compared the code size before and after on the compiler distribution (full data here). In total, .text sections got 1.7% bigger, with the biggest outlier being asmcomp/x86_dsl.o, whose .text section got 4% bigger.

@stedolan
Contributor Author

The test fails (as could be expected) on ARM, ARM64, PowerPC, and System Z.

Thanks! Happy to fix those, but I'll wait until there's consensus on whether this should be merged.

@mshinwell
Contributor

I would add that when we investigated this behaviour at Jane Street a couple of years ago, it was in response to users being confused about why the minor words measurement wasn't accurate -- so this was user-driven rather than just spotting a technicality.

@xavierleroy
Contributor

Thanks for the explanations and for the measurements. I buy the "necessary for Multicore OCaml" argument.

The increase in code size is less than I feared. Indeed, for ocamlopt.opt the whole TEXT section increases by less than 1%.

It is possible to reduce the increase to negligible (0.1% overall, 0 for asmcomp/x86_dsl.o) by adding and using a few alternate entry points for caml_call_gc:

In emit_call_gc:

  begin match gc.gc_size with
  | 16 -> emit_call "caml_call_gc_1"
  | 24 -> emit_call "caml_call_gc_2"
  | 32 -> emit_call "caml_call_gc_3"
  | 40 -> emit_call "caml_call_gc_4"
  | 48 -> emit_call "caml_call_gc_5"
  | _  -> I.add (int gc.gc_size) r15;
          emit_call "caml_call_gc"
  end;

In runtime/amd64.S:

FUNCTION(G(caml_call_gc_1))
CFI_STARTPROC
        addq    $16, %r15
        jmp     G(caml_call_gc)
CFI_ENDPROC

(and likewise for the other 4 special entry points).

5 entry points is probably overkill. But as you noticed I'm a stickler for code size.

@gasche
Member

gasche commented Apr 17, 2019

(I think that @jhjourdan probably looked at this question when he implemented statmemprof, so I'm pinging him in case the changes here imply some required changes for his code -- possibly simplifications.)

@stedolan
Contributor Author

@xavierleroy

It is possible to reduce the increase to negligible (0.1% overall, 0 for asmcomp/x86_dsl.o) by adding and using a few alternate entry points for caml_call_gc:

Good idea, thanks!

5 entry points is probably overkill.

I've added entry points for 1/2/3-word allocations, since those are the sizes special-cased when optimising for size with -compact. (If changing 4/5-word allocations makes a significant difference, we should probably change -compact).

@gasche
This is no coincidence! I've started looking at the statmemprof code to try and get it merged (and am in contact with @jhjourdan). I don't think this particular PR should affect statmemprof, though.

@xavierleroy
Contributor

I had a very quick look at the other architectures. ARM, ARM64 and System Z can be fixed exactly like x86-64. PowerPC needs more work. I could look into it or we could decide that PowerPC is not worth the effort.

@stedolan
Contributor Author

I'd like to make this bug go away entirely, if possible. I'll have a go at fixing the other archs early next week. Is there something especially hairy about allocations on PowerPC?

@xavierleroy
Contributor

Is there something especially hairy about allocations on PowerPC?

Just look at asmcomp/powerpc/emit.mlp. The glue code that calls the GC

  • comes in 3 different versions (ppc32, ppc64, ppc64le)
  • is much bigger than usual (for ppc64 and ppc64le)
  • but is shared between all allocation sites, owing to the cool "conditional function call" instruction of PowerPC.

I can see plausible ways to fix the allocation pointer. But there's the extra issue that our ppc32/ppc64 test machine (a PowerMac G5) died recently, leaving only ppc64le as testable.

@mshinwell
Contributor

I think I have access to a system that can do ppc64 and possibly ppc32 as well.

Introduce one GC call point per allocation size.
Each call point corrects the allocation pointer r31 before calling caml_call_gc.
Tested on ppc64le (little-endian, ELF64v2 ABI).

Alternate code generation for calling the GC
@xavierleroy
Contributor

I pushed a fix for PPC directly on this branch. The generated code is quite compact, I'm pleased :-) It was tested on ppc64le, but the changes are not specific to any of the 3 ppc variants, so with luck it should work for ppc32 and ppc64 as well.

@stedolan
Contributor Author

I've just pushed an ARM64 patch. I don't have an ARM64 machine handy to test on right now, so I'd like to see whether this passes inria CI.

I'm using another approach: instead of mutating and then fixing young_ptr, this version only mutates young_ptr on successful allocations, doing the intermediate calculations on the result register instead. I think this is simpler, and it causes no code size increase on ARM because the 3-address format means it's just as easy to subtract from young_ptr and store the result elsewhere as to mutate young_ptr.

@stedolan
Contributor Author

Incidentally, the arm64 allocation code contains a special sequence to handle allocations of more than 0xfff bytes. The largest possible allocation is 0x808 bytes, so this can never run. Anyone mind if I delete this code path? Or is there a possibility of Max_young_wosize increasing significantly?

@xavierleroy
Contributor

CI precheck in progress.

Concerning allocations of more than 0xffff bytes: I guess I didn't realize Max_young_wosize is much lower than 0xffff... I'm OK with removing the large-allocation case and replacing it with an assert.

@xavierleroy
Contributor

CI precheck is successful for arm64 and for ppc64le.

@gasche
Member

gasche commented Apr 24, 2019

@xavierleroy: I don't know if you missed it, or you are secretly working on it, or you checked that there is no problem, but my comment above points at a potential bug in your PowerPC patch when Spacetime is enabled.

@xavierleroy
Contributor

xavierleroy commented Apr 24, 2019

For future reference, where should I look up s390x instructions? I got brcl 13 from here.

The full story is the Principles of Operation manual referenced in https://github.com/ocaml/ocaml/blob/trunk/asmcomp/s390x/NOTES.md, but it's a big manual.

I got the "12" by looking at how integer comparisons are already compiled, especially the branch_for_comparison function:

(* bit 0 = eq, bit 1 = lt, bit 2 = gt, bit 3 = overflow *)

For a "less than" comparison it gives code 4, and for a "less than or equal" comparison it gives code 12.

The one-line comment suggests what's going on: the branch conditional instruction uses the magic number as a mask against the (eq, lt, gt, ov) flags, branching if any of the flags with a 1 in the mask is set. So, 12 means "branch if eq or lt", and 13 means "branch if eq or lt or overflow". The "overflow" flag is important for FP comparisons, as it stands for "unordered", hence 13 means "branch if not greater than". For integer comparisons, it's not obvious to me whether the "overflow" flag is cleared or left unchanged, so I'd rather not test it.

@jhjourdan
Contributor

@gasche
This is no coincidence! I've started looking at the statmemprof code to try and get it merged (and am in contact with @jhjourdan). I don't think this particular PR should affect statmemprof, though.

I think it will, because statmemprof does a precise accounting of the allocation pointer in the minor heap in order to accurately sample memory blocks according to the desired probability distribution. I am certain that the corresponding change will be minimal, though.

@dra27 dra27 force-pushed the young-ptr-reset branch from 4393ac1 to 4f0230d on May 2, 2019
@dra27
Member

dra27 commented May 2, 2019

MSVC should now be fixed; it's running through precheck.

@dra27
Member

dra27 commented May 2, 2019

ARM is still failing:

List of failed tests:
    tests/regression/pr7798/'pr7798.ml' with 2 (native) 
    tests/lib-obj/'with_tag.ml' with 1 (native)

In the fast path for Ialloc, 4 extra bytes were allocated.
@xavierleroy
Contributor

I fixed the ARM code generator. 4 too many bytes were allocated in the fast path of Ialloc, turning the minor heap into Swiss cheese.

@stedolan
Contributor Author

stedolan commented May 3, 2019

Thanks for the fix!

@xavierleroy
Contributor

Running through precheck again.

As noticed by @stedolan, the maximal size for an allocation in the
minor heap is well under 0x1000, hence the generated code need not
make provisions for bigger allocations.
@xavierleroy
Contributor

xavierleroy commented May 3, 2019

Precheck seems happy. However, I pushed two additional commits, as discussed during review: 90f80dd adds a comment explaining the PowerPC code generation strategy, and c6e64b2 simplifies the allocation sequence on ARM64, based on the fact that minor heap allocations are always < 0x10000 bytes in size. So, I'll run precheck again.

Contributor

@xavierleroy xavierleroy left a comment


CI is happy, and I read the whole diff once again. The changes are too large already for me to swear they are correct, but this is the best we can do, so let's merge.

@xavierleroy xavierleroy merged commit c24e5b5 into ocaml:trunk May 4, 2019
@gasche
Member

gasche commented May 7, 2019

(Release pseudo-management on report: this is arguably a bugfix but it is large and touchy, so we are not putting this in 4.09.)

smuenzel pushed a commit to smuenzel/ocaml that referenced this pull request Jun 26, 2019
If an allocation fails, the decrement of young_ptr should be undone
before the GC is entered. This happened correctly on bytecode but not
on native code.

This commit (squash of pull request ocaml#8619) fixes it for all the
platforms supported by ocamlopt.

amd64: add alternate entry points caml_call_gc{1,2,3} for code size
optimisation.

powerpc: introduce one GC call point per allocation size per function.
Each call point corrects the allocation pointer r31 before calling
caml_call_gc.

i386, arm, arm64, s390x: update the allocation pointer after the
conditional branch to the GC, not before.

arm64: simplify the code generator: Ialloc can assume that less than
0x1_0000 bytes are allocated, since the max allocation size for the
minor heap is less than that.

This is a partial cherry-pick of commit 8ceec on multicore.
stedolan added a commit to stedolan/ocaml that referenced this pull request Oct 22, 2019
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
       by tail-calling caml_call_gc.

i386:  simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       these functions do not need to preserve ebx.

arm:   simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       partial revert of ocaml#8619.

arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.

FIXME: temporarily, arm64 has fastcode_flag disabled.
Successfully merging this pull request may close these issues.

minor_words in gc stats sometimes double counts allocations

6 participants