Conversation
gasche left a comment
I had a quick look at this PR. I am far from an expert on this sort of bit-level C programming, so some of my questions/remarks may be naive, apologies. My overall feeling (as a non-expert) is that the proposed interface (a new creation function for Atomic.t) is very reasonable -- I would even say that I like its simplicity -- but that its implementation in the runtime is not ready.
runtime/caml/alloc.h (outdated)

    /* Alloc aligned for atomics */
    CAMLextern value caml_alloc_aligned(value);
    CAMLextern value caml_is_aligned(value);
I'm not fond of the naming here: "aligned" and "is_aligned" are very generic names that I would expect to be parametrized over an arbitrary alignment (as caml_aligned_malloc does, for example). Also, the naming suggests that we are allocating an arbitrary OCaml value (so I would expect the parameters to be a header and a size or something), but in fact we are specifically allocating a one-word block of tag 0.
Maybe caml_make_atomic_aligned or caml_make_aligned_atomic would be better names? (And then caml_atomic_is_aligned or something. But I agree that you could skip this part of the PR that is not as clearly motivated.)
runtime/caml/config.h (outdated)

    #define Default_runtime_events_log_wsize 16

    /* Memory-alignment for allocating atomics */
    #define Allocation_alignment 64
Why 64? I guess that we are trying to use the cache-line size to avoid false sharing. This could be documented.
Some quick googling suggests that the Apple M1 and POWER architecture processors have a 128-byte cache line size, and z/Architecture processors use 256 bytes.
I would also suggest adding documentation to explain what "64" is supposed to mean here. Also, consider renaming this to Allocation_alignment_bsize (if it indeed represents a size in bytes).
Thank you for the suggestions.
- Renamed it.
- Added better documentation.
- Made it default to 128 bytes and added a 256-byte case for z/Arch.
runtime/shared_heap.c (outdated)

    value* next_obj;
    caml_domain_state* owner;
    sizeclass sz;
    long p1; long p2; long p3; long p4; /* padding to align slabs */
I'm no C expert, but my impression is that this code is making assumptions about the word size of long that may not hold under all supported configurations (at least if we include bytecode-only targets). I suspect that there is a caml-runtime-specific type alias that is more appropriate, maybe just uintnat. But should we also redefine sizeclass to be uintnat if we want to be able to make such padding assumptions?
Also, should this be uintnat padding[4] instead?
Finally, does this padding-to-align logic also work on systems with 32-bit machine words? I think that you are asserting that the first element in a pool is 64-byte-aligned, and assuming that an 8-word header ensures that, which may not hold on 32-bit systems.
runtime/alloc.c (outdated)

    value var = caml_alloc_shr(7, 0);
    CAMLassert((var - 8) % Allocation_alignment == 0);
Shouldn't this depend on Allocation_alignment?
Suggested change:

    - value var = caml_alloc_shr(7, 0);
    - CAMLassert((var - 8) % Allocation_alignment == 0);
    + mlsize_t wo_total_block_size = Allocation_alignment/sizeof(value);
    + value var = caml_alloc_shr(wo_total_block_size-1, 0);
    + CAMLassert((var - wo_total_block_size) % Allocation_alignment == 0);
This makes even less sense than the original. I think var - 8 in the original code was trying to get the address of the block header. Here, var - wo_total_block_size subtracts a number of words from the integer representation of a pointer, which doesn't mean anything to me.
Ah yes, it should have been sizeof(value).
In other words: 8 means at least two different things in the original code.
runtime/caml/sizeclasses.h (outdated)

    /* This file is generated by tools/gen_sizeclasses.ml */
    #define POOL_WSIZE 4096
    -#define POOL_HEADER_WSIZE 4
    +#define POOL_HEADER_WSIZE 8
This is a generated file; I think you need to change tools/gen_sizeclasses.ml instead.
I believe this might be an issue that's breaking the tests, because other values depend on POOL_HEADER_WSIZE
force-pushed 462a830 to 0171197
Before we shred the proposed code to pieces, could we first agree on a design? I think the proposed API on the OCaml side makes sense. I don't understand how it interacts with the current shared-heap allocator.
I marked this PR as "Draft", because I don't think it is ready for review. Let's articulate a design first, before reviewing an implementation.
Thank you all for the feedback! I've moved the design from here to the PR description.
I think that some of your explanations say "byte" when you mean "word". Can you edit/update the design document?
Thanks, corrected.
Should this be …

Thank you, corrected.
Thanks for the design notes.

Is this really a problem? EDIT: yes, it is; we get the first fields of blocks (those containing the payloads of Atomic.t values) at addresses = 5 mod 8 (words), but addresses = 0 to 3 mod 8 (words) can contain data from a mutable block (but not an Atomic.t value!), which is enough to cause false sharing.
At any rate, here is a simplified implementation. The remaining issue is to define …
Re: alignment of allocations in major heap pools: instead of adding fixed padding to the pool header, one could adjust the code at lines 233 to 239 in 23dab79 so that the initial p is sz-aligned, at least if sz is a power of 2. Something like this: …
or, if you like Hacker's Delight and want to stick to the spec, nothing but the spec: …
One thing to note: the proposed implementation of the compactor in #12193 adds a flag to the shared pool header (https://github.com/ocaml/ocaml/pull/12193/files#diff-b468e47c34ab18ea7b2cd8322cde20a2fa4d3e647422ee725f0fe7d84f768a32R54), which might make the decision of how to do alignment easier.
force-pushed 0171197 to 200798b
I've pushed a number of changes:

One last thing I'm not sure about is how to fix the header size. The suggestion to do the change in … Perhaps I could introduce a POOL_HEADER_MAX_WSIZE constant (or POOL_HEADER_RESERVED_WSIZE) that's treated as the de facto size of the pool header (shifting the allocation of slabs as desired)? FWIW, another thing that seems to work is attaching …
force-pushed c7a0377 to 90ba037
FWIW, it would be useful to add your branch to Sandmark nightly (https://github.com/ocaml-bench/sandmark-nightly-config#adding-your-compiler-branch-to-the-nightly-runs) to see if it affects the performance. CC @shakthimaan.
    {
      CAMLparam1(v);
      const mlsize_t sz = Wosize_bhsize(Cache_line_bsize);
      value res = caml_alloc_shr(sz, 0);
Do you not want sz - 1 here to account for the fact that we have a 1-word header? I remember that the earlier code did caml_alloc_shr(7,0). What am I missing?
Answering the question myself -- Wosize_bhsize and not Wosize_bsize! 👍
tools/gen_sizeclasses.ml (outdated)

    Runtime has a constructor for atomics, which aligns them with cache lines
    to avoid false sharing. The implementation relies on the fact that pools are
    allocated with high alignment and slots of appropriate size are laid out in
What does "high alignment" mean?
Improved this comment.
force-pushed e89030c to a6f309f
I've pushed an update:
The benchmarks look fine.

To be precise, the pool header is no longer larger, since wastage is reused as padding. However, with the changes to the size of some sizeclasses, in the worst case some allocations may end up in slots 10% larger than the smallest that would fit.
I find that this PR could benefit from a more detailed explanation of its motivation. Could you provide a compelling, typical use-case scenario for …?

As an outcome it would be nice if we end up with a compelling example in the documentation for …
Examples:

I looked at Java and Go a while ago. They both let the user add padding in the structure. Java has …. For C/C++, I think that padding is only required for arrays. For a single variable just using …
gadmm left a comment
Thanks for the pointers. Reading about @Contended and cpu.CacheLinePad, I agree that the present PR offers something like other language's toolbox. And instead of an example, it seems that there is a guideline along the lines of "first measure to see if you observe contention, and then decide if you benefit from alignment".
Regarding the specification, this seems less about being aligned than alone on a cache line (since alignment does not necessarily imply padding): "that is aligned on a cache line" => "that is alone on a cache line".
Apart from these two suggestions to improve the documentation, this PR looks good to me.
Agreed with both.
I agree with @gadmm's remark on the naming: focus on "alone/separate/contended" rather than "aligned". Should it go deeper than the documentation, and suggest a better name than …?

Once this question is resolved (possibly just by telling me that names are hard and we should go with the current version), I would be happy to approve the PR (I have only done a partial review, so I would also approve on behalf of @gadmm).
I've improved the comment. Feel free to push or suggest further changes. As for the name, I find the Java-ish make_contended to be the clearest. It references the use case rather than the implementation details (unlike make_padded, make_aligned) and uses typical vocabulary for the area (unlike make_isolated, make_separated). Other than that, for padded vs aligned, these are the cases I've seen:
I like …
Sg, renamed it.
@bartoszmodelski could you fill in the Changes entry, so that we can merge this?

A drive-by comment: C++17 introduced a constant, hardware_destructive_interference_size, to get the required value in a portable way. If the compiler we rely on supports this, it may be a better default than our own home-grown logic. (I am not suggesting that we change the current PR to support this; it could be a follow-up, and it requires more configure-fu than most people have.)
Cache-aligned atomic to avoid false sharing in lock-free algorithms. This patch relies on the slab allocator in the shared heap, which is cache-line-aligned by design.
force-pushed 12c74ba to ca974d0
Thanks for the ping. Pushed Changes just now. Feel free to amend the entry if needed.
Cache-aligned atomic to avoid false sharing in lock-free algorithms. This PR adds another constructor for 'a Atomic.t, which allocates a 7-word block directly in the shared heap, where it is memory-aligned by design.

The design
My main goal here is to get cache-aligned allocations in a way that fits best with the rest of the runtime. On the issue, @stedolan suggested relying on the fact that pools in the slab allocator are already cache-line aligned (#11986 (comment)). Since there's a sizeclass for 8-word (1+7) allocations only, each of the allocations there spans exactly one cache line, and thus all allocations there are aligned by design.
Some experimenting has confirmed the general idea above, i.e. pools are aligned and consecutive 8-word allocations are spaced out as desired. However, they are not aligned: the pool begins with a 4-word header and that throws the allocations off. We end up with the following:
Therefore, the natural thing to do here seems to be to double the size of the header. That lands us with the right-looking outcome, as far as my tests on x86-64 go.
Update May 12
As pointed out by @gadmm, rather than inflate the header size, we can move the trailing wastage area to the front and things work out fine. Structure of the entire pool looks as follows:
Inflating the header to shift cells is correct, but the header would have to be 256 bytes (the largest cache line size) for all sizeclasses. Thus, every pool would contain ~0.8% wasted space to begin with. We can do the following instead:
With this approach the blocks of 8w, 16w, 32w size are now aligned to the end of the pool and hence aligned to cache lines. We also keep the wastage to the minimum allowed by a given sizeclass.
Compaction
In this PR I assume that compaction should not impact this. Intuitively, even on compaction an allocation should be moved to a slab with the same sizeclass, which gives us the same properties. Happy to try to dig more.
Alignment
As @smuenzel pointed out, 64-byte alignment is not enough in the general case.
Based on this, I think that we want at least 128-byte alignment as a minimum. Do we want to push it to 256 bytes?
Two details here:
Related issue #11986.
cc @stedolan @sadiqj @polytypic