Add cache-aligned atomic #12212

Merged
kayceesrk merged 2 commits into ocaml:trunk from bartoszmodelski:aligned-alloc
Jul 3, 2023

Conversation

@bartoszmodelski
Contributor

@bartoszmodelski bartoszmodelski commented Apr 26, 2023

Cache-aligned atomic to avoid false sharing in lock-free algorithms. This PR adds another constructor for 'a Atomic.t, which allocates a 7-word block directly in the shared heap, where it is memory-aligned by design.

The design

My main goal here is to get cache-aligned allocations in a way that fits best with the rest of the runtime. On the issue, @stedolan suggested relying on the fact that pools in the slab allocator are already cache-line aligned (#11986 (comment)). Since there is a sizeclass that holds 8-word (1+7) allocations only, each allocation there spans exactly one cache line, and thus all allocations there are aligned by design.

n                 n+8w               n+16w  
| variable 1      | variable 2       | ... 

Some experimenting has confirmed the general idea above, i.e. pools are aligned and consecutive 8-word allocations are spaced out as desired. However, they are not aligned: the pool begins with a 4-word header, which throws the allocations off. We end up with the following:

n                 n+8w               n+16w  
| header | variable 1       | variable 2         

Therefore, the natural thing to do here seems to be to double the size of the header. That lands us with the right-looking outcome, as far as my tests on x86-64 show.

Update May 12

As pointed out by @gadmm, rather than inflate the header size, we can move the trailing wastage area to the front and things work out fine. Structure of the entire pool looks as follows:

n                 n+8w               n+16w     ...       n+3996w    n+4kw
| header   | variable 1      | variable 2 |    ...       |  waste   |     

Inflating the header to shift cells is correct, but the header then has to be 256b (the largest cache-line size) for all sizeclasses. Thus, every pool contains ~0.8% wasted space to begin with. We can do the following instead:

n                 n+8w               n+16w        n+24w    ...      n+4kw
| header | waste  | variable 1       | variable 2 |        ...      |     

With this approach, blocks of size 8w, 16w and 32w are laid out from the end of the pool and hence aligned to cache lines. We also keep the wastage at the minimum allowed by a given sizeclass.

Compaction

In this PR I assume that compaction should not impact this. Intuitively, even on compaction an allocation should be moved to a slab with the same sizeclass, which gives us the same properties. Happy to dig in more.

Alignment

As @smuenzel pointed out, 64b alignment is not enough in the general case.

  • The M1 and others have 128b cache lines.
  • z/Arch processors have up to 256b (e.g. IBM z15).
  • Apparently, 128b might be beneficial on some 64-bit platforms as well (Stack Overflow).

Based on this, I think we want at least 128b alignment as a minimum. Do we want to push it to 256b?

Two details here:

  • We could increase the header to 256b and do the actual allocations based on the detected architecture/cache line. I cannot say how hard it is going to be to get right in all cases.
  • Currently, there is no exclusive sizeclass for 256b. Again, the strongest claim I'm going to make is that adding it manually makes my tests pass.

Related issue #11986.

cc @stedolan @sadiqj @polytypic

Member

@gasche gasche left a comment


I had a quick look at this PR. I am far from an expert on this sort of bit-level C programming, so some of my questions/remarks may be naive, apologies. My overall feeling (as a non-expert) is that the proposed interface (a new creation function for Atomic.t) is very reasonable -- I would even say that I like its simplicity -- but that its implementation in the runtime is not ready.


/* Alloc aligned for atomics */
CAMLextern value caml_alloc_aligned(value);
CAMLextern value caml_is_aligned(value);
Member

I'm not fond of the naming here: "aligned" and "is_aligned" are very generic names that I would expect to be parametrized over an arbitrary alignment (as caml_aligned_malloc does, for example). Also, the naming suggests only that we are allocating an OCaml value (so I would expect the parameters to be a header and a size or something), but in fact we are specifically allocating a one-word block of tag 0.

Maybe caml_make_atomic_aligned or caml_make_aligned_atomic would be better names? (And then caml_atomic_is_aligned or something. But I agree that you could skip this part of the PR that is not as clearly motivated.)

#define Default_runtime_events_log_wsize 16

/* Memory-alignment for allocating atomics */
#define Allocation_alignment 64
Member

Why 64? I guess that we are trying to use the cache-line size to avoid false sharing. This could be documented.

Contributor

Some quick googling suggests that the Apple M1 and POWER architecture processors have 128b cache line size, and z/architecture processors use 256b.

Contributor

@kayceesrk kayceesrk Apr 29, 2023

I would also suggest adding documentation to explain what "64" is supposed to mean here. Also, consider renaming this to Allocation_alignment_bsize (if it indeed is representing size in bytes).

Contributor Author

Thank you for the suggestions.

  • Renamed it.
  • Added better documentation.
  • Made it default to 128b and added a 256b case for z/Arch.

value* next_obj;
caml_domain_state* owner;
sizeclass sz;
long p1; long p2; long p3; long p4; /* padding to align slabs */
Member

I'm no C expert, but my impression is that this code is making assumptions about the size of long that may not hold on all supported configurations (at least if we include bytecode-only targets). I suspect that there is a caml-runtime-specific type alias that is more appropriate, maybe just uintnat. But should we also redefine sizeclass to be uintnat if we want to be able to make such padding assumptions?

Also, should this be uintnat padding[4] instead?

Finally, does this padding-to-align logic also work on systems with 32-bit machine words? I think that you are asserting that the first element in a pool is 64-byte-aligned, and assuming that an 8-word header ensures that, which may not hold on 32-bit systems.

runtime/alloc.c Outdated
Comment on lines +382 to +383
value var = caml_alloc_shr(7, 0);
CAMLassert((var - 8) % Allocation_alignment == 0);
Contributor

Shouldn't this depend on Allocation_alignment?

Suggested change
value var = caml_alloc_shr(7, 0);
CAMLassert((var - 8) % Allocation_alignment == 0);
mlsize_t wo_total_block_size = Allocation_alignment/sizeof(value);
value var = caml_alloc_shr(wo_total_block_size-1, 0);
CAMLassert((var - wo_total_block_size) % Allocation_alignment == 0);

Contributor

This makes even less sense than the original. I think var - 8 in the original code was trying to get the address of the block header. Here, var - wo_total_block_size subtracts a number of words from the integer representation of a pointer, which doesn't mean anything to me.

Contributor

Ah yes, it should have been sizeof(value)

Contributor

In other words: 8 means at least two different things in the original code.

/* This file is generated by tools/gen_sizeclasses.ml */
#define POOL_WSIZE 4096
#define POOL_HEADER_WSIZE 4
#define POOL_HEADER_WSIZE 8
Contributor

This is a generated file, I think you need to change tools/gen_sizeclasses.ml

Contributor

I believe this might be an issue that's breaking the tests, because other values depend on POOL_HEADER_WSIZE

@xavierleroy
Contributor

Before we shred the proposed code to pieces, could we first agree on a design? I think the proposed API on the OCaml side makes sense. I don't understand how it interacts with the current shared-heap allocator.

@xavierleroy xavierleroy marked this pull request as draft April 27, 2023 13:28
@xavierleroy
Contributor

xavierleroy commented Apr 27, 2023

I marked this PR as "Draft", because I don't think it is ready for review. Let's articulate a design first, before reviewing an implementation.

@bartoszmodelski
Contributor Author

bartoszmodelski commented Apr 28, 2023

Thank you all for the feedback!

I've moved the design from here to PR description.

@smuenzel
Contributor

I think that some of your explanations say "byte", when you mean "word". Can you edit/update the design document?

@bartoszmodelski
Contributor Author

Thanks, corrected.

@kayceesrk
Contributor

which allocates a 7-byte block

Should this be 7-word?

@bartoszmodelski
Contributor Author

Thank you, corrected.

@xavierleroy
Contributor

xavierleroy commented Apr 29, 2023

Thanks for the design notes.

Some experimenting has confirmed the general idea above, i.e. pools are aligned and consecutive 8-word allocations are spaced out as desired. However, they are not aligned. Pool begins with a 4-word header and that throws the allocations off.

Is this really a problem? EDIT: yes, it is; we get first fields of blocks (those containing the payloads of Atomic.t values) at addresses = 5 mod 8 (words), but addresses = 0 to 3 mod 8 (words) can contain data from a mutable block (but not an Atomic.t value!), which is enough to cause false sharing.

@xavierleroy
Contributor

xavierleroy commented Apr 29, 2023

At any rate, here is a simplified implementation. The remaining issue is to define Cache_line_bsize (size of cache lines in bytes) appropriately for the target processor.

#define Cache_line_bsize 64  /* or whatever is appropriate for the platform */

CAMLprim value caml_atomic_make_aligned (value v)
{
  CAMLparam1(v);
  const mlsize_t sz = Wosize_bhsize(Cache_line_bsize);
  value res = caml_alloc_shr(sz, 0);
  caml_initialize(&Field(res, 0), v);
  for (mlsize_t i = 1; i < sz; i++) Field(res, i) = Val_unit;
  CAMLreturn(res);
}

@xavierleroy
Contributor

Re: alignment of allocations in major heap pools, instead of adding fixed padding to the pool structure, you could perhaps change pool_initialize

ocaml/runtime/shared_heap.c

Lines 233 to 239 in 23dab79

Caml_inline void pool_initialize(pool* r,
                                 sizeclass sz,
                                 caml_domain_state* owner)
{
  mlsize_t wh = wsize_sizeclass[sz];
  value* p = (value*)((char*)r + POOL_HEADER_SZ);
  value* end = (value*)((char*)r + Bsize_wsize(POOL_WSIZE));

so that the initial p is sz-aligned, at least if sz is a power of 2. Something like this:

  mlsize_t ofs0 = sz > POOL_HEADER_SZ ? sz : POOL_HEADER_SZ;
  value* p = (value*)((char*)r + ofs0);

or, if you like Hacker's Delight and want to stick to the spec, nothing but the spec:

  mlsize_t ofs0 = sz > POOL_HEADER_SZ && (sz & (sz - 1)) == 0 ? sz : POOL_HEADER_SZ;
  value* p = (value*)((char*)r + ofs0);

@sadiqj
Contributor

sadiqj commented Apr 29, 2023

One thing to note, the proposed implementation of the compactor in #12193 adds a flag to the shared pool header: https://github.com/ocaml/ocaml/pull/12193/files#diff-b468e47c34ab18ea7b2cd8322cde20a2fa4d3e647422ee725f0fe7d84f768a32R54 which might make the decision of how to do alignment easier.

@bartoszmodelski
Contributor Author

I've pushed a number of changes:

  • Use @xavierleroy's implementation for atomic_make_aligned.
  • Move is_aligned to testsuite.
  • Use 128 bytes as the default alignment, and 256 bytes for z/Arch.
    • Increase header size to 256 bytes.
    • Add sizeclass of size 32 words. Tweaked max_overhead slightly, to keep the number of sizeclasses at 32. (Please let me know if I should re-run any benchmarks).
  • Better names, comments, etc.

One last thing I'm not sure about is how to fix the header size. The suggestion to change pool_initialize to shift slabs to alignment looks good, but I think that gen_sizeclasses.ml needs to be aware of this to properly compute wastage (and possibly other things).

Perhaps I could introduce a POOL_HEADER_MAX_WSIZE constant (or POOL_HEADER_RESERVED_WSIZE) that's treated as the de facto size of pool header (shifting the allocation of slabs as desired)?

FWIW, another thing that seems to work is attaching CAMLalign on the first field of pool struct. But we'd have to wrap it to work with words rather than bytes.

@bartoszmodelski bartoszmodelski force-pushed the aligned-alloc branch 3 times, most recently from c7a0377 to 90ba037 Compare May 2, 2023 17:00
@kayceesrk
Contributor

FWIW, it would be useful to add your branch to Sandmark nightly https://github.com/ocaml-bench/sandmark-nightly-config#adding-your-compiler-branch-to-the-nightly-runs to see if it affects the performance. CC @shakthimaan.

{
CAMLparam1(v);
const mlsize_t sz = Wosize_bhsize(Cache_line_bsize);
value res = caml_alloc_shr(sz, 0);
Contributor

Do you not want sz - 1 here to account for the fact that we have a 1-word header? I remember that the earlier code did caml_alloc_shr(7,0). What am I missing?

Contributor

Answering the question myself -- Wosize_bhsize and not Wosize_bsize! 👍


Runtime has a constructor for atomics, which aligns them with cache lines
to avoid false sharing. The implementation relies on the fact that pools are
allocated with high alignment and slots of appropriate size are laid out in
Contributor

What does "high alignment" mean?

Contributor Author

improved this comment

@bartoszmodelski bartoszmodelski force-pushed the aligned-alloc branch 2 times, most recently from e89030c to a6f309f Compare May 12, 2023 15:22
@bartoszmodelski
Contributor Author

I've pushed an update:

  • Implemented the suggestion to move the wastage block to the front. I've updated the design in the PR description to reflect this.
  • Changed Cache_line_bsize to 64 for x86-64, RISC-V and non-native targets.
  • Fixed the language in atomic.mli.

@kayceesrk
Contributor

The benchmarks look fine.

  • Sequential on Navajo
    • There is a small variance in both directions for the running time, about 4% either way, which I think is within the expected range.
    • There is a 10-20% increase in maxRSS on the cubicle benchmarks, which is odd. I suspect that this is due to secondary effects of making the pool headers larger. The "top heap words" doesn't vary much between trunk and this PR, so I don't expect to find anything interesting to investigate here.
  • Parallel on Navajo

@bartoszmodelski
Contributor Author

  • There is a 10-20% increase in maxRSS on the cubicle benchmarks, which is odd. I suspect that this is due to secondary effects of making the pool headers larger. The "top heap words" doesn't vary much between trunk and this PR. So I don't expect to find anything interesting to investigate here.

To be precise, the pool header is no longer larger, since wastage is reused as padding. However, with the changes to the size of some sizeclasses, in the worst case some allocations may end up in slots up to 10% larger, which would account for it.

@gadmm
Contributor

gadmm commented May 17, 2023

I find that this PR could benefit from a more detailed explanation of its motivation. Could you provide a compelling, typical use-case scenario for Atomic.create_aligned? The example of "making a list with an atomic reference at each cell" leaves me unconvinced that it would be a typical use-case; your experience with lock-free data structures is probably more compelling. Specifically, could you clarify whether you optimally need an atomic that does not interfere with any other memory cell, or an array of atomics that do not interfere with each other? The solution for the latter would be different from a change in size classes. Could you tell us what is provided in other languages for this problem (leaving aside C/C++ and other languages with tight control on memory layout)?

As an outcome it would be nice if we end up with a compelling example in the documentation for create_aligned.

@bartoszmodelski
Contributor Author

Examples:

  • For single-data-structure performance: Anmol's benchmarks on a simple queue with two shared atomic pointers show a significant improvement coming from enforcing the alignment (Aligned int repo). While the benchmark is simple, the "array with two pointers" design lets us build all kinds of bounded single/multi-producer/consumer stacks/queues/bags.
  • There's also a different issue: the amount of false sharing a program gets may change in unexpected ways, making benchmarks far noisier and more reactive to seemingly unrelated changes, for micro-architectural reasons. Avoid false-sharing #11986 (comment)

I looked at Java and Go a while ago. They both let the user add padding in the structure: Java has the @Contended annotation, while Go has cpu.CacheLinePad.

For C/C++, I think that padding is only required for arrays. For a single variable just using aligned_alloc works (which as far as I remember just allocates a larger block and chops it to alignment).

Contributor

@gadmm gadmm left a comment

Thanks for the pointers. Reading about @Contended and cpu.CacheLinePad, I agree that the present PR offers something like other languages' toolboxes. And instead of an example, it seems that there is a guideline along the lines of "first measure to see if you observe contention, and then decide if you benefit from alignment".

Regarding the specification, this seems less about being aligned than alone on a cache line (since alignment does not necessarily imply padding): "that is aligned on a cache line" => "that is alone on a cache line".

Apart from these two suggestions to improve the documentation, this PR looks good to me.

@bartoszmodelski
Contributor Author

bartoszmodelski commented May 26, 2023

Agreed with both.

@gasche
Member

gasche commented May 26, 2023

I agree with @gadmm's remark that the naming should focus on "alone/separate/contended" rather than "aligned". Should it go deeper than the documentation, and should we pick a better name than caml_atomic_make_aligned and Atomic.make_aligned? What about, for example:

  • Atomic.make_isolated
  • Atomic.make_separated
  • Atomic.make_cache_alone
  • Atomic.cache_contended?

Once this question is resolved (possibly just by telling me that names are hard and we should go with the current version), I would be happy to approve the PR (I have only done a partial review so I would also approve on behalf of @gadmm).

@gadmm
Contributor

gadmm commented May 26, 2023

make_aligned is ok as a colloquial name. make_padded is perhaps a bit more accurate.

@bartoszmodelski
Contributor Author

bartoszmodelski commented May 31, 2023

I've improved the comment. Feel free to push or suggest further changes.

As for the name, I find the Java-ish make_contended to be the clearest. It references the use case rather than the implementation details (unlike make_padded and make_aligned) and uses the typical vocabulary of the area (unlike make_isolated and make_separated).

Other than that, for padded vs aligned, these are the cases I've seen:

  • padding means just padding (Go, Java)
  • aligned may entail padding (aligned_alloc requires size to be a multiple of alignment).

@gasche
Member

gasche commented May 31, 2023

I like make_contended slightly better than make_aligned and would be in favor of the rename if you also like it. (Hopefully no one dislikes the new name!)

@bartoszmodelski
Contributor Author

Sg, renamed it.

Contributor

@kayceesrk kayceesrk left a comment

LGTM

@gasche
Member

gasche commented Jun 28, 2023

@bartoszmodelski could you fill in the Changes entry, so that we can merge this?

A drive-by comment: C++17 introduced a constant, hardware_destructive_interference_size, to get the required value in a portable way. If the C compiler we rely on supports something equivalent, it may be a better default than our own home-grown logic. (I am not suggesting that we change the current PR to support this; it could be a follow-up, and it requires more configure-fu than most people have.)

Cache-aligned atomic to avoid false-sharing in lock-free algorithms.
This patch relies on the slab allocator in shared heap, which is
cache-line-aligned by design.
@bartoszmodelski
Contributor Author

Thanks for the ping. Pushed Changes just now. Feel free to amend the entry if needed.


Labels

Performance PR or issues affecting runtime performance of the compiled programs
