Add cache-aligned atomic #12212

Merged
kayceesrk merged 2 commits into ocaml:trunk from bartoszmodelski:aligned-alloc
Jul 3, 2023

Conversation

@bartoszmodelski
Contributor

@bartoszmodelski bartoszmodelski commented Apr 26, 2023

Cache-aligned atomic to avoid false sharing in lock-free algorithms. This PR adds another constructor for 'a Atomic.t, which allocates a 7-word block directly in the shared heap, where it is memory-aligned by design.

The design

My main goal here is to get cache-aligned allocations in a way that fits best with the rest of the runtime. On the issue, @stedolan suggested relying on the fact that pools in the slab allocator are already cache-line aligned (#11986 (comment)). Since there is a sizeclass that holds 8-word (1+7) allocations only, each allocation there spans exactly one cache line, and thus all allocations there are aligned by design.

n                 n+8w               n+16w  
| variable 1      | variable 2       | ... 

Some experimenting has confirmed the general idea above, i.e. pools are aligned and consecutive 8-word allocations are spaced out as desired. However, they are not aligned: the pool begins with a 4-word header, which throws the allocations off. We end up with the following:

n                 n+8w               n+16w  
| header | variable 1       | variable 2         

Therefore, the natural thing to do here seems to be to double the size of the header. That lands us with the right-looking outcome, as far as my tests on x86-64 show.

Update May 12

As pointed out by @gadmm, rather than inflate the header size, we can move the trailing wastage area to the front and things work out fine. Structure of the entire pool looks as follows:

n                 n+8w               n+16w     ...       n+3996w    n+4kw
| header   | variable 1      | variable 2 |    ...       |  waste   |     

Inflating the header to shift cells is correct, but the header then has to be 256b (the largest cache-line size) for all sizeclasses. Thus, every pool contains ~0.8% wasted space to begin with. We can do the following instead:

n                 n+8w               n+16w        n+24w    ...      n+4kw
| header | waste  | variable 1       | variable 2 |        ...      |     

With this approach, blocks of size 8w, 16w and 32w are laid out from the end of the pool and hence aligned to cache lines. We also keep the wastage at the minimum allowed by a given sizeclass.

Compaction

In this PR I assume that compaction should not impact this. Intuitively, even on compaction an allocation should be moved to a slab with the same sizeclass, which gives us the same properties. Happy to dig in more.

Alignment

As @smuenzel pointed out, 64b alignment is not enough in the general case.

  • The M1 and others have 128b cache lines.
  • z/Arch processors have up to 256b (e.g. IBM z15).
  • Apparently, 128b might be beneficial on some 64-bit platforms as well (Stack Overflow).

Based on this, I think we want at least 128b alignment as a minimum. Do we want to push it to 256b?

Two details here:

  • We could increase the header to 256b and do the actual allocations based on the detected architecture/cache line. I cannot say how hard it is going to be to get right in all cases.
  • Currently, there is no exclusive sizeclass for 256b. Again, the strongest claim I'm going to make is that adding it manually makes my tests pass.

Related issue #11986.

cc @stedolan @sadiqj @polytypic

Member

@gasche gasche left a comment


I had a quick look at this PR. I am far from an expert on this sort of bit-level C programming, so some of my questions/remarks may be naive, apologies. My overall feeling (as a non-expert) is that the proposed interface (a new creation function for Atomic.t) is very reasonable -- I would even say that I like its simplicity -- but that its implementation in the runtime is not ready.


/* Alloc aligned for atomics */
CAMLextern value caml_alloc_aligned(value);
CAMLextern value caml_is_aligned(value);
Member

I'm not fond of the naming here: "aligned" and "is_aligned" are very generic names that I would expect to be parametrized over an arbitrary alignment (as caml_aligned_malloc does, for example). Also, the naming suggests only that we are allocating an OCaml value (so I would expect the parameters to be a header and a size or something), but in fact we are specifically allocating a one-word block of tag 0.

Maybe caml_make_atomic_aligned or caml_make_aligned_atomic would be better names? (And then caml_atomic_is_aligned or something. But I agree that you could skip this part of the PR that is not as clearly motivated.)

#define Default_runtime_events_log_wsize 16

/* Memory-alignment for allocating atomics */
#define Allocation_alignment 64
Member

Why 64? I guess that we are trying to use the cache-line size to avoid false sharing. This could be documented.

Contributor

Some quick googling suggests that the Apple M1 and POWER architecture processors have 128b cache line size, and z/architecture processors use 256b.

Contributor

@kayceesrk kayceesrk Apr 29, 2023

I would also suggest adding documentation to explain what "64" is supposed to mean here. Also, consider renaming this to Allocation_alignment_bsize (if it indeed is representing size in bytes).

Contributor Author

Thank you for the suggestions.

  • Renamed it.
  • Added better documentation.
  • Made it default to 128b and added a 256b case for z/Arch.

value* next_obj;
caml_domain_state* owner;
sizeclass sz;
long p1; long p2; long p3; long p4; /* padding to align slabs */
Member

I'm no C expert, but my impression is that this code is making assumptions about the size of long that may not hold on all supported configurations (at least if we include bytecode-only targets). I suspect that there is a caml-runtime-specific type alias that is more appropriate, maybe just uintnat. But should we also redefine sizeclass to be uintnat if we want to be able to make such padding assumptions?

Also, should this be uintnat padding[4] instead?

Finally, does this padding-to-align logic also work on systems with 32-bit machine words? I think that you are asserting that the first element in a pool is 64-byte-aligned, and assuming that an 8-word header ensures that, which may not hold on 32-bit systems.

runtime/alloc.c Outdated
Comment on lines +382 to +383
value var = caml_alloc_shr(7, 0);
CAMLassert((var - 8) % Allocation_alignment == 0);
Contributor

Shouldn't this depend on Allocation_alignment?

Suggested change
value var = caml_alloc_shr(7, 0);
CAMLassert((var - 8) % Allocation_alignment == 0);
mlsize_t wo_total_block_size = Allocation_alignment/sizeof(value);
value var = caml_alloc_shr(wo_total_block_size-1, 0);
CAMLassert((var - wo_total_block_size) % Allocation_alignment == 0);

Contributor

This makes even less sense than the original. I think var - 8 in the original code was trying to get the address of the block header. Here, var - wo_total_block_size subtracts a number of words from the integer representation of a pointer, which doesn't mean anything to me.

Contributor

Ah yes, it should have been sizeof(value)

Contributor

In other words: 8 means at least two different things in the original code.

/* This file is generated by tools/gen_sizeclasses.ml */
#define POOL_WSIZE 4096
#define POOL_HEADER_WSIZE 4
#define POOL_HEADER_WSIZE 8
Contributor

This is a generated file, I think you need to change tools/gen_sizeclasses.ml

Contributor

I believe this might be an issue that's breaking the tests, because other values depend on POOL_HEADER_WSIZE

@xavierleroy
Contributor

Before we shred the proposed code to pieces, could we first agree on a design? I think the proposed API on the OCaml side makes sense. I don't understand how it interacts with the current shared-heap allocator.

@xavierleroy xavierleroy marked this pull request as draft April 27, 2023 13:28
@xavierleroy
Contributor

xavierleroy commented Apr 27, 2023

I marked this PR as "Draft", because I don't think it is ready for review. Let's articulate a design first, before reviewing an implementation.

@bartoszmodelski
Contributor Author

bartoszmodelski commented Apr 28, 2023

Thank you all for the feedback!

I've moved the design from here to PR description.

@smuenzel
Contributor

I think that some of your explanations say "byte", when you mean "word". Can you edit/update the design document?

@bartoszmodelski
Contributor Author

Thanks, corrected.

@kayceesrk
Contributor

which allocates a 7-byte block

Should this be 7-word?

@bartoszmodelski
Contributor Author

Thank you, corrected.

@xavierleroy
Contributor

xavierleroy commented Apr 29, 2023

Thanks for the design notes.

Some experimenting has confirmed the general idea above, i.e. pools are aligned and consecutive 8-word allocations are spaced out as desired. However, they are not aligned. Pool begins with a 4-word header and that throws the allocations off.

Is this really a problem? EDIT: yes, it is; we get first fields of blocks (those containing the payloads of Atomic.t values) at addresses = 5 mod 8 (words), but addresses = 0 to 3 mod 8 (words) can contain data from a mutable block (but not an Atomic.t value!), which is enough to cause false sharing.

@xavierleroy
Contributor

xavierleroy commented Apr 29, 2023

At any rate, here is a simplified implementation. The remaining issue is to define Cache_line_bsize (size of cache lines in bytes) appropriately for the target processor.

#define Cache_line_bsize 64  /* or whatever is appropriate for the platform */

CAMLprim value caml_atomic_make_aligned (value v)
{
  CAMLparam1(v);
  const mlsize_t sz = Wosize_bhsize(Cache_line_bsize);
  value res = caml_alloc_shr(sz, 0);
  caml_initialize(&Field(res, 0), v);
  for (mlsize_t i = 1; i < sz; i++) Field(res, i) = Val_unit;
  CAMLreturn(res);
}

@xavierleroy
Contributor

Re: alignment of allocations in major heap pools, instead of adding fixed padding to the pool structure, you could perhaps change pool_initialize

ocaml/runtime/shared_heap.c

Lines 233 to 239 in 23dab79

Caml_inline void pool_initialize(pool* r,
                                 sizeclass sz,
                                 caml_domain_state* owner)
{
  mlsize_t wh = wsize_sizeclass[sz];
  value* p = (value*)((char*)r + POOL_HEADER_SZ);
  value* end = (value*)((char*)r + Bsize_wsize(POOL_WSIZE));

so that the initial p is sz-aligned, at least if sz is a power of 2. Something like this:

  mlsize_t ofs0 = sz > POOL_HEADER_SZ ? sz : POOL_HEADER_SZ;
  value* p = (value*)((char*)r + ofs0);

or, if you like Hacker's Delight and want to stick to the spec, nothing but the spec:

  mlsize_t ofs0 = sz > POOL_HEADER_SZ && (sz & (sz - 1)) == 0 ? sz : POOL_HEADER_SZ;
  value* p = (value*)((char*)r + ofs0);

@sadiqj
Contributor

sadiqj commented Apr 29, 2023

One thing to note, the proposed implementation of the compactor in #12193 adds a flag to the shared pool header: https://github.com/ocaml/ocaml/pull/12193/files#diff-b468e47c34ab18ea7b2cd8322cde20a2fa4d3e647422ee725f0fe7d84f768a32R54 which might make the decision of how to do alignment easier.

@bartoszmodelski
Contributor Author

I've pushed a number of changes:

  • Use @xavierleroy's implementation for atomic_make_aligned.
  • Move is_aligned to testsuite.
  • Use 128 bytes as the default alignment, and 256 bytes for z/Arch.
    • Increase header size to 256 bytes.
    • Add sizeclass of size 32 words. Tweaked max_overhead slightly, to keep the number of sizeclasses at 32. (Please let me know if I should re-run any benchmarks).
  • Better names, comments, etc.

One last thing I'm not sure about is how to fix the header size. The suggestion to change pool_initialize to shift slabs to alignment looks good, but I think that gen_sizeclasses.ml needs to be aware of this to properly compute wastage (and possibly other things).

Perhaps I could introduce a POOL_HEADER_MAX_WSIZE constant (or POOL_HEADER_RESERVED_WSIZE) that's treated as the de facto size of pool header (shifting the allocation of slabs as desired)?

FWIW, another thing that seems to work is attaching CAMLalign on the first field of pool struct. But we'd have to wrap it to work with words rather than bytes.

@bartoszmodelski bartoszmodelski force-pushed the aligned-alloc branch 3 times, most recently from c7a0377 to 90ba037 Compare May 2, 2023 17:00
@kayceesrk
Contributor

FWIW, it would be useful to add your branch to Sandmark nightly https://github.com/ocaml-bench/sandmark-nightly-config#adding-your-compiler-branch-to-the-nightly-runs to see if it affects the performance. CC @shakthimaan.

{
CAMLparam1(v);
const mlsize_t sz = Wosize_bhsize(Cache_line_bsize);
value res = caml_alloc_shr(sz, 0);
Contributor

Do you not want sz - 1 here to account for the fact that we have a 1-word header? I remember that the earlier code did caml_alloc_shr(7,0). What am I missing?

Contributor

Answering the question myself -- Wosize_bhsize and not Wosize_bsize! 👍


Runtime has a constructor for atomics, which aligns them with cache lines
to avoid false sharing. The implementation relies on the fact that pools are
allocated with high alignment and slots of appropriate size are laid out in
Contributor

What does "high alignment" mean?

Contributor Author

improved this comment

@bartoszmodelski bartoszmodelski force-pushed the aligned-alloc branch 2 times, most recently from e89030c to a6f309f Compare May 12, 2023 15:22
@bartoszmodelski
Contributor Author

I've pushed an update:

  • Implemented the suggestion to move the wastage block to the front. I've updated the design in the PR description to reflect this.
  • Changed Cache_line_bsize to 64 for x86-64, RISC-V and non-native targets.
  • Fixed the language in atomic.mli.

@kayceesrk
Contributor

The benchmarks look fine.

  • Sequential on Navajo
    • There is a small variance in both directions for the running time, about 4% either way, which I think is within the expected range.
    • There is a 10-20% increase in maxRSS on the cubicle benchmarks, which is odd. I suspect that this is due to secondary effects of making the pool headers larger. The "top heap words" doesn't vary much between trunk and this PR, so I don't expect to find anything interesting to investigate here.
  • Parallel on Navajo

@bartoszmodelski
Contributor Author

  • There is a 10-20% increase in maxRSS on the cubicle benchmarks, which is odd. I suspect that this is due to secondary effects of making the pool headers larger. The "top heap words" doesn't vary much between trunk and this PR. So I don't expect to find anything interesting to investigate here.

To be precise, the pool header is no longer larger, since wastage is reused as padding. However, with the changes to the size of some sizeclasses, in the worst case some allocations may end up in slots up to 10% larger, which would account for it.

@gadmm
Contributor

gadmm commented May 17, 2023

I find that this PR could benefit from a more detailed explanation of its motivation. Could you provide a compelling, typical use-case scenario for Atomic.create_aligned? The example of "making a list with an atomic reference at each cell" leaves me unconvinced that it would be a typical use-case; your experience with lock-free data structures is probably more compelling. Specifically, could you clarify whether you optimally need an atomic that does not interfere with any other memory cell, or an array of atomics that do not interfere with each other? The solution for the latter would be different from a change in size classes. Could you tell us what is provided in other languages for this problem (leaving aside C/C++ and other languages with tight control on memory layout)?

As an outcome it would be nice if we end up with a compelling example in the documentation for create_aligned.

@bartoszmodelski
Contributor Author

Examples:

  • For single-data-structure performance: Anmol's benchmarks on a simple queue with two shared atomic pointers show a significant improvement coming from enforcing the alignment (Aligned int repo). While the benchmark is simple, the "array with two pointers" design lets us build all kinds of bounded single/multi-producer/consumer stacks/queues/bags.
  • There's also a different issue: the amount of false sharing a program gets may change in unexpected ways, making benchmarks far noisier and more reactive to seemingly unrelated changes, for micro-architectural reasons. Avoid false-sharing #11986 (comment)

I looked at Java and Go a while ago. They both let the user add padding in the structure: Java has the @Contended annotation, while Go has cpu.CacheLinePad.

For C/C++, I think that padding is only required for arrays. For a single variable just using aligned_alloc works (which as far as I remember just allocates a larger block and chops it to alignment).

Contributor

@gadmm gadmm left a comment

Thanks for the pointers. Reading about @Contended and cpu.CacheLinePad, I agree that the present PR offers something like other languages' toolboxes. And instead of an example, it seems that there is a guideline along the lines of "first measure to see if you observe contention, and then decide if you benefit from alignment".

Regarding the specification, this seems less about being aligned than alone on a cache line (since alignment does not necessarily imply padding): "that is aligned on a cache line" => "that is alone on a cache line".

Apart from these two suggestions to improve the documentation, this PR looks good to me.

@bartoszmodelski
Contributor Author

bartoszmodelski commented May 26, 2023

Agreed with both.

@gasche
Member

gasche commented May 26, 2023

I agree with @gadmm's remark that the naming should focus on "alone/separate/contended" rather than "aligned". Should it go deeper than the documentation, and should we pick a better name than caml_atomic_make_aligned and Atomic.make_aligned? What about, for example:

  • Atomic.make_isolated
  • Atomic.make_separated
  • Atomic.make_cache_alone
  • Atomic.cache_contended?

Once this question is resolved (possibly just by telling me that names are hard and we should go with the current version), I would be happy to approve the PR (I have only done a partial review so I would also approve on behalf of @gadmm).

@gadmm
Contributor

gadmm commented May 26, 2023

make_aligned is ok as a colloquial name. make_padded is perhaps a bit more accurate.

@bartoszmodelski
Contributor Author

bartoszmodelski commented May 31, 2023

I've improved the comment. Feel free to push or suggest further changes.

As for the name, I find the Java-ish make_contended to be the clearest. It references the use case rather than the implementation details (unlike make_padded and make_aligned) and uses the typical vocabulary of the area (unlike make_isolated and make_separated).

Other than that, for padded vs aligned, these are the cases I've seen:

  • padding means just padding (Go, Java)
  • aligned may entail padding (aligned_alloc requires size to be a multiple of alignment).

@gasche
Member

gasche commented May 31, 2023

I like make_contended slightly better than make_aligned and would be in favor of the rename if you also like it. (Hopefully no one dislikes the new name!)

@bartoszmodelski
Contributor Author

Sg, renamed it.

Contributor

@kayceesrk kayceesrk left a comment

LGTM

@gasche
Member

gasche commented Jun 28, 2023

@bartoszmodelski could you fill in the Changes entry, so that we can merge this?

A drive-by comment: C++17 introduced a constant, hardware_destructive_interference_size, to get the required value in a portable way. If the C compiler we rely on supports something equivalent, it may be a better default than our own home-grown logic. (I am not suggesting that we change the current PR to support this; it could be a follow-up, and it requires more configure-fu than most people have.)

Cache-aligned atomic to avoid false-sharing in lock-free algorithms.
This patch relies on the slab allocator in shared heap, which is
cache-line-aligned by design.
@bartoszmodelski
Contributor Author

Thanks for the ping. Pushed Changes just now. Feel free to amend the entry if needed.


Labels

Performance PR or issues affecting runtime performance of the compiled programs
