Skip to content

index: optimize go GC performance#4352

Merged
MichaelEischer merged 11 commits intorestic:masterfrom
MichaelEischer:pointerless-index
Jun 16, 2023
Merged

index: optimize go GC performance#4352
MichaelEischer merged 11 commits intorestic:masterfrom
MichaelEischer:pointerless-index

Conversation

@MichaelEischer
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer commented Jun 2, 2023

What does this PR change? What problem does it solve?

The current index implementation using a custom indexmap is not particular garbage-collection friendly. As each indexEntry and the bucket list contain pointers, this requires the garbage collector to chase pointers to random locations throughout the index.

This PR moves the indexEntries into an array and only adds the array index instead of a pointer. Thereby, the indexEntries and bucket list no longer contain pointers and are not scanned by the garbage collector. The array containing all indexEntries is implemented as https://en.wikipedia.org/wiki/Hashed_array_tree to minimize the memory overhead.

The top-layer of the hashed array tree (HAT) also has a size of O(sqrt(n)), which should make it cache efficient. The top-layer should be small enough to easily fit into the CPU cache and thus only adds little overhead compared to directly accessing an index entry via a pointer.

As a bonus, the indexmap.Each/Grow functions are now able to iterate over the indexEntries array instead of following random pointers. This significantly speeds up both functions.

Results

The indirection through the HAT increases the index lookup duration by 20-40% (the average is probably between 20-30% but the results are pretty noisy as a result of the varying map entry layout due to the maphash randomization), although the MasterIndex based tests show numbers between 0 to 35% overhead.

Iterating over the index is now up to 8 times faster, which also translates to slightly faster index encoding.
GC is now nearly for free.

grow() is also pretty interesting, the performance changes are likely a result of a different order of the chained indexEntries.

Commits:

  • fast-grow index: implement indexmap.grow() without random access
  • fast-each index: let indexmap.Each iterate in allocation order
  • hat-opt index: remove redundant storage of indexmap size
  • hat-inline index: Allow inlining of HAT
  • hat-base index: store indexEntries in hashed array tree
  • remove-pointers index: Remove pointers from within indexentrys
  • base - index: add garbage collection benchmark
benchstat base remove-pointers
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.25µs ± 3%    1.27µs ± 1%    +1.62%  (p=0.000 n=15+15)
DecodeIndex-12                                       2.38s ± 4%     2.35s ± 1%      ~     (p=0.367 n=15+15)
DecodeIndexParallel-12                               2.31s ± 5%     2.38s ± 1%    +2.70%  (p=0.041 n=15+12)
EncodeIndex/100-12                                   618µs ± 3%     621µs ± 3%      ~     (p=0.233 n=15+15)
EncodeIndex/1000-12                                 6.05ms ± 3%    6.11ms ± 2%      ~     (p=0.074 n=15+15)
EncodeIndex/10000-12                                64.5ms ± 2%    63.9ms ± 2%      ~     (p=0.134 n=15+14)
IndexHasUnknown-12                                  25.2ns ± 9%    26.7ns ±16%      ~     (p=0.083 n=14+15)
IndexHasKnown-12                                    26.8ns ± 7%    27.7ns ±11%    +3.35%  (p=0.030 n=13+15)
IndexAlloc-12                                        301ms ± 1%     338ms ± 2%   +12.34%  (p=0.000 n=15+15)
IndexAllocParallel-12                                107ms ± 3%      92ms ± 2%   -13.80%  (p=0.000 n=15+14)
MasterIndexAlloc-12                                  133ms ± 2%     197ms ± 3%   +48.36%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     60.3ns ±10%    61.3ns ± 9%      ~     (p=0.305 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.4ns ± 8%    54.9ns ± 6%    -9.22%  (p=0.000 n=15+14)
MasterIndexLookupSingleIndexUnknown-12              31.9ns ±11%    33.9ns ±12%    +6.27%  (p=0.015 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            30.4ns ± 9%    31.3ns ± 6%      ~     (p=0.083 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        163ns ± 9%     165ns ±12%      ~     (p=0.723 n=14+15)
MasterIndexLookupParallel/unknown,indices=25-12      129ns ±13%     133ns ±19%      ~     (p=0.330 n=15+15)
MasterIndexLookupParallel/known,indices=50-12        144ns ±14%     166ns ±13%   +15.12%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=50-12      128ns ±11%     133ns ±14%      ~     (p=0.075 n=14+15)
MasterIndexLookupParallel/known,indices=100-12       144ns ±15%     160ns ±18%   +10.89%  (p=0.004 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     130ns ± 5%     130ns ±13%      ~     (p=0.892 n=13+15)
MasterIndexLookupBlobSize-12                        30.3ns ± 6%    31.2ns ± 7%      ~     (p=0.087 n=15+15)
MasterIndexEach-12                                   734ms ± 1%     669ms ± 3%    -8.97%  (p=0.000 n=15+15)
MasterIndexGC-12                                     116ms ±36%       0ms ± 2%   -99.74%  (p=0.000 n=15+12)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 3%  3.23GB/s ± 1%    -1.60%  (p=0.000 n=15+15)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B           ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%      ~     (p=0.926 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 0%      ~     (p=0.516 n=15+12)
EncodeIndex/10000-12                                25.7MB ± 2%    25.7MB ± 5%      ~     (p=0.775 n=15+15)
IndexAlloc-12                                        451MB ± 0%     954MB ± 0%  +111.49%  (p=0.000 n=13+15)
IndexAllocParallel-12                                451MB ± 0%     954MB ± 0%  +111.46%  (p=0.000 n=11+14)
MasterIndexAlloc-12                                  162MB ± 0%     347MB ± 0%  +114.17%  (p=0.000 n=15+13)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00           ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%      ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%      ~     (p=0.061 n=15+13)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%      ~     (p=0.690 n=15+15)
IndexAlloc-12                                        1.53M ± 0%     1.10M ± 0%   -28.25%  (p=0.000 n=14+15)
IndexAllocParallel-12                                1.53M ± 0%     1.10M ± 0%   -28.25%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                   585k ± 0%      435k ± 0%   -25.67%  (p=0.000 n=15+13)
benchstat base hat-base
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.25µs ± 3%    1.22µs ± 2%   -1.86%  (p=0.000 n=15+14)
DecodeIndex-12                                       2.38s ± 4%     2.28s ± 1%   -4.14%  (p=0.000 n=15+14)
DecodeIndexParallel-12                               2.31s ± 5%     2.35s ± 4%     ~     (p=0.056 n=15+15)
EncodeIndex/100-12                                   618µs ± 3%     623µs ± 3%     ~     (p=0.074 n=15+15)
EncodeIndex/1000-12                                 6.05ms ± 3%    6.06ms ± 1%     ~     (p=0.591 n=15+14)
EncodeIndex/10000-12                                64.5ms ± 2%    64.6ms ± 1%     ~     (p=0.186 n=15+14)
IndexHasUnknown-12                                  25.2ns ± 9%    35.0ns ±29%  +39.01%  (p=0.000 n=14+15)
IndexHasKnown-12                                    26.8ns ± 7%    32.5ns ±19%  +21.18%  (p=0.000 n=13+15)
IndexAlloc-12                                        301ms ± 1%     288ms ± 2%   -4.25%  (p=0.000 n=15+15)
IndexAllocParallel-12                                107ms ± 3%      70ms ± 1%  -33.90%  (p=0.000 n=15+14)
MasterIndexAlloc-12                                  133ms ± 2%     147ms ± 3%  +10.52%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     60.3ns ±10%    67.2ns ±17%  +11.33%  (p=0.000 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.4ns ± 8%    60.4ns ±11%     ~     (p=0.645 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              31.9ns ±11%    40.2ns ±21%  +25.83%  (p=0.000 n=15+13)
MasterIndexLookupMultipleIndexUnknown-12            30.4ns ± 9%    36.9ns ±25%  +21.36%  (p=0.000 n=15+14)
MasterIndexLookupParallel/known,indices=25-12        163ns ± 9%     178ns ±18%   +8.89%  (p=0.003 n=14+15)
MasterIndexLookupParallel/unknown,indices=25-12      129ns ±13%     136ns ±19%     ~     (p=0.083 n=15+15)
MasterIndexLookupParallel/known,indices=50-12        144ns ±14%     178ns ±15%  +23.18%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=50-12      128ns ±11%     140ns ± 7%   +9.33%  (p=0.000 n=14+13)
MasterIndexLookupParallel/known,indices=100-12       144ns ±15%     174ns ±16%  +20.81%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     130ns ± 5%     140ns ±15%   +7.60%  (p=0.011 n=13+15)
MasterIndexLookupBlobSize-12                        30.3ns ± 6%    35.3ns ±16%  +16.50%  (p=0.000 n=15+14)
MasterIndexEach-12                                   734ms ± 1%     706ms ± 4%   -3.85%  (p=0.000 n=15+15)
MasterIndexGC-12                                     116ms ±36%       0ms ±25%  -99.73%  (p=0.000 n=15+13)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 3%  3.35GB/s ± 2%   +1.89%  (p=0.000 n=15+14)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%   +0.02%  (p=0.041 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 1%     ~     (p=0.690 n=15+15)
EncodeIndex/10000-12                                25.7MB ± 2%    25.4MB ± 0%     ~     (p=0.129 n=15+12)
IndexAlloc-12                                        451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=13+15)
IndexAllocParallel-12                                451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                  162MB ± 0%     213MB ± 0%  +31.58%  (p=0.000 n=15+13)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=0.074 n=15+12)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.927 n=15+15)
IndexAlloc-12                                        1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=14+15)
IndexAllocParallel-12                                1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                   585k ± 0%      507k ± 0%  -13.33%  (p=0.000 n=15+14)
benchstat base hat-inline
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.25µs ± 3%    1.25µs ± 2%     ~     (p=0.751 n=15+15)
DecodeIndex-12                                       2.38s ± 4%     2.28s ± 1%   -4.17%  (p=0.000 n=15+14)
DecodeIndexParallel-12                               2.31s ± 5%     2.33s ± 5%     ~     (p=0.267 n=15+15)
EncodeIndex/100-12                                   618µs ± 3%     624µs ± 5%     ~     (p=0.137 n=15+15)
EncodeIndex/1000-12                                 6.05ms ± 3%    6.04ms ± 1%     ~     (p=0.983 n=15+14)
EncodeIndex/10000-12                                64.5ms ± 2%    65.1ms ± 3%   +0.90%  (p=0.033 n=15+15)
IndexHasUnknown-12                                  25.2ns ± 9%    33.4ns ±39%  +32.65%  (p=0.000 n=14+15)
IndexHasKnown-12                                    26.8ns ± 7%    34.4ns ±36%  +28.24%  (p=0.000 n=13+15)
IndexAlloc-12                                        301ms ± 1%     286ms ± 2%   -4.81%  (p=0.000 n=15+15)
IndexAllocParallel-12                                107ms ± 3%      70ms ± 2%  -34.48%  (p=0.000 n=15+15)
MasterIndexAlloc-12                                  133ms ± 2%     146ms ± 3%  +10.06%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     60.3ns ±10%    67.5ns ±16%  +11.92%  (p=0.001 n=15+14)
MasterIndexLookupMultipleIndex-12                   60.4ns ± 8%    60.9ns ±14%     ~     (p=0.775 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              31.9ns ±11%    42.6ns ±37%  +33.55%  (p=0.000 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            30.4ns ± 9%    34.8ns ±19%  +14.52%  (p=0.000 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        163ns ± 9%     173ns ±15%   +6.01%  (p=0.010 n=14+15)
MasterIndexLookupParallel/unknown,indices=25-12      129ns ±13%     140ns ±18%   +8.97%  (p=0.003 n=15+15)
MasterIndexLookupParallel/known,indices=50-12        144ns ±14%     171ns ±16%  +19.01%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=50-12      128ns ±11%     148ns ±13%  +15.12%  (p=0.000 n=14+14)
MasterIndexLookupParallel/known,indices=100-12       144ns ±15%     170ns ±16%  +17.83%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     130ns ± 5%     142ns ±21%   +9.01%  (p=0.004 n=13+15)
MasterIndexLookupBlobSize-12                        30.3ns ± 6%    36.0ns ±18%  +18.87%  (p=0.000 n=15+15)
MasterIndexEach-12                                   734ms ± 1%     710ms ± 5%   -3.33%  (p=0.000 n=15+15)
MasterIndexGC-12                                     116ms ±36%       0ms ±28%  -99.64%  (p=0.000 n=15+15)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 3%  3.28GB/s ± 2%     ~     (p=0.784 n=15+15)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%     ~     (p=0.862 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 1%     ~     (p=0.624 n=15+15)
EncodeIndex/10000-12                                25.7MB ± 2%    25.8MB ± 5%     ~     (p=0.683 n=15+15)
IndexAlloc-12                                        451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=13+15)
IndexAllocParallel-12                                451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                  162MB ± 0%     213MB ± 0%  +31.58%  (p=0.000 n=15+11)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=0.061 n=15+13)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.719 n=15+15)
IndexAlloc-12                                        1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=14+15)
IndexAllocParallel-12                                1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                   585k ± 0%      507k ± 0%  -13.33%  (p=0.000 n=15+11)
benchstat base hat-opt
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.25µs ± 3%    1.24µs ± 1%     ~     (p=0.082 n=15+14)
DecodeIndex-12                                       2.38s ± 4%     2.28s ± 1%   -4.04%  (p=0.000 n=15+15)
DecodeIndexParallel-12                               2.31s ± 5%     2.34s ± 5%     ~     (p=0.148 n=15+15)
EncodeIndex/100-12                                   618µs ± 3%     627µs ± 2%   +1.55%  (p=0.002 n=15+15)
EncodeIndex/1000-12                                 6.05ms ± 3%    6.05ms ± 2%     ~     (p=0.870 n=15+15)
EncodeIndex/10000-12                                64.5ms ± 2%    64.4ms ± 2%     ~     (p=0.496 n=15+13)
IndexHasUnknown-12                                  25.2ns ± 9%    30.6ns ±13%  +21.68%  (p=0.000 n=14+14)
IndexHasKnown-12                                    26.8ns ± 7%    33.4ns ±23%  +24.61%  (p=0.000 n=13+15)
IndexAlloc-12                                        301ms ± 1%     286ms ± 2%   -4.88%  (p=0.000 n=15+14)
IndexAllocParallel-12                                107ms ± 3%      70ms ± 3%  -34.63%  (p=0.000 n=15+15)
MasterIndexAlloc-12                                  133ms ± 2%     144ms ± 2%   +8.70%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     60.3ns ±10%    66.2ns ±16%   +9.69%  (p=0.004 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.4ns ± 8%    60.6ns ±13%     ~     (p=0.992 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              31.9ns ±11%    40.9ns ±21%  +28.16%  (p=0.000 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            30.4ns ± 9%    36.6ns ±27%  +20.32%  (p=0.000 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        163ns ± 9%     173ns ±19%   +5.90%  (p=0.014 n=14+15)
MasterIndexLookupParallel/unknown,indices=25-12      129ns ±13%     143ns ±13%  +11.12%  (p=0.000 n=15+14)
MasterIndexLookupParallel/known,indices=50-12        144ns ±14%     176ns ±14%  +22.23%  (p=0.000 n=15+14)
MasterIndexLookupParallel/unknown,indices=50-12      128ns ±11%     132ns ±19%     ~     (p=0.598 n=14+15)
MasterIndexLookupParallel/known,indices=100-12       144ns ±15%     176ns ±14%  +21.67%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     130ns ± 5%     138ns ±17%     ~     (p=0.060 n=13+15)
MasterIndexLookupBlobSize-12                        30.3ns ± 6%    37.5ns ±22%  +23.92%  (p=0.000 n=15+15)
MasterIndexEach-12                                   734ms ± 1%     696ms ± 6%   -5.26%  (p=0.000 n=15+15)
MasterIndexGC-12                                     116ms ±36%       0ms ±25%  -99.68%  (p=0.000 n=15+15)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 3%  3.31GB/s ± 1%     ~     (p=0.085 n=15+14)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%     ~     (p=0.158 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 1%     ~     (p=0.461 n=15+15)
EncodeIndex/10000-12                                25.7MB ± 2%    25.4MB ± 0%   -1.16%  (p=0.029 n=15+13)
IndexAlloc-12                                        451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=13+15)
IndexAllocParallel-12                                451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                  162MB ± 0%     213MB ± 0%  +31.58%  (p=0.000 n=15+11)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=1.000 n=15+15)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.796 n=15+15)
IndexAlloc-12                                        1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=14+15)
IndexAllocParallel-12                                1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                   585k ± 0%      507k ± 0%  -13.33%  (p=0.000 n=15+11)
benchstat hat-opt fast-each
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.24µs ± 1%    1.24µs ± 1%     ~     (p=0.059 n=14+14)
DecodeIndex-12                                       2.28s ± 1%     2.29s ± 3%     ~     (p=0.233 n=15+15)
DecodeIndexParallel-12                               2.34s ± 5%     2.28s ± 1%   -2.44%  (p=0.007 n=15+12)
EncodeIndex/100-12                                   627µs ± 2%     609µs ± 3%   -2.98%  (p=0.000 n=15+15)
EncodeIndex/1000-12                                 6.05ms ± 2%    5.83ms ± 3%   -3.58%  (p=0.000 n=15+15)
EncodeIndex/10000-12                                64.4ms ± 2%    58.3ms ± 4%   -9.47%  (p=0.000 n=13+15)
IndexHasUnknown-12                                  30.6ns ±13%    32.2ns ±24%     ~     (p=0.974 n=14+15)
IndexHasKnown-12                                    33.4ns ±23%    32.2ns ±27%     ~     (p=0.158 n=15+15)
IndexAlloc-12                                        286ms ± 2%     286ms ± 3%     ~     (p=0.847 n=14+15)
IndexAllocParallel-12                               69.6ms ± 3%    69.5ms ± 2%     ~     (p=0.983 n=15+14)
MasterIndexAlloc-12                                  144ms ± 2%     117ms ± 2%  -18.67%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     66.2ns ±16%    66.6ns ±15%     ~     (p=0.935 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.6ns ±13%    60.7ns ±13%     ~     (p=0.926 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              40.9ns ±21%    40.1ns ±20%     ~     (p=0.631 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            36.6ns ±27%    35.9ns ±29%     ~     (p=0.705 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        173ns ±19%     171ns ±18%     ~     (p=0.829 n=15+15)
MasterIndexLookupParallel/unknown,indices=25-12      143ns ±13%     139ns ±25%     ~     (p=0.644 n=14+15)
MasterIndexLookupParallel/known,indices=50-12        176ns ±14%     167ns ±17%   -5.08%  (p=0.045 n=14+15)
MasterIndexLookupParallel/unknown,indices=50-12      132ns ±19%     141ns ±15%     ~     (p=0.157 n=15+15)
MasterIndexLookupParallel/known,indices=100-12       176ns ±14%     174ns ±14%     ~     (p=0.911 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     138ns ±17%     136ns ±17%     ~     (p=0.661 n=15+15)
MasterIndexLookupBlobSize-12                        37.5ns ±22%    35.3ns ±20%     ~     (p=0.055 n=15+14)
MasterIndexEach-12                                   696ms ± 6%      90ms ± 1%  -87.05%  (p=0.000 n=15+13)
MasterIndexGC-12                                     373µs ±25%     387µs ±24%     ~     (p=0.683 n=15+15)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.31GB/s ± 1%  3.29GB/s ± 1%     ~     (p=0.062 n=14+14)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%   -0.02%  (p=0.036 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 1%    2.61MB ± 0%   -0.10%  (p=0.002 n=15+9)
EncodeIndex/10000-12                                25.4MB ± 0%    25.7MB ± 2%     ~     (p=0.058 n=13+15)
IndexAlloc-12                                        496MB ± 0%     496MB ± 0%     ~     (p=0.157 n=15+10)
IndexAllocParallel-12                                496MB ± 0%     496MB ± 0%     ~     (p=0.847 n=15+14)
MasterIndexAlloc-12                                  213MB ± 0%     213MB ± 0%     ~     (p=0.763 n=11+13)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=0.700 n=15+15)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.192 n=15+15)
IndexAlloc-12                                        1.10M ± 0%     1.10M ± 0%     ~     (p=0.156 n=15+12)
IndexAllocParallel-12                                1.10M ± 0%     1.10M ± 0%     ~     (p=0.778 n=15+14)
MasterIndexAlloc-12                                   507k ± 0%      507k ± 0%   +0.00%  (p=0.003 n=11+13)
benchstat fast-each fast-grow
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.24µs ± 1%    1.24µs ± 3%     ~     (p=0.058 n=14+15)
DecodeIndex-12                                       2.29s ± 3%     2.21s ± 1%   -3.65%  (p=0.000 n=15+13)
DecodeIndexParallel-12                               2.28s ± 1%     2.21s ± 2%   -3.18%  (p=0.000 n=12+12)
EncodeIndex/100-12                                   609µs ± 3%     606µs ± 4%     ~     (p=0.533 n=15+14)
EncodeIndex/1000-12                                 5.83ms ± 3%    5.84ms ± 2%     ~     (p=0.653 n=15+15)
EncodeIndex/10000-12                                58.3ms ± 4%    58.7ms ± 3%     ~     (p=0.305 n=15+15)
IndexHasUnknown-12                                  32.2ns ±24%    33.6ns ±19%     ~     (p=0.190 n=15+14)
IndexHasKnown-12                                    32.2ns ±27%    36.5ns ±27%  +13.54%  (p=0.034 n=15+15)
IndexAlloc-12                                        286ms ± 3%     205ms ± 2%  -28.26%  (p=0.000 n=15+15)
IndexAllocParallel-12                               69.5ms ± 2%    60.4ms ± 2%  -13.08%  (p=0.000 n=14+15)
MasterIndexAlloc-12                                  117ms ± 2%     115ms ± 3%   -2.25%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     66.6ns ±15%    68.1ns ±17%     ~     (p=0.436 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.7ns ±13%    60.9ns ± 9%     ~     (p=0.902 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              40.1ns ±20%    42.2ns ±25%     ~     (p=0.418 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            35.9ns ±29%    35.2ns ±18%     ~     (p=0.926 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        171ns ±18%     172ns ±16%     ~     (p=0.959 n=15+15)
MasterIndexLookupParallel/unknown,indices=25-12      139ns ±25%     134ns ±20%     ~     (p=0.454 n=15+15)
MasterIndexLookupParallel/known,indices=50-12        167ns ±17%     166ns ±20%     ~     (p=0.846 n=15+15)
MasterIndexLookupParallel/unknown,indices=50-12      141ns ±15%     135ns ±11%     ~     (p=0.191 n=15+15)
MasterIndexLookupParallel/known,indices=100-12       174ns ±14%     169ns ±16%     ~     (p=0.486 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     136ns ±17%     135ns ±18%     ~     (p=0.911 n=15+15)
MasterIndexLookupBlobSize-12                        35.3ns ±20%    30.8ns ± 1%  -12.54%  (p=0.000 n=14+13)
MasterIndexEach-12                                  90.1ms ± 1%    90.5ms ± 1%     ~     (p=0.118 n=13+15)
MasterIndexGC-12                                     387µs ±24%     378µs ±25%     ~     (p=0.775 n=15+15)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 1%  3.30GB/s ± 3%     ~     (p=0.063 n=14+15)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%   +0.04%  (p=0.003 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 0%     ~     (p=0.055 n=9+15)
EncodeIndex/10000-12                                25.7MB ± 2%    25.8MB ± 5%     ~     (p=0.653 n=15+15)
IndexAlloc-12                                        496MB ± 0%     496MB ± 0%   -0.01%  (p=0.000 n=10+9)
IndexAllocParallel-12                                496MB ± 0%     496MB ± 0%   -0.00%  (p=0.000 n=14+15)
MasterIndexAlloc-12                                  213MB ± 0%     213MB ± 0%     ~     (p=0.723 n=13+15)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=1.000 n=15+15)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.333 n=15+15)
IndexAlloc-12                                        1.10M ± 0%     1.10M ± 0%   -0.00%  (p=0.000 n=12+12)
IndexAllocParallel-12                                1.10M ± 0%     1.10M ± 0%   -0.00%  (p=0.000 n=14+15)
MasterIndexAlloc-12                                   507k ± 0%      507k ± 0%     ~     (p=0.566 n=13+15)
benchstat base fast-grow
name                                              old time/op    new time/op    delta
IndexMapHash-12                                     1.25µs ± 3%    1.24µs ± 3%     ~     (p=0.095 n=15+15)
DecodeIndex-12                                       2.38s ± 4%     2.21s ± 1%   -6.97%  (p=0.000 n=15+13)
DecodeIndexParallel-12                               2.31s ± 5%     2.21s ± 2%   -4.42%  (p=0.000 n=15+12)
EncodeIndex/100-12                                   618µs ± 3%     606µs ± 4%   -1.87%  (p=0.000 n=15+14)
EncodeIndex/1000-12                                 6.05ms ± 3%    5.84ms ± 2%   -3.42%  (p=0.000 n=15+15)
EncodeIndex/10000-12                                64.5ms ± 2%    58.7ms ± 3%   -9.01%  (p=0.000 n=15+15)
IndexHasUnknown-12                                  25.2ns ± 9%    33.6ns ±19%  +33.49%  (p=0.000 n=14+14)
IndexHasKnown-12                                    26.8ns ± 7%    36.5ns ±27%  +36.18%  (p=0.000 n=13+15)
IndexAlloc-12                                        301ms ± 1%     205ms ± 2%  -31.75%  (p=0.000 n=15+15)
IndexAllocParallel-12                                107ms ± 3%      60ms ± 2%  -43.31%  (p=0.000 n=15+15)
MasterIndexAlloc-12                                  133ms ± 2%     115ms ± 3%  -13.59%  (p=0.000 n=15+15)
MasterIndexLookupSingleIndex-12                     60.3ns ±10%    68.1ns ±17%  +12.93%  (p=0.001 n=15+15)
MasterIndexLookupMultipleIndex-12                   60.4ns ± 8%    60.9ns ± 9%     ~     (p=0.806 n=15+15)
MasterIndexLookupSingleIndexUnknown-12              31.9ns ±11%    42.2ns ±25%  +32.32%  (p=0.000 n=15+15)
MasterIndexLookupMultipleIndexUnknown-12            30.4ns ± 9%    35.2ns ±18%  +15.94%  (p=0.000 n=15+15)
MasterIndexLookupParallel/known,indices=25-12        163ns ± 9%     172ns ±16%     ~     (p=0.146 n=14+15)
MasterIndexLookupParallel/unknown,indices=25-12      129ns ±13%     134ns ±20%     ~     (p=0.161 n=15+15)
MasterIndexLookupParallel/known,indices=50-12        144ns ±14%     166ns ±20%  +15.09%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=50-12      128ns ±11%     135ns ±11%     ~     (p=0.065 n=14+15)
MasterIndexLookupParallel/known,indices=100-12       144ns ±15%     169ns ±16%  +16.82%  (p=0.000 n=15+15)
MasterIndexLookupParallel/unknown,indices=100-12     130ns ± 5%     135ns ±18%     ~     (p=0.159 n=13+15)
MasterIndexLookupBlobSize-12                        30.3ns ± 6%    30.8ns ± 1%     ~     (p=0.333 n=15+13)
MasterIndexEach-12                                   734ms ± 1%      91ms ± 1%  -87.67%  (p=0.000 n=15+15)
MasterIndexGC-12                                     116ms ±36%       0ms ±25%  -99.67%  (p=0.000 n=15+15)

name                                              old speed      new speed      delta
IndexMapHash-12                                   3.29GB/s ± 3%  3.30GB/s ± 3%     ~     (p=0.116 n=15+15)

name                                              old alloc/op   new alloc/op   delta
IndexMapHash-12                                      0.00B          0.00B          ~     (all equal)
EncodeIndex/100-12                                   263kB ± 0%     263kB ± 0%   +0.04%  (p=0.009 n=15+15)
EncodeIndex/1000-12                                 2.61MB ± 0%    2.61MB ± 0%     ~     (p=0.148 n=15+15)
EncodeIndex/10000-12                                25.7MB ± 2%    25.8MB ± 5%     ~     (p=0.595 n=15+15)
IndexAlloc-12                                        451MB ± 0%     496MB ± 0%   +9.94%  (p=0.000 n=13+9)
IndexAllocParallel-12                                451MB ± 0%     496MB ± 0%   +9.95%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                  162MB ± 0%     213MB ± 0%  +31.58%  (p=0.000 n=15+15)

name                                              old allocs/op  new allocs/op  delta
IndexMapHash-12                                       0.00           0.00          ~     (all equal)
EncodeIndex/100-12                                   2.39k ± 0%     2.39k ± 0%     ~     (all equal)
EncodeIndex/1000-12                                  22.9k ± 0%     22.9k ± 0%     ~     (p=1.000 n=15+15)
EncodeIndex/10000-12                                  229k ± 0%      229k ± 0%     ~     (p=0.642 n=15+15)
IndexAlloc-12                                        1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=14+12)
IndexAllocParallel-12                                1.53M ± 0%     1.10M ± 0%  -28.13%  (p=0.000 n=11+15)
MasterIndexAlloc-12                                   585k ± 0%      507k ± 0%  -13.33%  (p=0.000 n=15+15)

Summary

All in all, the drastically reduced GC overhead makes the slightly slower index lookups a worthwhile tradeoff. In particular, this should allow us to set GOGC=50 by default without regressing performance too much, as most of the data kept in memory by restic can simply be skipped by the garbage collector.

Was the change previously discussed in an issue or on the forum?

No.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

Allocates an index and repeatedly triggers the GC.
The indexEntry objects are now allocated in a separate array. References
to an indexEntry are now stored as array indices. This has the benefit
of allowing the garbage collector to ignore the indexEntry objects as
these do not contain pointers and are part of a single large allocation.
This data structure reduces the wasted memory to O(sqrt(n)). The
top-layer of the hashed array tree (HAT) also has a size of O(sqrt(n)),
which makes it cache efficient. The top-layer should be small enough to
easily fit into the CPU cache and thus only adds little overhead
compared to directly accessing an index entry via a pointer.
Iterating through the indexmap according to the bucket order has the
problem that all indexEntries are accessed in random order which is
rather cache inefficient.

As we already keep a list of all allocated blocks, just iterate through
it. This allows iterating through a batch of indexEntries without random
memory accesses. In addition, the packID will likely remain similar
across multiple blobs as all blobs of a pack file are added as a single
batch.
@MichaelEischer
Copy link
Copy Markdown
Member Author

MichaelEischer commented Jun 2, 2023

To complement the above measurements, here are the number of a full system backup dry-run with 140GB in 1 Million files (5 million blobs total in the repository):

master - 106,02s user 12,96s system 265% cpu 44,855 total
pointerless-index - 87,52s user 12,64s system 232% cpu 43,112 total

That is, with this PR the backup is faster and requires 17% less CPU! For larger repositories, the improvement is likely even larger.

The corresponding output of setting GODEBUG=gctrace=1 is:

master - gc 30 @33.561s 0%: 0.032+54+0.017 ms clock, 0.52+0.12/204/519+0.27 ms cpu, 927->929->478 MB, 945 MB goal, 0 MB stacks, 0 MB globals, 16 P
pointerless-index - gc 33 @33.573s 0%: 0.032+0.70+0.020 ms clock, 0.52+0.027/1.7/1.9+0.33 ms cpu, 933->933->469 MB, 937 MB goal, 0 MB stacks, 0 MB globals, 16 P

The gc mark and sweep phase duration drops from 0.12/204/519 ms to 0.027/1.7/1.9 ms (assist/background/idle), essentially making the GC for free.

This was referenced Jun 2, 2023
This reverts commit f1c388c.

For an uninitialized indexmap the returned size was `-1` which is
unexpected and could cause problems.
@MichaelEischer
Copy link
Copy Markdown
Member Author

LGTM.

@MichaelEischer MichaelEischer merged commit dd1ef13 into restic:master Jun 16, 2023
@MichaelEischer MichaelEischer deleted the pointerless-index branch June 16, 2023 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants