index: optimize go GC performance#4352
Conversation
Allocates an index and repeatedly triggers the GC.
The indexEntry objects are now allocated in a separate array. References to an indexEntry are now stored as array indices. This has the benefit of allowing the garbage collector to ignore the indexEntry objects as these do not contain pointers and are part of a single large allocation.
This data structure reduces the wasted memory to O(sqrt(n)). The top-layer of the hashed array tree (HAT) also has a size of O(sqrt(n)), which makes it cache efficient. The top-layer should be small enough to easily fit into the CPU cache and thus only adds little overhead compared to directly accessing an index entry via a pointer.
Iterating through the indexmap according to the bucket order has the problem that all indexEntries are accessed in random order which is rather cache inefficient. As we already keep a list of all allocated blocks, just iterate through it. This allows iterating through a batch of indexEntries without random memory accesses. In addition, the packID will likely remain similar across multiple blobs as all blobs of a pack file are added as a single batch.
To complement the above measurements, here are the numbers for a full system backup dry-run with 140GB in 1 million files (5 million blobs total in the repository):

master - 106,02s user 12,96s system 265% cpu 44,855 total

That is, with this PR the backup is faster and requires 17% less CPU! For larger repositories, the improvement is likely even larger. The corresponding output of setting master - The gc mark and sweep phase duration drops from
This reverts commit f1c388c. For an uninitialized indexmap, the returned size was `-1`, which is unexpected and could cause problems.
LGTM.
What does this PR change? What problem does it solve?
The current index implementation, which uses a custom indexmap, is not particularly friendly to the garbage collector. As each indexEntry and the bucket list contain pointers, the garbage collector has to chase pointers to random locations throughout the index.
This PR moves the indexEntries into an array and stores the array index instead of a pointer. As a result, the indexEntries and the bucket list no longer contain pointers and are not scanned by the garbage collector. The array containing all indexEntries is implemented as a hashed array tree (https://en.wikipedia.org/wiki/Hashed_array_tree) to minimize the memory overhead.
The top-layer of the hashed array tree (HAT) also has a size of O(sqrt(n)), which should make it cache efficient: it should be small enough to easily fit into the CPU cache and thus add only little overhead compared to directly accessing an index entry via a pointer.
As a bonus, the indexmap.Each/Grow functions are now able to iterate over the indexEntries array instead of following random pointers. This significantly speeds up both functions.
Results
The indirection through the HAT increases the index lookup duration by 20-40%. The average is probably between 20-30%, but the results are quite noisy due to the varying map entry layout caused by the maphash randomization; the MasterIndex-based tests show between 0 and 35% overhead.
Iterating over the index is now up to 8 times faster, which also translates to slightly faster index encoding.
Garbage collection of the index is now nearly free.
grow() is also pretty interesting: the performance changes are likely a result of the different order of the chained indexEntries.
Commits:
- index: implement indexmap.grow() without random access
- index: let indexmap.Each iterate in allocation order
- index: remove redundant storage of indexmap size
- index: Allow inlining of HAT
- index: store indexEntries in hashed array tree
- index: Remove pointers from within indexentrys
- index: add garbage collection benchmark

benchstat base remove-pointers
benchstat base hat-base
benchstat base hat-inline
benchstat base hat-opt
benchstat hat-opt fast-each
benchstat fast-each fast-grow
benchstat base fast-grow
Summary
All in all, the drastically reduced GC overhead makes the slightly slower index lookups a worthwhile tradeoff. In particular, this should allow us to set GOGC=50 by default without regressing performance too much, as most of the data kept in memory by restic can simply be skipped by the garbage collector.
Was the change previously discussed in an issue or on the forum?
No.
Checklist
- [ ] I have added documentation for relevant changes (in the manual).
- [ ] There's a changelog file in `changelog/unreleased/` that describes the changes for our users (see template).
- [ ] I have run `gofmt` on the code in all commits.