Optimized SFL cluster linearization #34023
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/34023.

Reviews: see the guideline for information on the review process.

Conflicts: reviewers, this pull request conflicts with other open pull requests. If you consider this pull request important, please also help to review the conflicting pull requests, ideally starting with the one that should be merged first.
🚧 At least one of the CI tasks failed. Hints: try to run the tests locally, according to the documentation. Leave a comment here if you need help tracking down a confusing failure.
Force-pushed c3b0b64 to 73c3ea3
Force-pushed 447c370 to fd6d05b
Force-pushed 5ba9b9b to e136646
Force-pushed ad3cc10 to 6a04276
instagibbs left a comment:
ACK 87d74c1
Changes make sense, did not run the fuzzer myself. Non-blocking nits only.
In the chart's x-axis, what's the denominator?
Addressed @instagibbs' comments, added a commit to drop the
"benchmark count", I guess. Each benchmark takes up equal x-axis space, sorted by master's runtime. EDIT: oh, in the label? Those are "ntx/ndeps" for the test.
ajtowns left a comment:
> Pushed an update that splits the "get rid of DepData" commit in two:
Yeah, I'm still not smart enough to understand things unless they're way more bite-size than that. I broke the first of those commits down into 12 commits before it really made sense. OTOH, I think that found a couple of bugs/typos, so yay?
I didn't really see why you swapped the parent/child ordering in Activate (return child_chunk_idx), but I don't see any harm in it either.
ACK 7a0309b, though probably should do the fixups.
Since the deterministic ordering change, SpanningForestState holds a reference to the DepGraph it is linearizing. This means we no longer need to pass it to SanityCheck() as an argument.
This splits the chunk_deps variable in LoadLinearization in two, one for tracking tx dependencies and one for chunk dependencies. This is a preparation for a later commit, where chunks won't be identified anymore by a representative transaction in them, but by a separate index. With that, it seems weird to keep them both in the same structure if they will be indexed in an unrelated way. Note that the changes in src/test/util/cluster_linearize.h to the table of worst observed iteration counts are due to switching to a different data set, and are unrelated to the changes in this commit.
This small optimization avoids the need to loop over the parents of each transaction when initializing the dependency-counting structures inside GetLinearization().
This is a preparation for the next commit, where chunks will no longer be identified using a representative transaction, but using a set index. Reduce the load of line changes by doing this rename ahead of time.

-BEGIN VERIFY SCRIPT-
sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
-END VERIFY SCRIPT-
This significantly changes the data structures used in SFL, based on the observation that the DepData::top_setinfo fields are quite wasteful: there is one per dependency (up to n^2/4), but we only ever need one per active dependency (of which there are at most n-1). In total, the number of chunks plus the number of active dependencies is always exactly equal to the number of transactions, so it makes sense to have a shared pool of SetInfos, used for both chunks and top sets.

To that effect, introduce a separate m_set_info variable, which stores a SetInfo per transaction. Some of these are used for chunk sets, and some for active dependencies' top sets. Every activation transforms the parent's chunk into the top set for the new dependency. Every deactivation transforms the top set into the new parent chunk.

With indexes into m_set_data (SetIdx) becoming bounded by the number of transactions, we can use a SetType to represent sets of SetIdxs. Specifically, an m_chunk_idxs set is added which contains all SetIdx referring to chunks. This leads to a much more natural way of iterating over chunks. Also use this opportunity to normalize many variable names.
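The counting argument above is the standard forest invariant: in a spanning forest over ntx nodes, the number of trees (chunks) plus the number of edges (active dependencies) always equals ntx. A minimal sketch of why ntx pooled slots suffice, with hypothetical names rather than the PR's actual classes:

```cpp
#include <cassert>

// Hypothetical sketch (not the PR's code): each Activate merges two chunks
// across one new active dependency; each Deactivate splits one chunk in two.
// The invariant chunks + active_deps == ntx is why a shared pool of ntx
// SetInfo slots can serve both chunk sets and active dependencies' top sets.
struct ForestCounts {
    int ntx;          // transactions (forest nodes)
    int chunks;       // trees in the spanning forest
    int active_deps;  // edges in the spanning forest
    explicit ForestCounts(int n) : ntx(n), chunks(n), active_deps(0) {}
    void Activate() { --chunks; ++active_deps; }    // merge two chunks
    void Deactivate() { ++chunks; --active_deps; }  // split a chunk in two
    bool Invariant() const { return chunks + active_deps == ntx; }
};
```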
With the earlier change to pool SetInfo objects, there is little need for DepData anymore. Use parent/child TxIdxs to refer to dependencies, and find their top set by having a child TxIdx-indexed vector in each TxData, rather than a list of dependencies. This makes code for iterating over dependencies more natural and simpler.
The combined size of TxData::dep_top_idx can be 16 KiB with 64 transactions and SetIdx = uint32_t. Use a smaller type where possible to reduce memory footprint and improve cache locality of m_tx_data. Also switch from an std::vector to an std::array, reducing allocation overhead and indirections.
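To make the 16 KiB figure concrete: 64 transactions, each with a 64-entry table of 4-byte indices, is 64 × 64 × 4 bytes. A one-byte element type is enough to index up to 64 SetInfo slots, and a fixed std::array keeps the table inline with its TxData instead of behind a heap pointer. A hedged sketch with illustrative names (not the PR's actual definitions):

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr std::size_t MAX_CLUSTER = 64;  // assumed cluster size limit

// Wide variant: a heap-allocated vector of 32-bit indices per transaction.
struct TxDataWide { std::vector<uint32_t> dep_top_idx; };

// Narrow variant: an inline fixed array of 8-bit indices (enough to index
// up to 64 SetInfo slots); no allocation, contiguous with the TxData.
struct TxDataNarrow { std::array<uint8_t, MAX_CLUSTER> dep_top_idx; };

// Combined table sizes across a full 64-transaction cluster:
constexpr std::size_t WIDE_BYTES = MAX_CLUSTER * MAX_CLUSTER * sizeof(uint32_t);
constexpr std::size_t NARROW_BYTES = MAX_CLUSTER * MAX_CLUSTER * sizeof(uint8_t);
```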
This is a simple refactor to make the code more readable.
The current process consists of iterating over the transactions of the chunk one by one, and then for each figuring out which of its parents/children are in unprocessed chunks. Simplify this (and speed it up slightly) by splitting this process into two phases: first determine the union of all parents/children, and then find which chunks those belong to.
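The two-phase idea can be sketched as follows, using a uint64_t bitmask as a stand-in for the BitSet type (function and parameter names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// tx_parents[i]: bitmask of parents of tx i; chunk_of[i]: chunk index of tx i.
// Returns the chunks containing parents of the given chunk's transactions.
std::set<int> ParentChunks(uint64_t chunk_txs,
                           const std::vector<uint64_t>& tx_parents,
                           const std::vector<int>& chunk_of) {
    // Phase 1: union of all parents of transactions in the chunk.
    uint64_t all_parents = 0;
    for (std::size_t i = 0; i < tx_parents.size(); ++i) {
        if (chunk_txs >> i & 1) all_parents |= tx_parents[i];
    }
    all_parents &= ~chunk_txs;  // only parents outside the chunk itself
    // Phase 2: find which chunks those parents belong to.
    std::set<int> out;
    for (std::size_t i = 0; i < chunk_of.size(); ++i) {
        if (all_parents >> i & 1) out.insert(chunk_of[i]);
    }
    return out;
}
```

Compared with a per-transaction scan, the union is built first, so each outside transaction is mapped to its chunk at most once.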
Instead of computing the set of reachable transactions inside PickMergeCandidate, precompute this information, updating it in Activate (by merging the two chunks' reachable sets) and Deactivate (by recomputing). This is a small performance gain in itself, but also a preparation for future optimizations that rely on quickly testing whether dependencies between chunks exist.
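A sketch of the incremental-maintenance idea, again with uint64_t standing in for BitSet and hypothetical names: merging needs no recomputation because reachability unions, and the precomputed sets make the "does a dependency between chunks exist" test a single mask intersection.

```cpp
#include <cassert>
#include <cstdint>

// reach_x: transactions reachable from chunk X's transactions through the
// dependency graph, as a bitmask. On Activate (merge), the merged chunk's
// reachable set is just the union of the inputs'; only Deactivate (split)
// needs to recompute from scratch.
uint64_t MergeReachable(uint64_t reach_a, uint64_t reach_b) {
    return reach_a | reach_b;
}

// With reachable sets precomputed, testing whether chunk A reaches any
// transaction of chunk B is one intersection:
bool Reaches(uint64_t reach_a, uint64_t txs_b) {
    return (reach_a & txs_b) != 0;
}
```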
Future changes will rely on knowing the chunk indexes of the two created chunks after a split. It is natural to return this information from Deactivate, which also simplifies MergeSequence.
After a split, if the top part has a dependency on the bottom part, the first MergeSequence will always perform this merge and then stop. This is referred to as a self-merge. We can special-case these by detecting self-merges early, avoiding the overhead of a full MergeSequence, which involves two PickMergeCandidate calls (a successful and an unsuccessful one).
This means we can iterate over all active dependencies in a cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of active dependencies in a set of ntx transactions is at most ntx-1. (*) Asymptotically, this is not actually true, as for large transaction counts, iterating over a BitSet still scales with ntx. In practice, however, where BitSets are represented by a constant number of integers, it holds.
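A sketch of the iteration claim (hypothetical names, uint64_t standing in for BitSet): with one pooled SetInfo slot per transaction, the slots not marked as chunks are exactly the active dependencies, so one pass over ntx bits visits them all, regardless of how many dependencies the cluster has in total.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// all_slots: one bit per transaction's pooled SetInfo slot.
// chunk_idxs: the slots currently used for chunks. The remaining slots hold
// active dependencies' top sets, so enumerating them is one bitset walk of
// ntx bits, not a traversal of all ndeps dependencies.
std::vector<int> ActiveDepSlots(uint64_t all_slots, uint64_t chunk_idxs) {
    std::vector<int> out;
    uint64_t active = all_slots & ~chunk_idxs;
    for (int i = 0; i < 64; ++i) {
        if (active >> i & 1) out.push_back(i);
    }
    return out;
}
```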
This avoids adding them a second time to m_suboptimal_chunks when they happen to already be there.
It suffices to initially only attempt one direction of merges in MakeTopological(), and only try both directions on chunks that are the result of other merges.
The two calls to UpdateChunk, in Activate and Deactivate each, are subtly different: the top one needs to update the chunk_idx of iterated transactions, while the bottom one leaves it unchanged. To exploit this difference, inline the four function calls, getting rid of UpdateChunk. This is also a preparation for a future improvement that inlines the recomputation of reachable sets in the same loop in Deactivate.
Avoid two full iterations over all of a chunk's transactions to recompute the reachable sets, by inlining them into the dependency-updating loops. Note that there is no need to do the same for Activate, because the reachable sets after merging can be computed directly from the input chunks' reachable sets. Deactivate needs to recompute them, however.
reACK c2fcf25
Yay! Do you want me to try to perform some of those splits here? I think the introduction/usage of
The reason is that on Activation, the top chunk becomes the new dependency's top set, and thus the other one - the bottom chunk - needs to become the merged chunk's set (and thus returned by
I think it's useful for understanding the changes incrementally, but I'm not sure anyone's likely to want to do that (vs just understanding the latest implementation from scratch) outside of reviewing this PR, in which case they can look at my branch if they want?
Ah, essentially saving swapping two (24-byte?) SetInfos, I guess. reACK c2fcf25
Follow-up to #32545, part of #30289.
This contains many of the optimizations that were originally part of #32545 itself.
Here is a list of commits and the geometric average of the benchmark timings. Note that these aren't worst cases, but because of the nature of the optimizations here, I do expect them to apply roughly equally to all kinds of clusters. In other words, the relative improvement shown by these numbers should be representative:
And to show that they apply to all clusters roughly similarly: