
Optimized SFL cluster linearization#34023

Merged
fanquake merged 20 commits into bitcoin:master from sipa:202512_sfl_opt
Feb 18, 2026

Conversation

@sipa
Member

@sipa sipa commented Dec 6, 2025

Follow-up to #32545, part of #30289.

This contains many of the optimizations that were originally part of #32545 itself.

Here is a list of commits and the geometric average of the benchmark timings. Note that these aren't worst cases, but because of the nature of the optimizations here, I do expect them to apply roughly equally to all kinds of clusters. In other words, the relative improvement shown by these numbers should be representative:

| commit title | ns per optimal linearization |
| --- | ---: |
| clusterlin: split tx/chunk dep counting (preparation) | 24,760.30 |
| clusterlin: count chunk deps without loop (optimization) | 24,677.64 |
| scripted-diff: rename _rep -> _idx in SFL | 24,640.08 |
| clusterlin: get rid of DepData, reuse sets (optimization) | 24,389.01 |
| clusterlin: improve TxData::dep_top_idx type (optimization) | 22,578.58 |
| clusterlin: abstract out functions from MergeStep (refactor) | 22,577.15 |
| clusterlin: split up OptimizeStep (refactor) | 22,981.11 |
| clusterlin: simplify PickMergeCandidate (optimization) | 22,018.63 |
| clusterlin: precompute reachable sets (optimization) | 21,194.91 |
| clusterlin: make MergeSequence take SetIdx (simplification) | 21,135.60 |
| clusterlin: special-case self-merges (optimization) | 20,588.13 |
| clusterlin: keep track of active children (optimization) | 13,911.22 |
| clusterlin: track suboptimal chunks (optimization) | 13,629.42 |
| clusterlin: unidirectional MakeTopological initially (optimization) | 12,796.57 |
| clusterlin: inline UpdateChunk into (De)Activate (optimization) | 12,706.35 |
| clusterlin: inline GetReachable into Deactivate (optimization) | 12,525.66 |

And to show that they apply to all clusters roughly similarly:

[chart: per-benchmark relative timings, master vs. this PR]

@DrahtBot
Contributor

DrahtBot commented Dec 6, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34023.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK instagibbs, ajtowns

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #34138 (Cluster mempool: more accurate cost model for SFL by sipa)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@sipa sipa mentioned this pull request Dec 6, 2025
@DrahtBot
Contributor

DrahtBot commented Dec 6, 2025

🚧 At least one of the CI tasks failed.
Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/19993781856/job/57338323718
LLM reason (✨ experimental): Cluster linearize tests fail due to a size mismatch (linearization size != depgraph TxCount) triggering assertion failures and aborted tests.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@sipa sipa force-pushed the 202512_sfl_opt branch 4 times, most recently from c3b0b64 to 73c3ea3 Compare December 7, 2025 02:09
@DrahtBot DrahtBot removed the CI failed label Dec 7, 2025
@sipa sipa force-pushed the 202512_sfl_opt branch 3 times, most recently from 447c370 to fd6d05b Compare December 11, 2025 22:11
@sipa sipa force-pushed the 202512_sfl_opt branch 2 times, most recently from 5ba9b9b to e136646 Compare December 12, 2025 23:05
@DrahtBot
Contributor

🚧 At least one of the CI tasks failed.
Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/20182154259/job/57944831923
LLM reason (✨ experimental): Compiler error: -Werror treats a range-for over a tuple as an error in cluster_linearize.h, causing the build to fail.


@sipa sipa force-pushed the 202512_sfl_opt branch 4 times, most recently from ad3cc10 to 6a04276 Compare December 14, 2025 22:46
Member

@instagibbs instagibbs left a comment


ACK 87d74c1

Changes make sense, did not run fuzzer myself. Non-blocking nits only

In the chart's x-axis, what's the denominator?

@sipa
Member Author

sipa commented Feb 16, 2026

Addressed @instagibbs' comments, added a commit to drop the DepGraph& argument to SpanningForestState::SanityCheck(), and added extra comments and assertions suggested by Claude (reviewed/applied/edited by me).

In the chart's x-axis, what's the denominator?

"benchmark count", I guess. Each benchmark takes up equal x-axis space, sorted by master's runtime.

EDIT: oh, in the label? Those are "ntx/ndeps" for the test.

Contributor

@ajtowns ajtowns left a comment


Pushed an update that splits the "get rid of DepData" commit in two:

Yeah, I'm still not smart enough to understand things unless they're way more bite-size than that. I broke the first of those commits down into 12 commits before it really made sense. OTOH, I think that found a couple of bugs/typos, so yay?

I didn't really see why you swapped the parent/child ordering in Activate (return child_chunk_idx), but I don't see any harm in it either.

ACK 7a0309b though probably should do the fix ups.

sipa and others added 20 commits February 17, 2026 09:04
Since the deterministic ordering change, SpanningForestState holds a
reference to the DepGraph it is linearizing. So this means we do not
need to pass it to SanityCheck() as an argument anymore.
This splits the chunk_deps variable in LoadLinearization in two, one for
tracking tx dependencies and one for chunk dependencies. This is a
preparation for a later commit, where chunks won't be identified anymore
by a representative transaction in them, but by a separate index. With
that, it seems weird to keep them both in the same structure if they
will be indexed in an unrelated way.

Note that the changes in src/test/util/cluster_linearize.h to the table
of worst observed iteration counts are due to switching to a different
data set, and are unrelated to the changes in this commit.
This small optimization avoids the need to loop over the parents of each
transaction when initializing the dependency-counting structures inside
GetLinearization().
This is a preparation for the next commit, where chunks will no longer
be identified using a representative transaction, but using a set index.
Reduce the load of line changes by doing this rename ahead of time.

-BEGIN VERIFY SCRIPT-
sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
-END VERIFY SCRIPT-
This significantly changes the data structures used in SFL, based on the
observation that the DepData::top_setinfo fields are quite wasteful:
there is one per dependency (up to n^2/4), but we only ever need one per
active dependency (of which there are at most n-1). In total, the number of
chunks plus the number of active dependencies is always exactly equal to
the number of transactions, so it makes sense to have a shared pool of
SetInfos, which are used for both chunks and top sets.

To that effect, introduce a separate m_set_info variable, which stores a
SetInfo per transaction. Some of these are used for chunk sets, and some
for active dependencies' top sets. Every activation transforms the
parent's chunk into the top set for the new dependency. Every
deactivation transforms the top set into the new parent chunk.

With indexes into m_set_info (SetIdx) becoming bounded by the number of
transactions, we can use a SetType to represent sets of SetIdxs.
Specifically, an m_chunk_idxs is added which contains all SetIdx
referring to chunks. This leads to a much more natural way of iterating
over chunks.

Also use this opportunity to normalize many variable names.
With the earlier change to pool SetInfo objects, there is little need
for DepData anymore. Use parent/child TxIdxs to refer to dependencies,
and find their top set by having a child TxIdx-indexed vector in each
TxData, rather than a list of dependencies. This makes code for
iterating over dependencies more natural and simpler.
The combined size of TxData::dep_top_idx can be 16 KiB with 64
transactions and SetIdx = uint32_t. Use a smaller type where possible to
reduce memory footprint and improve cache locality of m_tx_data.

Also switch from an std::vector to an std::array, reducing allocation
overhead and indirections.
This is a simple refactor to make the code more readable.
The current process consists of iterating over the transactions of the
chunk one by one, and then, for each, figuring out which of its
parents/children are in unprocessed chunks.

Simplify this (and speed it up slightly) by splitting this process into
two phases: first determine the union of all parents/children, and then
find which chunks those belong to.
Instead of computing the set of reachable transactions inside
PickMergeCandidate, make the information precomputed, and updated in
Activate (by merging the two chunks' reachable sets) and Deactivate (by
recomputing).

This is a small performance gain in itself, but also a preparation for
future optimizations that rely on quickly testing whether dependencies
between chunks exist.
Future changes will rely on knowing the chunk indexes of the two created
chunks after a split. It is natural to return this information from
Deactivate, which also simplifies MergeSequence.
After a split, if the top part has a dependency on the bottom part, the
first MergeSequence will always perform this merge and then stop. This
is referred to as a self-merge.

We can special case these by detecting self-merges early, and avoiding
the overhead of a full MergeSequence which involves two
PickMergeCandidate calls (a successful and an unsuccessful one).
This means we can iterate over all active dependencies in a
cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of
active dependencies in a set of transactions of size ntx is at most ntx-1.

(*) Asymptotically, this is not actually true, as for large transaction
counts, iterating over a BitSet still scales with ntx. In practice
however, where BitSets are represented by a constant number of integers,
it holds.
This avoids adding them a second time to m_suboptimal_chunks when they
happen to already be there.
It suffices to initially only attempt one direction of merges in
MakeTopological(), and only try both directions on chunks that are the
result of other merges.
The two calls to UpdateChunk, in Activate and Deactivate each, are subtly
different: the top one needs to update the chunk_idx of iterated
transactions, while the bottom one leaves it unchanged. To exploit this
difference, inline the four function calls, getting rid of UpdateChunk.

This is also a preparation for a future improvement that inlines the
recomputation of reachable sets in the same loop in Deactivate.
Avoid two full iterations over all of a chunk's transactions to
recompute the reachable sets, by inlining them into the
dependency-updating loops.

Note that there is no need to do the same for Activate, because the
reachable sets after merging can be computed directly from the input
chunks' reachable sets. Deactivate needs to recompute them, however.
@instagibbs
Member

reACK c2fcf25

@sipa
Member Author

sipa commented Feb 17, 2026

OTOH, I think that found a couple of bugs/typos, so yay?

Yay!

Do you want me to try to perform some of those splits here? I think the introduction/usage of m_chunk_idxs may be reasonable.

I didn't really see why you swapped the parent/child ordering in Activate (return child_chunk_idx), but I don't see any harm in it either.

The reason is that on Activation, the top chunk becomes the new dependency's top set, and thus the other one - the bottom chunk - needs to become the merged chunk's set (and thus returned by Activate). By reusing the top chunk as top set, we avoid the need for updating things. On Deactivation, the reverse is done (the initial chunk becomes the bottom chunk, the dependency top set becomes the top chunk).

@ajtowns
Contributor

ajtowns commented Feb 17, 2026

OTOH, I think that found a couple of bugs/typos, so yay?

Yay!

Do you want me to try to perform some of those splits here? I think the introduction/usage of m_chunk_idxs may be reasonable.

I think it's useful for understanding the changes incrementally, but I'm not sure anyone's likely to want to do that (vs just understanding the latest implementation from scratch) outside of reviewing this PR, in which case they can look at my branch if they want?

I didn't really see why you swapped the parent/child ordering in Activate (return child_chunk_idx), but I don't see any harm in it either.
By reusing the top chunk as top set, we avoid the need for updating things.

Ah, essentially saving swapping two (24 byte?) SetInfo's I guess.

reACK c2fcf25
