@furszy furszy commented Jan 25, 2023

The current procedure for building the block filter index processes filters one at a time,
reading blocks, undo data, and previous headers from disk sequentially.

This PR introduces a mechanism to perform the work concurrently, dividing the filter
generation workload among a user-configurable pool of workers and significantly
speeding up index construction.

The same concurrent processing model has been applied to the transactions index as well.

The newly introduced init flag -indexworkers=<n> enables the concurrent sync
behavior, where "n" is the number of worker threads spawned at startup to create ranges
of block filters during the initial sync process. The worker pool is destroyed once the
initial sync completes.
Note: by default, the parallelized sync process is not enabled.
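
For example, to enable the block filter index with four worker threads:

./build/bin/bitcoind -blockfilterindex -indexworkers=4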

Now the juicy part:
On my machine, with the node built in debug mode and in IBD, with -indexworkers=4, the
block filter index generation took less than an hour, while on master the sync took more than 7 hours.

Important Note:
As access to the block data on disk is protected by cs_main, this new feature runs substantially
faster when the node is not in IBD.

Testing Notes:

  1. Sync your node without any index.

  2. Restart the node with one of the indexes (-blockfilterindex or -txindex) and -connect=0 (this syncs only the index, without running the net/validation threads; since threads won't be competing for cs_main, it gives a more accurate result).

    You’ll see a "[index name] is enabled at height [height]" log entry once it finishes. Then it’s just a matter of subtracting the index startup time from the "index synced" log time.

Keep in mind that threads are shared among all indexes you start. So if you run both indexes at the same time, your benchmark results cannot be compared against single-index runs.

Fuzz Test Coverage Report

Coverage diff for the fuzz test can be found on https://brunoerg.xyz/bitcoin-core-coverage/26966/coverage_report/

DrahtBot commented Jan 25, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/26966.

Reviews

See the guideline for information on the review process.

Concept ACK: w0xlt, TheCharlatan, ismaelsadeeq, mzumsande
Approach ACK: ryanofsky
Approach NACK: l0rinc
Stale ACK: pinheadmz

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33689 (http: replace WorkQueue and single threads handling for ThreadPool by furszy)
  • #31308 (ci, iwyu: Treat warnings as errors for specific directories by hebasto)
  • #31260 (scripted-diff: Type-safe settings retrieval by ryanofsky)
  • #29770 (index: Check all necessary block data is available before starting to sync by fjahr)
  • #29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)
  • #17783 (common: Disallow calling IsArgSet() on ALLOW_LIST options by ryanofsky)
  • #17581 (refactor: Remove settings merge reverse precedence code by ryanofsky)
  • #17580 (refactor: Add ALLOW_LIST flags and enforce usage in CheckArgFlags by ryanofsky)
  • #17493 (util: Forbid ambiguous multiple assignments in config file by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • a posterior -> later [“a posterior” is incorrect here; use “later” (or “posterior”) to convey a subsequent check]


Sjors commented Jan 26, 2023

Cool, will take it for a spin...

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch from d8b0f5e to 65dc850 on January 28, 2023, 15:04
furszy commented Jan 28, 2023

Cool @Sjors, just pushed a small update. Found a little bug.

Going to add an important note to the PR description (because otherwise testing results will vary a lot):

> As access to the block data on disk is protected by cs_main, this new feature runs substantially
> faster when the node is not in IBD.
> (Here "substantially" means a full index sync, with 5 workers, in less than 20 minutes on my machine.)

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch from 65dc850 to 09cc56a on January 28, 2023, 15:40
@mzumsande mzumsande left a comment


It's probably much easier suggested than done, but did you attempt to implement parallelization in a more general way so that other indices could benefit from it as well? At first glance, txindex and the indices suggested in PRs (#24539, #26951) seem to be parallelizable too (not coinstatsindex, though).

furszy commented Jan 30, 2023

> It's probably much easier suggested than done, but did you attempt to implement parallelization in a more general way so that other indices could benefit from it as well?

Yeah, that is part of the plan. I started with the block filter index because it requires special treatment that txindex parallelization does not: block/undo data reading and block filter creation can be parallelized, but writing must be done sequentially due to the need to link filter headers to their predecessors to build the filter headers-chain on disk.
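
To illustrate that constraint, here is a minimal sketch; ComputeFilter, DeriveHeader, and WriteToDisk are hypothetical stand-ins, not the PR's API. Filter computation fans out across threads, while headers are derived and written strictly in height order:

```cpp
#include <future>
#include <vector>

struct Filter { std::vector<unsigned char> data; };
struct FilterHeader { unsigned char hash[32]; };

// Hypothetical helpers standing in for the real block filter code:
Filter ComputeFilter(int height);                                     // parallel-safe
FilterHeader DeriveHeader(const Filter& f, const FilterHeader& prev); // commits to prev
void WriteToDisk(int height, const Filter& f, const FilterHeader& h);

void SyncRange(int start, int end, FilterHeader prev_header)
{
    // Phase 1: compute filters concurrently; each block is independent.
    std::vector<std::future<Filter>> jobs;
    for (int h = start; h <= end; ++h) {
        jobs.push_back(std::async(std::launch::async, ComputeFilter, h));
    }
    // Phase 2: consume results strictly in height order, since each filter
    // header links to its predecessor, forming the headers-chain on disk.
    for (int h = start; h <= end; ++h) {
        const Filter f = jobs[h - start].get();
        prev_header = DeriveHeader(f, prev_header);
        WriteToDisk(h, f, prev_header);
    }
}
```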

My idea was to get this one reviewed first, so the process gets as clean as possible, and then move forward with the generalization step. It's usually more natural to abstract processes when the specific cases are well-defined.

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch from 09cc56a to 43e6237 on January 30, 2023, 16:15
@w0xlt w0xlt left a comment


Concept ACK

Perhaps the parallelization and index logic could be kept separate, so the parallelization could be reused by other indexes.

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch 2 times, most recently from b2b62f5 to 4e1fba1 on January 30, 2023, 20:55
Sjors commented Feb 7, 2023

Building the index (AMD Ryzen 7950X, blocks stored on SSD):

  • master @ fe86616: 35'15" (mostly 1 alternating CPU thread)
  • this PR (rebased)
    • n=8: 5'20" (uses about 8 of 32 CPU threads as expected)
    • n=32: 4'26" (pleasantly close to 100% CPU usage with a dip every 10 seconds, but it drops to only 1 CPU in the last minute or two)

I made sure to not load any wallets and disabled other indexes.

I didn't test if the index was correct.

I wonder if, for users without this index, it would be faster to generate the index, rescan the wallet and then delete it again. Combined with #26951 you would only have to generate filters up to the age of the wallet (IIUC, cc @pstratem).

Note to self for future benchmarks: -txindex takes 1:07'14", -coinstatsindex takes 3.5 hours.

furszy commented Feb 7, 2023

Great results @Sjors!

You could also give it a run rebased on top of #27006.
On master, the index initial sync is slower when the node is in IBD because the index thread has to compete for access to block data on disk through cs_main acquisition.

> I didn't test if the index was correct.

The PR contains a test verifying it.


Side note:
I'm working on generalizing the parallelization flow so other indexes, like txindex and #26951, can make use of it too.

furszy commented Oct 14, 2025

> I also measured the txindex with -indexworkers=15 on a 32 vcpu machine and it was slightly slower than master. I guess this makes sense, since each thread needs to write the index data to leveldb and the actual work done by the CPU is minimal (serializing, then computing offsets), so each thread is waiting for the other threads to finish writing. For blockfilter, the filter computation is more intensive, and each filter is written to a separate file. Only after the filter is written is its location written to the shared leveldb.
>
> Edit: Just tried again on latest push f130332 on a node synced to 913866:
>
> ./build/bin/bitcoind -connect=0 -disablewallet -txindex=1 -indexworkers=5
>
> branch (f130332): 90 minutes 41 seconds
> master (becf150): 84 minutes 42 seconds
>
> So master was ~7% faster than this PR with 5 workers on an SSD.
>
> Maybe we don't enable parallel sync for txindex?

I'm still working on the current approach, but I synced the tx index in the same environment as before: bare metal with an SSD, synchronizing Signet up to block height 240593 (following the testing steps in the PR description). Results:

Parallel Mode (5 workers): ~5 minutes.

2025-10-13T19:54:55Z initload thread exit
2025-10-13T19:55:28Z Syncing txindex with block chain from height 149999
2025-10-13T19:55:59Z Syncing txindex with block chain from height 197999
2025-10-13T19:56:32Z Syncing txindex with block chain from height 224999
2025-10-13T19:57:19Z Syncing txindex with block chain from height 225999
2025-10-13T19:58:13Z Syncing txindex with block chain from height 231999
2025-10-13T19:58:57Z Syncing txindex with block chain from height 236999
2025-10-13T19:59:09Z txindex is enabled at height 240593

Sequential Mode: 13 minutes.

2025-10-13T20:00:02Z initload thread exit
2025-10-13T20:00:34Z Syncing txindex with block chain from height 64999
2025-10-13T20:01:04Z Syncing txindex with block chain from height 84999
2025-10-13T20:01:35Z Syncing txindex with block chain from height 130999
2025-10-13T20:02:05Z Syncing txindex with block chain from height 182999
2025-10-13T20:02:36Z Syncing txindex with block chain from height 192999
2025-10-13T20:03:16Z Syncing txindex with block chain from height 200999
2025-10-13T20:03:47Z Syncing txindex with block chain from height 212999
2025-10-13T20:04:20Z Syncing txindex with block chain from height 223999
2025-10-13T20:05:21Z Syncing txindex with block chain from height 226999
2025-10-13T20:05:54Z Syncing txindex with block chain from height 227999
2025-10-13T20:06:29Z Syncing txindex with block chain from height 228999
2025-10-13T20:06:59Z Syncing txindex with block chain from height 229999
2025-10-13T20:07:39Z Syncing txindex with block chain from height 230999
2025-10-13T20:08:39Z Syncing txindex with block chain from height 231999
2025-10-13T20:09:22Z Syncing txindex with block chain from height 232999
2025-10-13T20:10:09Z Syncing txindex with block chain from height 233999
2025-10-13T20:10:48Z Syncing txindex with block chain from height 234999
2025-10-13T20:11:38Z Syncing txindex with block chain from height 236999
2025-10-13T20:12:14Z Syncing txindex with block chain from height 237999
2025-10-13T20:13:02Z Syncing txindex with block chain from height 239999
2025-10-13T20:13:06Z txindex is enabled at height 240593

———

Furthermore, though I haven't pushed it yet, these results are from the batching DB writes branch:

Parallel Mode (5 workers), batching DB writes: ~2 minutes 30 seconds

2025-10-13T20:30:54Z initload thread exit
2025-10-13T20:31:25Z Syncing txindex with block chain from height 181999
2025-10-13T20:31:56Z Syncing txindex with block chain from height 217999
2025-10-13T20:32:29Z Syncing txindex with block chain from height 225999
2025-10-13T20:33:14Z Syncing txindex with block chain from height 230999
2025-10-13T20:33:50Z Syncing txindex with block chain from height 236999
2025-10-13T20:33:54Z txindex is enabled at height 240593
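
For reference, "batching DB writes" here means accumulating entries for many blocks and committing them in a single write instead of one write per block. A generic sketch using the raw leveldb API (Bitcoin Core wraps leveldb behind its own classes, so this is illustrative only):

```cpp
#include <string>
#include <utility>
#include <vector>

#include <leveldb/db.h>
#include <leveldb/write_batch.h>

// Commit index entries for a whole range of blocks with one disk write,
// instead of one leveldb write per block.
leveldb::Status FlushEntries(leveldb::DB& db,
                             const std::vector<std::pair<std::string, std::string>>& entries)
{
    leveldb::WriteBatch batch;
    for (const auto& [key, value] : entries) {
        batch.Put(key, value); // buffered in memory, nothing hits disk yet
    }
    return db.Write(leveldb::WriteOptions(), &batch); // single atomic write
}
```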

So I’d guess that it’s not only related to disk access, but the difference might also be tied to your CPU virtualization technology? @andrewtoth.

Moreover, I mentioned this somewhere earlier in this PR, but it probably got lost over the years: this PR also prepares the ground for batching DB writes across blocks, which will improve sequential runs as well and benefit every user.

Also, parallelization is disabled by default. I don’t think it’s worth removing it for the tx index just because some users might not be able to take advantage of it. It seems to me that the benefits for those who can run it greatly outweigh the limitations for those who can’t (they’re not losing anything with this PR, since sequential mode is still available).

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch 2 times, most recently from fcdd89b to f7cf59e on October 14, 2025, 20:10
l0rinc commented Oct 14, 2025

what's the reason for the frequent pushes here recently?

furszy commented Oct 14, 2025

> what's the reason for the frequent pushes here recently?

I wrote it above, #26966 (comment). I'm working on a few changes that improve parallelism across indexes and the structure of the code; they work properly locally but are not stable on the CI yet. I will update the PR with the modifications once finished.

@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch 2 times, most recently from 3dede1f to 831dec8 on October 15, 2025, 16:58
furszy commented Oct 15, 2025

Way forward

Given that this PR has struggled to attract reviewers over the past two years, I think we need a different approach.
This is a risky change that touches critical index infrastructure - let's do it in tiny, focused steps instead.

I'd like to propose keeping this PR open as a draft tracking PR that contains the full end-to-end implementation for reference and discussion. Meanwhile, we can merge the work through a series of smaller, focused PRs:

  • Start with a minimal implementation: single index (blockfilterindex), hardcoded 2 threads, minimal configurability
  • Get that reviewed, tested, and merged - let's see what users think
  • Then add the thread pool infrastructure in a separate PR (if still needed - we might find simpler per-index solutions work better)
  • Then add txindex and coinstatsindex (and eventually chainstate) support once we have confidence in the approach (and fix the current implementation issues).

For the "hardcoded 2 threads" step, I wrote this above but adding 2 threads or n threads represents the exact same work. The code complexity comes from the two-step procedure that allows sequential dumps after parallel block data construction (a restriction that comes from the filters headers-chain structure; headers are linked to each other). It is not related to the number of threads introduced, that's is simply controlled by the thread pool introduced in the first commit. That line simply doesn't make sense to me.

Furthermore, the description of this as "critical index infrastructure" is not entirely accurate. We are not dealing with consensus-related code here - this code lives on a different level. Existing tests already verify both its behavior and outcomes, and the previous sequential sync behavior is retained. This PR introduces a new optional feature which, in the worst-case scenario, can run slower than sequential sync if it is badly configured by the user (something that could also happen with other settings, such as a high number of block signature verification threads).

Also, about "Given that this PR has struggled to attract reviewers over the past two years":
I don't think that is fair to say at the moment. This PR has gotten quite nice reviews lately and has been ramping up nicely. That's mainly why I ended up modifying part of it and simplifying some workflows (an update will come in the next comment).
I think the problem in past years was mainly the lack of urgency for this new feature, plus the requirement of understanding not only what indexes are and how they are composed, but also the underlying structure of the block filters chain.
It is just a matter of coordination between people who have worked on this code before and anyone else who wants to spend time understanding it deeply. We are happily starting to get the ball rolling in that direction now.

furszy commented Oct 15, 2025

Now that the CI is happy again, an update regarding the introduced changes:

Before, we had queues at two different levels: one at the thread pool level and another at the index level. The initial rationale behind it was to granularly control the way the process interrupts them when needed, not draining the index's pending_tasks queue as soon as the first task fails to process or a shutdown is triggered. But, after talking with mzumsande last week, I realized that this was actually over-engineering for not much gain, and that it was harming parallelism across indexes: top-level index queues were racing with each other to take control of the underlying worker threads.
By unifying these two queues, we let the thread pool alone decide how and when to handle the tasks, with no index interference at all. This allows for much better parallelism among any piece of code using the same workers (not just indexes), and also for further improvements down the road, like replacing a single spot with a lock-free structure, or even prioritizing certain tasks at the worker level - this last one was an idea from l0rinc two weeks ago.
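
A minimal sketch of that unified single-queue design, assuming only standard C++ primitives; it is illustrative, not the PR's actual ThreadPool class:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimpleThreadPool
{
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_tasks; // the single, shared queue
    std::vector<std::thread> m_workers;
    bool m_stop{false};

public:
    explicit SimpleThreadPool(int worker_count)
    {
        for (int i = 0; i < worker_count; ++i) {
            m_workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock lock{m_mutex};
                        m_cv.wait(lock, [this] { return m_stop || !m_tasks.empty(); });
                        if (m_stop && m_tasks.empty()) return;
                        task = std::move(m_tasks.front());
                        m_tasks.pop();
                    }
                    task(); // run outside the lock so other workers keep pulling
                }
            });
        }
    }

    // Any client (any index, or non-index code) submits into the same queue;
    // the pool alone decides when and where tasks run.
    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard lock{m_mutex};
            m_tasks.push(std::move(task));
        }
        m_cv.notify_one();
    }

    ~SimpleThreadPool()
    {
        {
            std::lock_guard lock{m_mutex};
            m_stop = true;
        }
        m_cv.notify_all();
        for (auto& w : m_workers) w.join();
    }
};
```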

So, in short, the code should be cleaner and run faster now, using fewer mutexes. Also, there is a possible follow-up improvement: moving the final post-processing round into the last task to finish (in case more than one task finishes around the end of sync). I didn't do it here because it comes with extra code complexity that I don't think is worth adding to this PR; for the same reason, I don't think parallelizing the coinstats index here is a good idea, nor batching DB writes. Better to go in steps, as the gains introduced here are already a pretty solid step forward.

furszy commented Oct 21, 2025

Small update: per discussion, I will push the thread pool in an isolated PR in the coming week, replacing the current http server WorkQueue and all its related code. It is an easy dedup that adds test coverage to an area that has never been unit- or fuzz-tested, plus a way to start using this class in the repo so that others can build on it.

Also, I'm investigating the tx index slowdown with @andrewtoth, trying to understand the differences between our setups. Worst-case scenario, we can disable parallel sync for it (will update accordingly in the coming week).
That being said, this refactoring enables batching DB writes as well, which will speed up the tx index anyway. But I prefer to do that in a follow-up, to avoid expanding this PR further.

Last note: I reworked the large commit, splitting it into smaller chunks. We can come back to this one after (or while) the http server thread handling PR moves on. But I will focus on moving forward there first, per feedback.

furszy and others added 10 commits October 24, 2025 01:26
No behavior changes.

This separation lays the groundwork for parallelizing the block data
digest phase, while preserving sequential consistency for indexes
that require it before dumping the data to disk.
And it will also allow us to batch database and file writes across
multiple blocks.

Essentially, it introduces a two-phase block processing mechanism using
'CustomProcessBlock()' and 'CustomPostProcessBlocks()':

1) 'CustomProcessBlock()' will be in charge of digesting block data and
   returning a custom result object per index.
   The goal is to encapsulate all procedures that can safely run
   concurrently here.

2) 'CustomPostProcessBlocks()' receives all digested objects in order,
   allowing the index to perform any subsequent steps that require sequentiality.
   For example, the block filter index requires sequentiality in the headers-chain
   construction, as each record depends on its predecessor hash.

Important note: at present, the code executes entirely synchronously.
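
(An aside, not part of the commit message: a rough sketch of the two-phase interface shape described above. The concrete signatures in the PR may differ; the result type and return values here are assumptions.)

```cpp
#include <vector>

class CBlock; // block data, defined elsewhere in the codebase

class BaseIndexSketch
{
protected:
    struct BlockResult { /* index-specific digest of one block */ };

    // Phase 1: digest one block into a result object. Everything that can
    // safely run concurrently is encapsulated here.
    virtual BlockResult CustomProcessBlock(const CBlock& block) = 0;

    // Phase 2: receives the digests in block order, so steps that require
    // sequentiality (e.g. chaining filter headers, where each record depends
    // on its predecessor's hash) happen here before data is dumped to disk.
    virtual bool CustomPostProcessBlocks(std::vector<BlockResult>& results) = 0;
};
```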
No behavior changes. The index still processes one block at a time.
The code was refactored to allow handling multiple blocks in the
processing phase.

Introducing the concept of a "Task", which merely represents a range
of blocks and their resulting digests.

In the next commit, we will take advantage of this to process batches
of blocks concurrently.
Move progress-related timers (logging and locator write) into a shared
SyncContext struct with atomic members. This does not change behavior
but prepares the code for safe multi-threaded execution in the next
commit.
No behavior changes.

This refactor moves the existing processing logic from BaseIndex::Sync
into BaseIndex::RunTask, simplifying the main sync loop and laying
groundwork for parallelization of block processing in the upcoming commit.
The goal is to have BaseIndex::RunTask running concurrently.
And add an option to customize the thread pool worker count
This introduces parallel sync for the indexes' initial sync,
distributing the workload across multiple threads.

When enabled, the chain is divided into fixed-size block ranges called
"tasks". Worker threads consume tasks concurrently, calling
CustomProcessBlock over their assigned range and storing results in a
shared context, while opportunistically batching and dumping the
collected information to disk sequentially (when needed).

Since large reorgs are improbable during initial sync (headers-chain PoW
dictates the valid chain), reorgs are detected only before syncing begins
and once it completes. Any new blocks connected during the process are
caught up sequentially at the end. This, together with the fact that we
no longer depend on an intermediate "index best block" that might be out
of sync with the index m_best_block_index, allows us to remove the
index_reorg_crash test, as it is no longer possible to call Rewind on a
block that is not the index tip.
It also adds coverage for initial sync from a particular block,
mimicking a node restart.
Co-authored-by: furszy <matiasfurszyfer@protonmail.com>
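
(A hypothetical illustration of the fixed-size "tasks" described in the commit notes above; the PR's actual task size and bookkeeping may differ.)

```cpp
#include <algorithm>
#include <vector>

struct Task {
    int start_height;
    int end_height; // inclusive
};

std::vector<Task> MakeTasks(int from, int tip, int task_size)
{
    std::vector<Task> tasks;
    for (int h = from; h <= tip; h += task_size) {
        // Each worker later consumes one Task, calling CustomProcessBlock
        // for every block in [start_height, end_height].
        tasks.push_back({h, std::min(h + task_size - 1, tip)});
    }
    return tasks;
}
```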
@furszy furszy force-pushed the 2022_parallelize_blockfilter_index_2 branch from c38a5db to 0ad89bf on October 24, 2025, 05:34
@DrahtBot
🐙 This pull request conflicts with the target branch and needs rebase.
