index: initial sync speedup, parallelize process #26966
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/26966.
Reviews
See the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
LLM Linter (✨ experimental)
Possible typos and grammar issues:
Force-pushed from deb2e6e to d8b0f5e.
Cool, will take it for a spin...
Force-pushed from d8b0f5e to 65dc850.
Cool @Sjors, just pushed a small update. Found a little bug. Going to add an important note to the PR description (because otherwise testing results will vary a lot): as the access to the block data on disk is protected by cs_main, this new feature runs substantially faster when the node is not in IBD.
Force-pushed from 65dc850 to 09cc56a.
mzumsande left a comment:
It's probably much easier said than done, but did you attempt to implement parallelization in a more general way, so that other indices could benefit from it as well? At first glance, txindex and the indices suggested in PRs #24539 and #26951 seem to be parallelizable as well (not coinstatsindex though).
Yeah, that is part of the plan. I started with the block filter index because it requires a special treatment that … My idea was to start reviewing this one, so the process gets as clean as possible, and then move forward with the generalization step. It's usually more natural to abstract processes when the specific cases are well-defined.
Force-pushed from 09cc56a to 43e6237.
w0xlt left a comment:
Concept ACK
Perhaps it could have separate parallelization and index logic, so it could be reused in other indexes.
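Purely to illustrate that suggestion (this is not code from the PR, and every name here is hypothetical), the scheduling part could live in a generic helper that knows nothing about any particular index and only receives an index-supplied callback:

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Illustrative sketch only: splits [start_height, end_height] into fixed-size
// ranges and runs an index-provided callback on each range in parallel.
void ForEachBlockRangeParallel(int start_height, int end_height, int range_size,
                               unsigned num_workers,
                               const std::function<void(int, int)>& process_range)
{
    // Pre-compute the [start, end] ranges to hand out to workers.
    std::vector<std::pair<int, int>> ranges;
    for (int h = start_height; h <= end_height; h += range_size) {
        ranges.emplace_back(h, std::min(h + range_size - 1, end_height));
    }

    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_workers; ++i) {
        workers.emplace_back([&] {
            // Each worker claims the next unprocessed range until none are left.
            for (size_t idx = next++; idx < ranges.size(); idx = next++) {
                process_range(ranges[idx].first, ranges[idx].second); // index-specific work
            }
        });
    }
    for (auto& w : workers) w.join();
}
```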
Force-pushed from b2b62f5 to 4e1fba1.
Building the index on (AMD Ryzen 7950X, blocks stored on SSD):
I made sure not to load any wallets and disabled other indexes. I didn't test whether the index was correct. I wonder if, for users without this index, it would be faster to generate the index, rescan the wallet, and then delete it again. Combined with #26951 you would only have to generate filters up to the age of the wallet (IIUC, cc @pstratem). Note to self for future benchmarks:
Great results @Sjors! You could also give it a run rebased on top of #27006.
The PR contains a test verifying it. Side note:
Force-pushed from 1da78b2 to f264ed5.
Force-pushed from 201a90f to 21479ee.
I'm still working on the current approach, but I synced the tx index in the same environment as before: bare-metal hardware running on an SSD, synchronizing Signet up to block height 240593 (following the testing steps in the PR description). Results:
Parallel mode (5 workers): ~5 minutes.
Sequential mode: 13 minutes.
Furthermore, this is something I haven't pushed yet, but from the batching-DB-writes branch:
Parallel mode (5 workers), batching DB writes: ~2.30 minutes.
So I'd guess that it's not only related to disk access; the difference might also be tied to your CPU virtualization technology, @andrewtoth.
Moreover, I mentioned this somewhere earlier in this PR, but it probably got lost over the years: this PR also prepares the ground for batching DB writes across blocks, which will improve sequential runs as well, benefiting every user.
Also, parallelization is disabled by default. I don't think it's worth removing it for the tx index just because some users might not be able to take advantage of it. It seems to me that the benefits for those who can run it greatly outweigh the limitations for those who can't (they're not losing anything with this PR, since sequential mode is still available).
Force-pushed from fcdd89b to f7cf59e.
What's the reason for the frequent pushes here recently?
I wrote it above, #26966 (comment). I'm working on a few changes that improve parallelism across indexes and the structure of the code; they work properly locally but are not stable on CI yet. I will update the PR with the modifications once they are finished.
Force-pushed from 3dede1f to 831dec8.
For the "hardcoded 2 threads" step: I wrote this above, but adding 2 threads or …
Furthermore, the description of this as "critical index infrastructure" is not entirely accurate. We are not dealing with consensus-related code here; this code lives on a different level. Existing tests already verify both its behavior and its outcomes, and the previous sequential sync behavior is retained. This PR introduces a new optional feature which, in the worst-case scenario, can run slower than sequential sync if it is badly configured by the user (something that could also happen with other settings, such as a high number of block signature verification threads).
Also, about the "Given that this PR has struggled to attract reviewers over the past two years":
Now that the CI is happy again, an update regarding the introduced changes:
Before, we had queues at two different levels: one at the thread pool level and another one at the index level. The initial rationale behind it was to have granular control over how the process interrupts them when needed (not draining the index's queue …). So, in short, the code should be cleaner and run faster now, using fewer mutexes.
Also, there is a possible follow-up improvement: moving the final post-processing round into the last task being executed (in case more than one task finishes at the same time at the end of sync). I didn't do it here because it comes with some extra code complexity that I don't think is worth adding to this PR; for the same reason I don't think parallelizing the coinstats index here is a good idea, nor batching DB writes. Better to go in steps, as the gains introduced here are already a pretty solid step forward.
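As background for the "one queue instead of two" point, a minimal single-queue worker pool (a generic sketch with made-up names, not Bitcoin Core's actual thread pool) looks roughly like this: tasks go into one mutex-protected queue and workers pull from it directly, so no second, index-level queue (with its own mutex) is needed.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SingleQueuePool {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_queue; // the only queue
    std::vector<std::thread> m_workers;
    bool m_stop{false};

public:
    explicit SingleQueuePool(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i) {
            m_workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock lock{m_mutex};
                        m_cv.wait(lock, [this] { return m_stop || !m_queue.empty(); });
                        if (m_stop && m_queue.empty()) return;
                        task = std::move(m_queue.front());
                        m_queue.pop();
                    }
                    task(); // e.g. process a block range for an index
                }
            });
        }
    }

    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard lock{m_mutex};
            m_queue.push(std::move(task));
        }
        m_cv.notify_one();
    }

    ~SingleQueuePool()
    {
        {
            std::lock_guard lock{m_mutex};
            m_stop = true;
        }
        m_cv.notify_all();
        for (auto& w : m_workers) w.join();
    }
};
```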
Force-pushed from 831dec8 to c38a5db.
Small update: per discussion, I will push the thread pool in an isolated PR in the coming week, replacing the current http server thread handling.
Also, I'm investigating the tx index slowdown with @andrewtoth, trying to understand the differences between our setups. Worst-case scenario, we can disable parallel sync for it (will update accordingly in the coming week).
Last note: I reworked the large commit, splitting it into smaller chunks. We can come back to this one after (or while) the http server thread handling PR moves on, but I will focus on moving forward there first, per feedback.
No behavior changes. This separation lays the groundwork for parallelizing the block data digest phase, while preserving sequential consistency for indexes that require it before dumping the data to disk. It will also allow us to batch database and file writes across multiple blocks.
Essentially, it introduces a two-phase block processing mechanism using 'CustomProcessBlock()' and 'CustomPostProcessBlocks()':
1) 'CustomProcessBlock()' is in charge of digesting block data and returning a custom result object per index. The goal is to encapsulate here all procedures that can safely run concurrently.
2) 'CustomPostProcessBlocks()' receives all digested objects in order, allowing the index to perform any subsequent steps that require sequentiality. For example, the block filter index requires sequentiality in the headers-chain construction, as each record depends on its predecessor's hash.
Important note: at present, the code executes entirely synchronously.
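As a toy illustration of the two-phase split described above (heavily simplified types of my own invention; the PR's real signatures and result objects will differ), an index digests blocks into per-block results that can be produced concurrently, and then folds them in strictly in block order:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified "block data": just a height and a tx count.
struct BlockInfo {
    int height;
    std::size_t tx_count;
};

class ExampleTxCountIndex {
    std::size_t m_total_txs{0}; // index state, only touched in the sequential phase

public:
    // Phase 1: digest one block into a custom result. Touches no shared state,
    // so it is safe to run concurrently for different blocks.
    std::size_t CustomProcessBlock(const BlockInfo& block) const { return block.tx_count; }

    // Phase 2: consume the digested results in block order, performing the steps
    // that require sequentiality (and, in a real index, the writes to disk).
    void CustomPostProcessBlocks(const std::vector<std::size_t>& ordered_results)
    {
        for (std::size_t r : ordered_results) m_total_txs += r;
    }
};
```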
No behavior changes. The index still processes one block at a time. The code was refactored to allow handling multiple blocks in the processing phase, introducing the concept of a "Task", which merely represents a range of blocks and their resulting digests. In the next commit, we will take advantage of this to process batches of blocks concurrently.
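In other words (a rough sketch with illustrative names, not necessarily the PR's exact fields), a task is just a contiguous block range bundled with the per-block digests produced for it:

```cpp
#include <vector>

// Illustrative only: a "Task" pairs a contiguous block range with the
// per-block results produced while processing it.
template <typename Digest>
struct Task {
    int start_height{0};
    int end_height{0};                 // inclusive upper bound of the range
    std::vector<Digest> block_results; // one digest per block in the range
};
```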
Move progress-related timers (logging and locator write) into a shared SyncContext struct with atomic members. This does not change behavior but prepares the code for safe multi-threaded execution in the next commit.
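A minimal sketch of what such a shared struct could look like (the field names are assumptions, not the PR's actual members): the timers become atomics so several workers can check and update them without taking a lock.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative field names; the real struct in the PR may differ.
struct SyncContext {
    std::atomic<int64_t> next_log_time_us{0};           // next time to log sync progress
    std::atomic<int64_t> next_locator_write_time_us{0}; // next time to persist the best-block locator
};
```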
No behavior changes. This refactor moves the existing processing logic from BaseIndex::Sync into BaseIndex::RunTask, simplifying the main sync loop and laying the groundwork for parallelizing block processing in the upcoming commit. The goal is to have BaseIndex::RunTask running concurrently.
And add an option to customize the thread pool worker count.
This introduces parallel sync for the indexes' initial sync, distributing the workload across multiple threads. When enabled, the chain is divided into fixed-size block ranges called "tasks". Worker threads consume tasks concurrently, calling CustomProcessBlock over their assigned range and storing results in a shared context, while opportunistically batching and dumping the collected information to disk sequentially (when needed); a simplified sketch follows below. Since large reorgs are improbable during initial sync (the headers-chain PoW dictates the valid chain), reorgs are detected only before syncing begins and once it completes. Any new blocks connected during the process are caught up sequentially at the end. This, together with the fact that we no longer depend on an intermediate "index best block" that might be out of sync with the index m_best_block_index, allows us to remove the index_reorg_crash test, as it is no longer possible to call Rewind on a block that is not the index tip.
It also adds coverage for initial sync from a particular block, mimicking a node restart.
Co-authored-by: furszy <matiasfurszyfer@protonmail.com>
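A heavily simplified sketch of the scheme described in the commit message above (generic code with made-up names, not the actual BaseIndex implementation): workers claim fixed-size block ranges, digest them concurrently, and completed ranges are handed to the sequential flush step strictly in height order.

```cpp
#include <algorithm>
#include <atomic>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

struct ExampleResult { int height; };

void ParallelSyncSketch(int tip_height, int task_size, unsigned num_workers)
{
    std::atomic<int> next_start{0};
    std::mutex results_mutex;
    std::map<int, std::vector<ExampleResult>> pending; // completed tasks, keyed by start height
    int flush_cursor = 0; // next start height the sequential phase is waiting for

    auto worker = [&] {
        for (;;) {
            const int start = next_start.fetch_add(task_size);
            if (start > tip_height) return;
            const int end = std::min(start + task_size - 1, tip_height);

            // Concurrent phase: digest every block in the claimed range.
            std::vector<ExampleResult> results;
            for (int h = start; h <= end; ++h) results.push_back({h});

            // Sequential phase: flush only contiguous, in-order results.
            std::lock_guard lock{results_mutex};
            pending.emplace(start, std::move(results));
            while (!pending.empty() && pending.begin()->first == flush_cursor) {
                flush_cursor += static_cast<int>(pending.begin()->second.size());
                // ... a real index would post-process and write these to disk here ...
                pending.erase(pending.begin());
            }
        }
    };

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_workers; ++i) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
}
```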
Force-pushed from c38a5db to 0ad89bf.
🐙 This pull request conflicts with the target branch and needs rebase.
The current procedure for building the block filter index involves processing filters one at a time: reading blocks, undo data, and previous headers from disk sequentially.
This PR introduces a new mechanism to perform the work concurrently, dividing the filter generation workload among a pool of workers that can be configured by the user and significantly increasing the speed of the index construction process.
The same concurrent processing model has been applied to the transaction index as well.
The newly introduced init flag -indexworkers=<n> enables the concurrent sync behavior, where <n> is the number of worker threads that will be spawned at startup to create ranges of block filters during the initial sync process. The worker pool is destroyed once the initial sync completes.
Note: by default, the parallelized sync process is not enabled.
Now the juicy part:
On my computer, with the node in debug mode and in IBD, with -indexworkers=4 the block filter index generation took less than an hour, while on master the sync took more than 7 hours.
Important Note:
As the access to the block data on disk is protected by cs_main, this new feature runs substantially faster when the node is not in IBD.
Testing Notes:
Sync your node without any index.
Restart the node with one of the indexes (-blockfilterindex or -txindex) and -connect=0 (to sync only the index, without running the net/validation threads; since threads won't be competing for cs_main, this will give you a more accurate result).
You'll see a "[index name] is enabled at height [height]" log entry once it finishes. Then it's just a matter of subtracting the index startup time from the "index synced" log time.
Keep in mind that threads are shared among all indexes you start. So if you run both indexes at the same time, your benchmark results cannot be compared against single-index runs.
Fuzz Test Coverage Report
Coverage diff for the fuzz test can be found at https://brunoerg.xyz/bitcoin-core-coverage/26966/coverage_report/