index: batch db writes during initial sync #34489
furszy wants to merge 10 commits into bitcoin:master
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviews: See the guideline for information on the review process. If your review is incorrectly listed, please copy-paste [...]
Conflicts: Reviewers, this pull request conflicts with the following ones: [...] If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
LLM Linter (✨ experimental): Possible typos and grammar issues: [...] Possible places where named args for integral literals may be used (e.g. [...]).
2026-03-08 18:11:08 |
Force-pushed 4734e8d to cf065aa
🚧 At least one of the CI tasks failed.
Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still happen for reasons unrelated to this change.
Leave a comment here, if you need help tracking down a confusing failure.
Concept ACK, I prefer this over #33306 - and will investigate if this solves that problem once my benchmarking servers free up.
I experimented with something similar in https://github.com/l0rinc/bitcoin/pull/37/changes, will compare it against this change. I wrote:
Most likely since writing isn't the bottleneck but MuHash calculations were. |
@fjahr the comment was merely an excuse to decouple the DB-writes batching code out of #26966, rather than me having a formed opinion in favor of or against #33306. That's why I didn't mention it in the PR description. Maybe the changes complement each other. To be crystal clear, I just updated the PR description with further details on why this change is worthwhile on its own, independently of #33306. To summarize it here, the goals of the changes are:
Force-pushed dd76491 to a2f8447
Updated per feedback from @hebasto (thanks!). |
UPDATE: Fixed in #34498. |
Can you run all unit and functional tests on all commits? Or do they time out? |
Force-pushed a2f8447 to 8826900
Bad squash, my bad. Thanks for the heads up. Fixed. |
Builds clean on ARM64 (Pi 5, Debian Bookworm). All index-related functional tests and blockfilter_index unit tests pass. |
Force-pushed 8826900 to 6f66ff7
I am confused as to how this batching relates to the batching logic of the original code.
The block data is written inside [...]. We're not replacing the locator update time nor merging it with something else here; we're only batching something that previously wasn't batched at all. Also, using a block window instead of reusing the existing time-based window is intentional. Time-based flushing makes memory usage hard to predict and difficult to test (see #32878), because the amount of buffered data depends on runtime speed, which does not necessarily correlate with available memory (batched data stays in memory until it's flushed). A fixed block window provides predictable memory usage and deterministic behavior, which is overall better and also integrates nicely with multiple-worker sync (#26966). I would argue in favor of ditching the time window in the future, but that's not something related to this PR.
Force-pushed 6f66ff7 to d0f793a
Updated per feedback, plus added test coverage for the introduced changes. Thanks @optout21. |
Thanks for the explanations, and apologies for asking trivial/irrelevant questions.
I was puzzled by the fact that there is a loop in [...]. Another approach would be to keep the outer loop in [...]
Yes. That would still lock [...]
CI failure is unrelated. |
This PR only touches indexes code and nothing else. CI was failing in [...]
Could add some debug logging to see where it's getting stuck. At first glance, it looks like it stalls during block creation. Also, just to be clear to other reviewers. Since the issue happens there as well, we know for sure it is not related to this PR. |
looks like CI passed now |
yes, review can continue. |
 * This index is used to serve BIP 157 net requests.
 */
class BlockFilterIndex final : public BaseIndex
I'm not sure removing final from BlockFilterIndex only to allow IndexBlockSim to subclass it for testing is the better decision. I understand final was used for a reason, and I don't think modifying the code to fit a test is the best option.
Restoring final makes the compilation fail with this error:
[ 82%] Building CXX object src/test/CMakeFiles/test_bitcoin.dir/checkqueue_tests.cpp.o
/home/arejula27/workspaces/bitcoin/src/test/blockfilter_index_tests.cpp:440:7: error: cannot derive from ‘final’ base ‘BlockFilterIndex’ in derived type ‘blockfilter_index_tests::IndexBlockSim’
440 | class IndexBlockSim : public BlockFilterIndex
| ^~~~~~~~~~~~~
gmake[2]: *** [src/test/CMakeFiles/test_bitcoin.dir/build.make:348: src/test/CMakeFiles/test_bitcoin.dir/blockfilter_index_tests.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:2217: src/test/CMakeFiles/test_bitcoin.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
Maybe we should derive the test class from BaseIndex, like IndexReorgCrash.
Yeah, I went back and forth on this decision for a while before pushing it. The idea was to actually cover BlockFilterIndex methods during reorgs, not just the BaseIndex behavior, even if they aren't strictly required for this PR's changes.
Ideally, we should have both testing scenarios: reorgs during initial sync against BaseIndex, and reorgs against BlockFilterIndex. But.. doing that would require extracting BuildChainTestingSetup into a separate file, and I didn't want to expand the scope of this PR any further. This test, and even the existing index_reorg_crash (the one that uses IndexReorgCrash) do not belong to blockfilter_index_tests.cpp. They only test against the base class. But well, that is something that already exists in our codebase, and not something we are including here, so it is better to leave it for a different PR.
Also, the final declaration here (and in all other indexes) seems really unnecessary. I cannot think of anything we gain from it. It only restricts our unit testing capabilities. I would remove it from all indexes.
That being said, most of this is just me ranting about improvements we could make. So will take your suggestion and use the BaseIndex class here for now. Let's go step by step. Thanks for the feedback.
Ideally, we should have both testing scenarios: reorgs during initial sync against BaseIndex, and reorgs against BlockFilterIndex.
Why? Isn't that redundant?
They only test against the base class. But that is something that already exists in our codebase and is not included here, so it is better to leave it for a different PR.
Also, the final declaration here (and in all other indexes) seems really unnecessary. I cannot think of anything we gain from it. It only restricts our unit testing capabilities. I would remove it from all indexes.
Sounds reasonable to discuss in another PR. I like the idea of removing final if it is not needed, but as you said, it should be consistent across all indexes.
Why? Isn't that redundant?
We can test at different levels.
The BaseIndex test can verify that we always follow the best chain during initial sync, that the appropriate child methods are called in order (append - remove), and check the db chain locator updates. Then, BlockFilterIndex test, and others, should only focus on manually connecting and disconnecting blocks, verifying that its data is consistent in memory and disk.
Two small comments, I feel the one about removing [...]. Waiting for the HDD benchmark of @l0rinc. I can do it if he is busy.
Having more benchmarks does not hurt :). Feel more than welcome to add your results as well!
No behavior change. This is just for correctness.
No behavior change.
No behavior change. Introduce ProcessBlocks(start, end) to handle a range of blocks in forward order. Currently used per-block, but this lays the foundation for future batch processing and parallelization. This has the nice property of allowing us to collect the block indexes we are about to process without locking 'cs_main', just by traversing the chain backwards via each block index's pprev.
Force-pushed 80aea07 to 5be65db
Updated per feedback. Thanks @arejula27. |
Did another index benchmark for the latest push on an HDD:
indexes | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
BEFORE="4c40a923f003420193aa574745f70788bcf35265"; AFTER="5be65dbcdceaafb6f5a0e34c69efc2b3dfcfe27a"; \
DATA_DIR="/mnt/my_storage/BitcoinData"; export DATA_DIR; \
RESULTS_FILE="${DATA_DIR}/index_benchmark_results.txt"; export RESULTS_FILE; \
wait_index() { \
tail -F ${DATA_DIR}/debug.log 2>/dev/null | grep -q -m1 'index is enabled\|is enabled at height'; \
sleep 2; \
killall bitcoind 2>/dev/null || true; \
wait; \
}; export -f wait_index; \
log_size() { \
local label="$1"; local index_dir="$2"; \
local size=$(du -sh "$index_dir" 2>/dev/null | cut -f1); \
local files=$(find "$index_dir" -type f -name '*.ldb' 2>/dev/null | wc -l); \
echo "$label: size=$size, ldb_files=$files" | tee -a "$RESULTS_FILE"; \
}; export -f log_size; \
git reset --hard >/dev/null 2>&1 && git clean -fxd >/dev/null 2>&1 && git fetch origin $BEFORE $AFTER >/dev/null 2>&1; \
for c in $BEFORE:build-before $AFTER:build-after; do \
git checkout ${c%:*} >/dev/null 2>&1 && cmake -B ${c#*:} -G Ninja -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 && ninja -C ${c#*:} bitcoind >/dev/null 2>&1; \
done; \
echo "indexes | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $DATA_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $DATA_DIR | tail -1) 2>/dev/null | grep -q 0 && echo SSD || echo HDD)" | tee -a "$RESULTS_FILE" && \
for INDEX in txindex blockfilterindex coinstatsindex txospenderindex; do \
echo -e "\n--- $INDEX ---" | tee -a "$RESULTS_FILE"; \
if [ "$INDEX" = "blockfilterindex" ]; then \
INDEX_DIR="${DATA_DIR}/indexes/blockfilter/basic"; \
else \
INDEX_DIR="${DATA_DIR}/indexes/${INDEX}"; \
fi; \
export INDEX_DIR; \
for BUILD in before after; do \
if [ "$BUILD" = "before" ]; then COMMIT="${BEFORE:0:7}"; else COMMIT="${AFTER:0:7}"; fi; \
hyperfine --runs 1 --shell bash --sort command \
--prepare "rm -rf ${DATA_DIR}/indexes/* ${DATA_DIR}/debug.log" \
--cleanup "log_size '${INDEX} ${BUILD}' '${INDEX_DIR}'" \
-n "${BUILD} (${COMMIT})" \
"./build-${BUILD}/bin/bitcoind -datadir=${DATA_DIR} -${INDEX}=1 -connect=0 -printtoconsole=0 & wait_index" \
2>&1 | tee -a "$RESULTS_FILE"; \
done; \
done;
Benchmark 1: before (4c40a92) Benchmark 1: after (5be65db)
Benchmark 1: after (5be65db)
Edit: added [...]
It looks like the improvement is not very noticeable for two of the indexes. It might be interesting to try increasing the batch size. Do you have any suggestions on what could be causing this? |
I'm not sure this is really surprising given the difference in the amount of data written to the latter two. I would expect a jump more similar to the txindex for the txospenderindex. What is the expectation on the increase of memory usage here? I tried profiling it with massif, but usage when running with this patch seemed a bit erratic, which I guess is still an indication that it is likely to be higher. Can we really batch operations in this way without putting more memory pressure on the OS? |
The batch size is intentionally small. I was very conservative with the chosen value to avoid having to worry about it here. The goal of this PR is to land the structural improvements with a net positive: less [...]. It's expected that any batching implementation adds some memory overhead, but that shouldn't be a problem; it pays off in many other areas, not only in the promise of a faster sync. And, if we ever wanted to, we could add a config flag to disable batching by setting the batch size to 1. But realistically speaking, I don't think anyone running indexes would want them to take hours to sync while locking [...]. Also, an important point: all these changes only matter during the index initial sync period. Once the index is synced, they are no longer relevant. We can benchmark this on small devices like the Pi 4 to see the benefits more clearly, but I honestly wouldn't spend too much time on it. If we agree this PR lands structural, scalable improvements with many benefits and no noticeable downside, and also lets us move forward with the major parallelization speedup path, it seems reasonable to just move forward. This also adds a good number of tests we currently lack.
This is a refactoring that makes NextSyncBlock easily adaptable to return block ranges instead of single blocks in the subsequent commit. It also adds a fast-path that avoids calling 'FindFork()' and 'Next()' when the index is synced. Co-authored-by: Hao Xu <hao.xu@linux.dev>
No behavior change.
Set the end of the window of blocks an index will process at a time, stopping at either the configured window size or the chain tip. This is the first step toward batch and parallel processing. These block windows become units of work and live in memory until flushed to disk in one go. Since each index has a different data size, the range size is configurable so it can be tuned considering memory consumption.
Pass CDBBatch to subclasses so writes can be accumulated and committed together instead of flushed per block. Note: batch writes were intentionally not tied to the existing Commit() method. The rationale is to bound memory consumption, as batches accumulate in RAM until flushed to disk. This leaves two separate workflows with different purposes: 1) Batch writes flush data for a block range atomically: either all blocks in the batch land in LevelDB or none do. 2) Commit() is about progress checkpointing: it writes the best block locator alongside any other index state (e.g. the last written file for the block filter index, which requires an fsync call), so the node always has a consistent recovery point after an interruption or crash. Reprocessing data after a crash is not really a problem because we would just overwrite the existing data.
Verifies the index persists the last processed index when interrupted.
Force-pushed 5be65db to 1497086
l0rinc left a comment:
Concept ACK on the goal: batching is the right direction for reducing write amplification, cs_main contention, and LDB file proliferation.
However, I think it's important to stress my concern that this change is more complex than it's being treated as.
It alters the invariants every index subclass relies on: in-memory state can now advance ahead of what's persisted, and the interrupt/shutdown paths haven't been fully thought through for each subclass.
The most serious issue seems to me that Commit() during interrupt writes advanced m_muhash alongside a stale locator, which causes CoinStatsIndex::CustomInit to reject the index as corrupted on restart. blockfilterindex has a milder variant (orphaned flat file data). I wonder if fuzzing could catch these.
More generally, several "no behavior change" refactoring commits don't explain why the restructuring is safe, and memory concerns are brushed away with "shouldn't be a problem".
I also left a ton of comments about the ProcessBlocks loop needing tighter bounds and the shutdown test only exercising a no-op BaseIndex subclass. It doesn't catch the actual subclass corruption paths.
Some of the earlier commits (e.g. the iterator ordering fix) could land as standalone PRs to reduce the review surface here.
See inline comments for details on these and other issues (interrupt handler scoping, NextSyncBlock invariants, loop safety, stale comments, etc.).
Note: in the meantime, I finished the HDD benchmarks with txospenderindex measurements, see #34489 (comment).
src/index/base.cpp (outdated):
    }
    pindex = pindex_next;
    ...
    if (!ProcessBlock(pindex_next)) return; // error logged internally
95fc91f index: sync, update iterator only after processing block:
We're changing behavior in a commit that claims it's not a behavior change, without adding context for why this change was needed or how to validate that it is correct. Given that it wasn't obvious to the original authors, an explanation is needed. Also, no test was added to exercise this behavior, explain exactly why it was necessary, and make sure it doesn't happen again.
It also seems independent enough to be pushed in a separate PR, with a test documenting the current behavior in the first commit and, in the next commit, a fix and a test update that reflects the new behavior.
| } | ||
| } | ||
|
|
||
| if (m_interrupt) { |
0cbe84d index: refactor, decouple interruption from sync loop:
The commit claims there's no behavior change, but it doesn't explain why the move is safe. Given all the internal returns and breaks it's not self-evident.
It seems to me that previously a synced, and later interrupted, path could hit LogInfo("%s is enabled at height %d", GetName(), pindex->nHeight);, but after the change it hits LogInfo("%s: m_interrupt set; exiting ThreadSync", GetName()); and returns.
I'm not sure there is any other change, but I would appreciate a commit message explanation, and maybe a test exercising this path, preferably a fuzz test to cover all these weird combinations.
bool BaseIndex::ProcessBlocks(const CBlockIndex* start, const CBlockIndex* end)
{
    // Collect all block indexes from [end...start] in order
0336c61 index: add method to process block ranges:
nit: for consistency we could use the same comment style as below:
Suggested change:
-    // Collect all block indexes from [end...start] in order
+    // Collect all block indexes from end to start
    // Collect all block indexes from [end...start] in order
    std::vector<const CBlockIndex*> ordered_blocks;
    ordered_blocks.reserve(end->nHeight - start->nHeight + 1);
    for (const CBlockIndex* block = end; block && start->pprev != block; block = block->pprev) {
0336c61 index: add method to process block ranges:
When can block be nullptr here? Wouldn't it be a fatal error if we walk past genesis? And what if the caller mixes up begin and end? Or if start and end have valid heights but are on different forks.
Given how intertwined the code is here, a serious mess-up could cause an infinite loop and would be really hard to trace back to the source.
I would probably sleep better if we asserted that end->nHeight >= start->nHeight and used a bounded for loop here instead.
const int range_size{end->nHeight - start->nHeight + 1};
Assert(range_size > 0);
// Collect all block indexes from end to start
std::vector<const CBlockIndex*> ordered_blocks;
{
    ordered_blocks.reserve(range_size);
    const CBlockIndex* it{end};
    for (int i{0}; i < range_size; ++i) {
        ordered_blocks.emplace_back(Assert(it));
        it = it->pprev;
    }
    Assert(it == start->pprev);
}
    // And process blocks in forward order: from start to end
    for (auto it = ordered_blocks.rbegin(); it != ordered_blocks.rend(); ++it) {
        if (m_interrupt) return false;
0336c61 index: add method to process block ranges:
The method comment states:
Returns false on unrecoverable failure or during interruption
But if the interruption happens before we call the first ProcessBlock, we should be safe since no state was changed - i.e. nothing to recover from.
// Verifies that the index persists its sync progress when interrupted during initial sync.
// The index should resume from the last processed batch rather than restarting from genesis.
BOOST_FIXTURE_TEST_CASE(shutdown_during_initial_sync, BuildChainTestingSetup)
1497086 test: Add coverage for index locator persistence during shutdown:
This should have probably caught the partial write regression.
Can we extend IndexCommitStateSim to implement CustomCommit as well?
    index.Stop();
}
...
// Ensure the initial sync batch window behaves as expected.
7c82cc8 test: index, add initial sync batch writes coverage:
Adding tests at the very end doesn't help document the progression of the changes introduced in this PR, commit by commit.
We could add this test as a first commit:
// Ensure initial sync can be restarted cleanly without overwriting earlier
// results. Tests sync from genesis and from a higher block to mimic a restart.
BOOST_FIXTURE_TEST_CASE(initial_sync_restart, BuildChainTestingSetup)
and later, in the batch size commit, just add filter_index.SetProcessingBatchSize(BATCH_SIZE); to the test to prove that it still passes. That way, we are asserting that the previous behavior is retained, not just that the new behavior is covered.
// Tests that indexes can complete its initial sync even if a reorg occurs mid-sync.
// The index is paused at a specific block while a fork is introduced a few blocks before the tip.
// Once unblocked, the index should continue syncing and correctly reach the new chain tip.
BOOST_FIXTURE_TEST_CASE(initial_sync_reorg, BuildChainTestingSetup)
28d8b04 test: index, add coverage for initial sync reorgs:
This test almost completely passes on master - we should add as much coverage as possible before the refactor.
But we need to make it more realistic, likely by overriding CustomAppend as well.
constexpr int SHUTDOWN_HEIGHT = 45;
constexpr int BATCH_SIZE = 10;
constexpr int EXPECTED_LAST_SYNCED_BLOCK = 39;
1497086 test: Add coverage for index locator persistence during shutdown:
This passes before the change with:
// The index will be interrupted while processing block 45. Since sync is
// still per-block here, the last persisted block is expected to be 45.
constexpr int SHUTDOWN_HEIGHT = 45;
constexpr int EXPECTED_LAST_SYNCED_BLOCK = 45;
(and no SetProcessingBatchSize, of course)
-bool BlockFilterIndex::Write(const BlockFilter& filter, uint32_t block_height, const uint256& filter_header)
+bool BlockFilterIndex::Write(CDBBatch& batch, const BlockFilter& filter, uint32_t block_height, const uint256& filter_header)
 {
     size_t bytes_written = WriteFilterToDisk(m_next_filter_pos, filter);
We seem to be writing to disk too eagerly here; this is also misaligned with the batching.


Decouples part of #26966.
Right now, index initial sync writes to disk for every block, which is not the best for HDDs. This change batches those writes, so disk writes are less frequent during initial sync. This also cuts down cs_main contention by reducing the number of NextSyncBlock calls (instead of calling it for every block, we will call it once per block window), making the node more responsive (IBD and validation) while indexes sync up. On top of that, it lays the groundwork for the bigger speedups, since part of the parallelization pre-work is already in place.
Just as a small summary:
- Fewer, larger disk writes during initial sync.
- Less cs_main lock contention, which improves the overall node responsiveness (and primarily IBD) while the index threads are syncing.