-
Notifications
You must be signed in to change notification settings - Fork 38.7k
validation: reduce persisted UTXO set size by prioritizing positive lookups (RFC) #33817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
LevelDB's `FilterPolicy` stores a per-table probabilistic filter to optimize for negative lookups. Outside the BIP30/BIP34 window (first ~230k blocks), validation does not deliberately probe the UTXO set for missing entries (missing coins imply invalid transactions). Filters therefore slow the common case (present-key lookups) while adding a probabilistic structure to on-disk tables. Bloom filters were introduced in the Ultraprune PR (bitcoin#1677) without explicit documentation of their purpose. Removing them slightly speeds up present-key workloads (~11% faster AssumeUTXO load) and reduces the on-disk chainstate size by ~2% because filter blocks are not stored. This commit is placed before burying BIP30 behind assumevalid to make performance changes reproducible in isolation. Benchmarking reindex-chainstate for the first 230k blocks (to quantify the cost of negative lookups without filters) shows only a small slowdown on misses, totaling a few seconds, while later blocks can be faster due to optimizing for the common case. ----- 2b9c351 Merge bitcoin#33768: refactor: remove dead branches in `SingletonClusterImpl` 166d35713c leveldb: remove bloom filters from leveldb Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=230000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2b9c351) Time (mean ± σ): 170.615 s ± 0.468 s [User: 186.278 s, System: 10.035 s] Range (min … max): 170.285 s … 170.946 s 2 runs Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=230000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 166d35713cf61986bb4b37283cb8b001ad013771) Time (mean ± σ): 181.904 s ± 0.534 s [User: 196.567 s, System: 10.482 s] Range (min … max): 181.526 s … 182.281 s 2 runs
BIP30 prevents duplicate transaction IDs by checking whether outputs already exist in the UTXO set before adding them. This applies to blocks <227,930 (pre-BIP34 activation) and is conservatively re-enforced after height 1,983,701. BIP30 checks are the only place in validation where we intentionally query the UTXO database for entries we expect not to find. For blocks prior to the `assumevalid` anchor, we already skip script verification and other checks, relying on accumulated proof of work. Skipping BIP30 for those deeply buried blocks is consistent with assumevalid's purpose. This removes negative UTXO lookups during IBD when íassumevalidí is used. Nodes syncing from genesis with -assumevalid=0 still perform full BIP30 validation. Checks beyond 1,983,701 remain enforced regardless of `fScriptChecks`. ----- 2b9c351 Merge bitcoin#33768: refactor: remove dead branches in `SingletonClusterImpl` 060a83df97 validation: bury bip30 checks behind assumevalid Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=230000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2b9c351) Time (mean ± σ): 170.827 s ± 0.718 s [User: 186.351 s, System: 10.223 s] Range (min … max): 170.319 s … 171.334 s 2 runs Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=230000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 060a83df97a84e39a44a7f4a8ea27512d2e7b008) Time (mean ± σ): 128.569 s ± 0.168 s [User: 143.057 s, System: 10.436 s] Range (min … max): 128.449 s … 128.688 s 2 runs
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33817. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Not sure about this. Wouldn't this mean someone can feed a Even if it wasn't, I am not sure if touching validation.cpp is worth it for basically a rounding error on overall IBD speed? |
It's not about IBD necessarily, but reduced disk footprint and adjusting the database to resemble the usage more closely:
|
|
Hm. I don't know we were aware that you could turn off the filters in leveldb-- I thought they were used to also decide what level an entry might be in! Have you tried to characterize if this opens up any DOS attacks with unconfirmed transactions? I think assumeutxo load time is not the best benchmark for this-- it's a one time operation and already pretty fast. It would be more compelling if it could be shown to reduce IBD time or block validation time at tip-- though the validation time graph you've provided isn't very compelling. Nor do I think 2% storage is particularly compelling. But if there isn't a potential downside, why not? |
|
Thanks for the comments! |
|
🐙 This pull request conflicts with the target branch and needs rebase. |
draft to gather comments and conceptual reviews
Context
BIP30 prevents duplicate transaction IDs by checking whether outputs already exist in the UTXO set before adding them.
LevelDB's
FilterPolicystores a per-table probabilistic filter to optimize for negative lookups.After the first ~230k blocks (BIP30/BIP34 windows), validation does not deliberately probe the UTXO set for missing entries (missing coins imply invalid transactions).
Bloom filters therefore slow the common case (present-key lookups) while bloating the on-disk tables.
History
Bloom filters were introduced in the Ultraprune PR (#1677) without explicit documentation of their purpose.
Fix
For blocks prior to the
assumevalidanchor, we already skip script verification, relying on accumulated proof of work. Skipping BIP30 for those deeply buried blocks is consistent with assumevalid's purpose (especially after the recent checkpoint removal).Removing the
LevelDBbloom filters slightly speeds up present-key workloads (~11% fasterAssumeUTXOload) and reduces the on-disk chainstate size by ~2% because filter blocks are not stored.Disclaimer
Nodes syncing from genesis with
-assumevalid=0still perform full BIP30 validation, which may be a few seconds slower.Checks beyond 1,983,701 remain enforced regardless of
fScriptChecks.Performance
Performance change is best demonstrated by an
AssumeUTXOloading - since this change was mostly motivated by UTXO set size and memory reduction.AssumeUTXOloads with default dbcache show ~11% faster bootstrapping.11%, 8% and 1% faster assumeutxo load | 880000 blocks | dbcache 450/4500/45000 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
For reference, here is how the change affects

reindex-chainstateper 100k block chunk:(note: image will be moved to a comment later)
Persisted Size
UTXO set size depends on LevelDB compaction scheduling. To stability stabilize measurements, we have instrumented the code to compact after every block connect to see the exact effect of the bloom filters on number of LevelDB files and their total sizes. This is for on-disk size measurement only, not for performance.
compact after each block connection for stable size
Running a before/after
reindex-chainstateand plotting the on-disk size of the chainstate index for every block shows that the PR reduces the UTXO index by roughly 222MB (2%).instrumented benchmark patch
Full validation
To help with reproducibility, the first commit introduces a slight regression to demonstrate the need for the second commit.
With BIP30 checks still active and without LevelDB bloom filters, the first 230k blocks validate ~7% slower.
7% slower reindex-chainstate | 230000 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
With BIP30 buried behind assumevalid and without LevelDB bloom filters, the first 230k blocks validate ~33% faster.
33% faster reindex-chainstate | 230000 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD