Improve runtime performance of --reindex #16981
Conversation
Note: this comment is outdated; the scope of this PR has been reduced.

Reindex performance isn't very good because about 90% of blocks are read before their parents are read. (I think this is a result of the headers-first IBD performance enhancement of a few years ago.) So 90% of blocks end up being read from disk twice (actually, not quite twice, since the operating system's buffer cache likely still has the data, but there is still a memory-to-memory copy), and deserialized twice. This PR mitigates this problem by taking advantage of the fact that recent blocks are still in the buffer. The second part of this change initially deserializes only the block header, rather than the entire block, since about 90% of the time we're going to have to process the block later anyway. On my system, this PR substantially reduces the time to reindex.
src/validation.cpp
Outdated
This LogPrint probably should be combined with the (existing) one a few lines below before merging, but I wanted reviewers to be aware of when the new code is being run.
src/validation.cpp
Outdated
I don't know how to write a unit test for this PR (there are no existing unit tests for any of this area of the code), but test/functional/wallet_basic.py does run through the new code -- you can add an assert(false) here and that test fails, so that's at least something.
promag
left a comment
Concept ACK.
Do you think it's possible to split in 2 commits?
- first commit only refactors CBufferedFile (no behavior change)
- second improves LoadExternalBlockFile
Force-pushed 48c0cc7 to 6d8644d
Thanks for the suggestion, @promag, done (force-pushed).
Force-pushed 6d8644d to c598893
I forgot to update the …
I changed this PR to WIP because I thought of a further improvement, which I'll try to push by the end of the week. It will build on the existing two commits, so reviewing those wouldn't be a waste of time. Thanks in advance for any reviews and suggestions.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts: no conflicts as of last run.
Force-pushed c598893 to 7af0587
LarryRuane
left a comment
The main change is that the CBufferedFile memory buffer object can now span multiple files. It's very common that a block is read from disk before its parent, and that parent is found in a later blocks/blk00nnn.dat file. In that case, it's usually possible to find the earlier (child) block in the memory buffer, obviating the need to read it again from disk.
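To make the mechanism concrete, here is a minimal sketch of the idea, assuming a ring buffer over the logical stream formed by concatenating the block files. This is illustrative only, not the actual CBufferedFile code; the class and member names are invented.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch only (not the real CBufferedFile): a ring buffer over
// the logical stream formed by concatenating the blk*.dat files. Switching
// files does not clear the buffer, so a child block read near the end of one
// file can still be served from memory while its parent is being read from
// the next file.
class MultiFileWindow
{
    std::vector<unsigned char> m_buf; // ring buffer over the logical stream
    std::FILE* m_src{nullptr};        // current blk*.dat file
    uint64_t m_buffered{0};           // logical offset one past the last buffered byte

public:
    explicit MultiFileWindow(size_t size) : m_buf(size) {}

    // Advance to the next file; bytes buffered from earlier files survive.
    void SetFile(std::FILE* src) { m_src = src; }

    // Pull more bytes from the current file into the ring buffer.
    bool Fill()
    {
        const size_t slot = m_buffered % m_buf.size();
        const size_t room = m_buf.size() - slot; // contiguous space before wrap
        const size_t got = std::fread(m_buf.data() + slot, 1, room, m_src);
        m_buffered += got;
        return got > 0;
    }

    // Serve bytes at logical position `pos` if they are still in the window;
    // returns false if they were never buffered or have been overwritten, in
    // which case the caller falls back to a disk read.
    bool ReadAt(uint64_t pos, unsigned char* out, size_t n)
    {
        if (pos + n > m_buffered) return false;            // not buffered yet
        if (m_buffered - pos > m_buf.size()) return false; // already evicted
        for (size_t i = 0; i < n; ++i) out[i] = m_buf[(pos + i) % m_buf.size()];
        return true;
    }
};
```

The key point is that SetFile() changes only the data source; the buffered window and logical positions are untouched, which is what lets a lookup span a file boundary.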
src/init.cpp
Outdated
This is a much larger buffer (640 MB), but it only exists during the first phase of reindexing (when you see Reindexing block... in the debug log); after that, the bitcoind memory usage increases for other reasons anyway. During the first phase, memory (VSZ in the ps output) is about 2.4 GB on my system; afterwards (when it's logging UpdateTip: new best...), it's about 3.5 GB. The benefit of a large buffer is that a larger fraction of the out-of-order blocks (that have to be processed later, when their parents are seen) can be found in memory, rather than having to be read again from disk. The "hit ratio" with this buffer size is just under 99%. The difference between 160 (the overall buffer size, in max-size blocks) and 140 (the rewind limit) determines the size of disk reads (the larger the difference, the bigger and more efficient the reads).
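As a back-of-the-envelope illustration of that sizing (the 4 MB per-block unit below is an assumption for illustration, not a constant taken from this PR):

```cpp
#include <cstdint>
#include <cstdio>

// Back-of-the-envelope sizing sketch; the per-block unit is assumed.
int main()
{
    const uint64_t block_unit = 4'000'000; // assumed maximum serialized block size
    const uint64_t max_blocks = 160;       // overall buffer size, in block units
    const uint64_t rewind_blocks = 140;    // rewind (history) limit, in block units

    const uint64_t buffer_bytes = max_blocks * block_unit;                 // ~640 MB
    const uint64_t read_chunk = (max_blocks - rewind_blocks) * block_unit; // ~80 MB per disk read

    std::printf("buffer %llu bytes, disk-read chunk %llu bytes\n",
                static_cast<unsigned long long>(buffer_bytes),
                static_cast<unsigned long long>(read_chunk));
    return 0;
}
```

The larger the gap between the two limits, the bigger (and fewer) the disk reads; the larger the rewind limit, the more out-of-order blocks remain reachable in memory.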
src/streams.h
Outdated
This method is unused, and really doesn't make sense. The buffer contents should match file (or, now, files) content, and this messes that up. If it's needed in the future, it can be re-thought.
src/validation.cpp
Outdated
These pim (out-of-order block found in memory) and pom (out-of-order block not in memory) counters let reviewers see the effectiveness of the read caching during testing; they can be removed before merging. Here's a typical instance:
2019-10-09T15:28:17Z Loaded 839 blocks from external file in 11729ms pim 726 pom 95
This means 821 (726+95) blocks were processed out-of-order, so 18 (839-821) blocks were processed in-order (their parents had already been seen). This is typical, and shows that almost all blocks are processed out-of-order. With this PR, 726 of these out-of-order blocks were recent enough that they were still in the buffer, and 95 had already been pushed out and had to be read again from disk. Overall, 99% of the out-of-order blocks are still in the buffer.
I think the pim/pom statistics are useful, but maybe too special for the main log. You could use a separate BCLog::REINDEX message for that information.
Force-pushed dc256d8 to 14cb27d
Testnet performance isn't important, but as a sanity check, I verified that the new code runs faster there too on my laptop.
mzumsande
left a comment
This looks very interesting, will review the code more closely soon.
I am concept ACK for the first iteration, but would like to understand better if the additional benefits of the second iteration (multiple files) outweigh the costs (memory):
- Do you have data from mainnet for the multiple-file buffer implementation? I'm interested in whether there are diminishing returns or another huge speedup compared to the first revision of this PR.
- While other objects that use a lot of memory can be reduced (see reduce-memory.md) in order to run a full node in low-memory environments, the large size of CBufferedFile is hardcoded, which does not seem ideal. The minimum requirements mentioned in https://bitcoin.org/en/bitcoin-core/features/requirements are also considerably lower than the 2.4 GB that are used.
@mzumsande, those are very good thoughts, thank you. I will study this further (and invite others to do the same). One nice thing about this design is that any amount of memory will work, in the sense that any amount is an improvement; it is one of those classic space-time tradeoffs. The idea of making the buffer size not hard-coded is intriguing and promising.

I think we would not go back to the pre-second-iteration (single-file) design, because even with the buffer memory reduced to the original amount, the multi-file version is an improvement. (That is to say, there is no downside to the latest version here.) That's because all the multi-file commit does is avoid discarding cached data when switching files.

One thing I hope I made clear enough in an earlier comment is that I tried to keep the memory usage during reindexing strictly less than what is needed during normal operation, so that there would be no overall increase in memory demand. We certainly would not want this design to create a spike (high-water mark) of memory usage. But I was very likely mistaken, since I was not aware that memory could be "dialed down" for small environments.
Force-pushed 14cb27d to ed9b569
I force-pushed a reorganization of the commits to make reviewing easier. I split the largest commit, "speed up reindex by reading out-of-order blocks from memory", into two separate commits. We could merge the first commit without the second. The first commit is pretty simple; the second commit is a little more involved. These two commits keep the buffer size unchanged from master, 8 MB. Here are the block-load times on my system (with all default settings), which is just the first part of reindexing (each row builds on the previous row):

(Take with a grain of salt, because the times varied even on identical runs, and a lot probably depends on the specifics of my system.) Keep in mind that this buffer exists only during the block-load part of reindex.
mzumsande
left a comment
Concept ACK.
Thanks for the statistics, I agree that 32MB is a reasonable size.
src/streams.h
Outdated
Is it necessary to remove the FILE from the constructor? Couldn't you just initialize it with nullptr in the multi-file case and call SetFile() later, so that there need to be no syntax changes for the single-file case? It is also easy to create a CBufferedFile object and forget about SetFile(), leading to a memory access violation.
@mzumsande, good idea, I restored that argument to the constructor.
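For illustration, a simplified sketch of the restored shape (the class here is invented; the real CBufferedFile constructor takes additional parameters such as the buffer size and rewind limit):

```cpp
#include <cstdio>

// Invented illustration of the API question above, not the real class.
class BufferedFileSketch
{
    std::FILE* m_src;

public:
    // Single-file case: existing call sites keep passing the FILE* directly.
    explicit BufferedFileSketch(std::FILE* src) : m_src{src} {}

    // Multi-file case: construct with nullptr, then switch files as needed.
    void SetFile(std::FILE* src) { m_src = src; }

    // Callers (or assertions) can guard against a forgotten SetFile().
    bool HasFile() const { return m_src != nullptr; }
};
```

Keeping the FILE* parameter means single-file callers need no syntax changes, while the multi-file caller can still pass nullptr and call SetFile() later.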
Under what circumstances is a reindex needed?
@MarcoFalke Building new indices is required if upgrading from pre-0.17 to post-0.17 (https://bitcoin.org/en/release/v0.17.0) with a txindex. I haven't looked into whether this PR speeds up that process, but it is something to consider.
Force-pushed bef3953 to 451b1f3
Force-pushed to address review comments (thank you, @stickies-v), ready for review.
stickies-v
left a comment
re-ACK 451b1f3
I think this latest refactoring strikes a nice balance between optimization, maintenance and readability. Thank you for incorporating my comments, and my apologies to previous reviewers for making you drift away a bit further from the initial proposal. I think this is great! Hopefully we can get some more (re) ACKs on this soon.
Rebased, I'll look into adding a benchmark test.
Force-pushed 451b1f3 to 6c8bc9a
aureleoules
left a comment
Code Review ACK 6c8bc9a19cc9f35a93d264b69f63d4c93e2c7da2
LGTM and saw a 25%-30% reduction in the first part of reindexing on testnet.
src/streams.h
Outdated
nit e1df0eb806d919cecceebcb2c72c3f54b1a529f5: there is already a `private:` access specifier above this one.
theStack
left a comment
Code-review ACK 6c8bc9a19cc9f35a93d264b69f63d4c93e2c7da2
In addition to the performance improvement, this PR also increases the readability of the CBufferedFile::read logic by using more expressive names (special kudos for using std::min for determining the advance value, nice idea! A sketch of that pattern follows the timing results below).
Did three -reindex runs for both the master and PR branch (with a patch to shut down after emitting the "Reindexing finished" debug message) each on testnet at block height 2377503 with quite significant speed-up results:
master:
real 4m58.987s, user 4m0.337s, sys 0m38.207s
real 4m54.370s, user 4m2.202s, sys 0m36.742s
real 4m54.738s, user 4m2.151s, sys 0m37.362s
PR:
real 3m33.332s, user 2m42.151s, sys 0m36.165s
real 3m52.371s, user 2m42.028s, sys 0m37.558s
real 3m36.388s, user 2m41.354s, sys 0m36.710s
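Regarding the std::min remark above, here is the pattern sketched in isolation (illustrative, not the exact Bitcoin Core code): each loop iteration advances by the smallest of three limiting quantities, which replaces a chain of conditional adjustments.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative sketch of a ring-buffer read loop; not the exact code.
void ReadFromRing(const std::vector<unsigned char>& ring, uint64_t& read_pos,
                  uint64_t src_pos, unsigned char* dst, size_t n)
{
    while (n > 0) {
        if (read_pos == src_pos) break; // the real code refills from disk here
        const size_t buf_offset = read_pos % ring.size();
        const size_t advance = std::min({
            n,                                       // bytes the caller still wants
            ring.size() - buf_offset,                // bytes before the ring wraps
            static_cast<size_t>(src_pos - read_pos), // bytes actually buffered
        });
        std::memcpy(dst, ring.data() + buf_offset, advance);
        dst += advance;
        read_pos += advance;
        n -= advance;
    }
}
```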
test/functional/feature_reindex.py
Outdated
nit: could replace these magic bytes

```python
return b.find(b'\xfa\xbf\xb5\xda', start)+4
```

with

```python
return b.find(MAGIC_BYTES["regtest"], start)+4
```

(the MAGIC_BYTES dictionary is defined in test_framework.p2p)
Force-pushed 6c8bc9a to a2ed1e2
Force-pushed to add benchmark (@achow101 please remove Needs Benchmark label), the PR shows about 16 to 1 improvement on my system. Also addressed review comments (thanks).
Force-pushed a2ed1e2 to cd3ea77
SkipTo() reads data from the file into the CBufferedFile object (memory), but, unlike this object's read() method, SkipTo() doesn't transfer data into a caller's memory buffer. This is useful because after skipping forward in the stream in this way, the user can, if needed, rewind the stream (SetPos()) and access the object's memory buffer including ranges that were skipped over (without needing to read from the disk file).
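A toy model of these semantics (everything here is simplified for illustration; this is not the real CBufferedFile, though the method names mirror the commit message):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy stream that models read()/SkipTo()/SetPos(); illustration only.
class ToyBufferedStream
{
    std::vector<char> m_data; // stands in for bytes already buffered from disk
    uint64_t m_pos{0};

public:
    explicit ToyBufferedStream(const std::string& bytes)
        : m_data(bytes.begin(), bytes.end()) {}

    // read(): copy bytes out to the caller and advance the position.
    void read(char* dst, size_t n)
    {
        assert(m_pos + n <= m_data.size());
        std::memcpy(dst, m_data.data() + m_pos, n);
        m_pos += n;
    }

    // SkipTo(): advance the position without copying anything out; the
    // skipped range stays in the buffer and can be revisited.
    void SkipTo(uint64_t pos)
    {
        assert(pos >= m_pos && pos <= m_data.size());
        m_pos = pos;
    }

    // SetPos(): rewind into already-buffered data.
    void SetPos(uint64_t pos) { m_pos = pos; }
};

int main()
{
    ToyBufferedStream s{"HEADER|BODYBODYBODY"};
    char header[6];
    s.read(header, 6); // consume the "header"
    s.SkipTo(19);      // skip past the "body" without copying it out
    s.SetPos(7);       // the body is needed after all: rewind into the buffer
    char body[12];
    s.read(body, 12);  // served from memory, no second "disk" read
    return 0;
}
```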
When a block is initially read from a blk*.dat file during reindexing, it can be added to the block index only if all of its ancestor blocks have been added, which is rare. If the block's ancestors have not been added, the block must be re-read from disk later, when it can be added. This commit changes the initial block read to deserialize only the header, rather than the entire block, since the header is sufficient to determine whether the block's parent (and thus all its ancestors) has been added. This is a performance improvement.
Force-pushed cd3ea77 to db92989
Another force-push (actually 2 of them) to fix linter errors.
aureleoules
left a comment
ACK db92989 - minor changes and new benchmark since last review
Force-pushed to add benchmark, the PR shows about 16 to 1 improvement on my system.
The results of this benchmark are even more impressive on my machine (SSD):
Master
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 1,088,323,673.00 | 0.92 | 0.9% | 7,649,033,881.00 | 2,438,838,336.00 | 3.136 | 412,805,077.00 | 0.7% | 12.18 | LoadExternalBlockFile |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 1,078,400,062.00 | 0.93 | 0.2% | 7,649,033,832.00 | 2,417,239,776.00 | 3.164 | 412,805,028.00 | 0.6% | 11.88 | LoadExternalBlockFile |
PR (~37x faster)
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 29,175,159.00 | 34.28 | 0.6% | 1,369,544.00 | 1,728,864.00 | 0.792 | 88,238.00 | 1.2% | 0.34 | LoadExternalBlockFile |
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|---|---|
| 27,193,734.00 | 36.77 | 1.2% | 1,369,543.00 | 1,647,264.00 | 0.831 | 88,237.00 | 1.5% | 0.32 | LoadExternalBlockFile |
theStack
left a comment
stickies-v
left a comment
re-ACK db92989
Did not go super deep on the benchmark test code but looks alright. Other than that, no meaningful changes since my previous ACK.
I saw a bench performance increase of ~24x, very impressive.
2 nits, but would prefer to try and get merged as is - only for when you really need to retouch.
benchmark results
Without improvement (48a6890):
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 246,969,041.00 | 4.05 | 2.1% | 2.71 | `LoadExternalBlockFile`
With improvement (db92989):
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10,413,666.00 | 96.03 | 0.6% | 0.12 | `LoadExternalBlockFile`
```cpp
 * This benchmark measures the performance of deserializing the block (or just
 * its header, beginning with PR 16981).
 */
static void LoadExternalBlockFile(benchmark::Bench& bench)
```
nit: for grepping purposes, perhaps useful to name this fn slightly differently?

```cpp
static void BenchLoadExternalBlockFile(benchmark::Bench& bench)
```
| // "wb+" is "binary, O_RDWR | O_CREAT | O_TRUNC". | ||
| FILE* file{fsbridge::fopen(blkfile, "wb+")}; | ||
| // Make the test block file about 128 MB in length. | ||
| for (size_t i = 0; i < node::MAX_BLOCKFILE_SIZE / ss.size(); ++i) { |
nit: a bit more clear without i imo:

```cpp
for (size_t written = 0; written < node::MAX_BLOCKFILE_SIZE; written += ss.size()) {
```
ACK db92989. Benchmark runs roughly 42x speedup over master. Very nice!

Benchmark results: this branch (db92989) vs. master (1bad29f) with the benchmark commit (48a6890) cherry-picked.

ACK db92989
Background

During the first part of reindexing, LoadExternalBlockFile() sequentially reads raw blocks from the blocks/blk00nnn.dat files (rather than receiving them from peers, as with initial block download) and eventually adds all of them to the block index. When an individual block is initially read, it can't be immediately added unless all its ancestors have been added, which is rare (only about 8% of the time), because the blocks are not sorted by height. When a block can't be immediately added to the block index, its disk location is saved in a map so it can be added later. When its parent is later added to the block index, LoadExternalBlockFile() reads and deserializes the block from disk a second time and adds it to the block index. Most blocks (92%) get deserialized twice.

This PR

During the initial read, it's rarely useful to deserialize the entire block; only the header is needed to determine whether the block can be added to the block index immediately. This change to LoadExternalBlockFile() initially deserializes only a block's header, then deserializes the entire block only if it can be added immediately. This reduces reindex time on mainnet by 7 hours on a Raspberry Pi, which translates to around a 25% reduction in the first part of reindexing (adding blocks to the index), and about a 6% reduction in overall reindex time.

Summary: the performance gain comes from deserializing each block only once; the exception is the header, which is deserialized twice, but the header is only 80 bytes.
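A condensed sketch of that decision logic follows. The helpers ReadBlockSize(), HaveBlockIndex(), AcceptBlock(), and RememberForLater() are invented stand-ins; the real code in LoadExternalBlockFile() additionally handles locking, magic-byte scanning, the genesis block, and error paths.

```cpp
// Invented-helper sketch of the header-first loop; not the real code.
while (!blkdat.eof()) {
    const uint64_t block_pos{blkdat.GetPos()};
    const uint64_t block_size{ReadBlockSize(blkdat)}; // stand-in for the file framing

    CBlockHeader header;
    blkdat >> header;                      // only 80 bytes: enough to test the parent
    blkdat.SkipTo(block_pos + block_size); // buffer the body without deserializing it

    if (HaveBlockIndex(header.hashPrevBlock)) {
        // Parent already indexed: the block can be accepted now, so pay the
        // full deserialization cost exactly once, served from the buffer.
        blkdat.SetPos(block_pos);
        CBlock block;
        blkdat >> block;
        AcceptBlock(block);
        // (The real code then also retries previously parked descendants.)
    } else {
        // Parent not seen yet (~92% of blocks): remember this block's disk
        // position and move on; it will be read and deserialized later.
        RememberForLater(header.GetHash(), block_pos);
    }
}
```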