
Conversation

@LarryRuane
Contributor

@LarryRuane LarryRuane commented Sep 29, 2019

Background

During the first part of reindexing, LoadExternalBlockFile() sequentially reads raw blocks from the blocks/blk00nnn.dat files (rather than receiving them from peers, as during initial block download) and eventually adds all of them to the block index. When an individual block is first read, it can be added immediately only if all of its ancestors have already been added, which is rare (only about 8% of the time), because the blocks are not sorted by height. When a block can't be immediately added to the block index, its disk location is saved in a map so it can be added later. When its parent is later added to the block index, LoadExternalBlockFile() reads and deserializes the block from disk a second time and adds it to the block index. Most blocks (92%) are therefore deserialized twice.

This PR

During the initial read, it's rarely useful to deserialize the entire block; only the header is needed to determine if the block can be added to the block index immediately. This change to LoadExternalBlockFile() initially deserializes only a block's header, then deserializes the entire block only if it can be added immediately. This reduces reindex time on mainnet by 7 hours on a Raspberry Pi, which translates to around a 25% reduction in the first part of reindexing (adding blocks to the index), and about a 6% reduction in overall reindex time.

Summary: the performance gain comes from deserializing each block only once, except for its header, which is deserialized twice (but the header is only 80 bytes).

@LarryRuane LarryRuane changed the title improve runtime performance of --reindex [WIP] improve runtime performance of --reindex Sep 29, 2019
@LarryRuane
Contributor Author

LarryRuane commented Sep 29, 2019

Note: this comment is outdated; the scope of this PR has been reduced.

Running bitcoind --reindex causes the client to reread the .bitcoin/blocks/blknnnnn.dat files and re-derive all the index and chainstate information. It's similar to IBD except that the client uses these local files rather than downloading blocks from the network. As LoadExternalBlockFile() reads each block from disk, it checks whether the block's parent has already been seen (and processed). If so, this block can be processed immediately (by giving it to AcceptBlock()). If the block's parent hasn't yet been seen, the block cannot be processed until its parent is; its prev-hash (the parent's hash) is put into a map of "pending" blocks (called mapBlocksUnknownParent). Later, when the parent is encountered and processed, this map is consulted to see if it has an already-seen child that can now be processed. If so, LoadExternalBlockFile() has to go back and re-read the child block from disk (ReadBlockFromDisk()), deserialize it again, and then process it (pass it to AcceptBlock()).

Performance isn't very good because about 90% of blocks are read before their parents are read. (I think this is a result of the headers-first IBD performance enhancement of a few years ago.) This can be seen by running bitcoind --reindex --debug=reindex, causing many messages like:

```
2019-09-29T00:29:00Z LoadExternalBlockFile: Out of order block 00000000a2268fb2d4ddc0408fae961a96668db58de415d3c9e0a7694eeb1657, parent 000000005fb53da5c3cda78aef28cbc44fde3e435f1dc2ac04df53387049171f not known
```

then, a short time later:

```
2019-09-29T00:29:00Z LoadExternalBlockFile: Processing out of order child 00000000a2268fb2d4ddc0408fae961a96668db58de415d3c9e0a7694eeb1657 of 000000005fb53da5c3cda78aef28cbc44fde3e435f1dc2ac04df53387049171f
```

So 90% of blocks end up being read from disk twice (actually, not twice, since the operating system's buffer cache likely still has the data, but there is still a memory-to-memory copy), and deserialized twice.

This PR mitigates this problem by taking advantage of the fact that recent blocks are still in the buffer (blkdat) that is used to read from the blk files. Since rewinding to an earlier offset in the buffer has been fixed by the recently merged #16577, we can see if the child block is still in memory (the probability is increased by making the buffer somewhat larger), and if so, avoid having to read the child block data from disk. It turns out that the child block is still in the buffer 60% of the time. This depends on some randomness that's introduced during IBD, so it may be different for you. (My IBD was done using all the default settings, nothing special.)

The second part of this change initially deserializes only the block header rather than the entire block, since about 90% of the time the block will have to be processed later anyway; only the hashPrevBlock field of the header is needed to determine whether we can process the block immediately (that is, whether its parent has been seen). This does mean that 10% of the time we deserialize the header and then immediately the full block as well (where it would have been slightly more efficient to deserialize the full block directly), but overall this is a big performance win.

On my system, this PR reduces the time to --reindex on mainnet by 48 minutes. I encourage reviewers to attempt to replicate my findings. Reindexing has two parts, reading the blocks/blknnnnn.dat files (which this PR improves), and then connecting the blocks to the blockchain (the UpdateTip messages), which this PR doesn't affect. For me, this PR reduces the first part from 123 to 75 minutes. After you start seeing UpdateTip messages, you can stop the client, and you will have the time measurement you need from debug.log.

Contributor Author

This LogPrint probably should be combined with the (existing) one a few lines below before merging, but I wanted reviewers to be aware of when the new code is being run.

Contributor Author

I don't know how to write a unit test for this PR (there are no existing unit tests for this area of the code), but test/functional/wallet_basic.py does exercise the new code -- you can add an assert(false) here and that test fails, so that's at least something.

@LarryRuane LarryRuane changed the title [WIP] improve runtime performance of --reindex Improve runtime performance of --reindex Sep 29, 2019
@LarryRuane LarryRuane marked this pull request as ready for review September 29, 2019 01:43
Contributor

@promag promag left a comment

Concept ACK.

Do you think it's possible to split this into 2 commits?

  • first commit only refactors CBufferedFile - no behavior change
  • second improves LoadExternalBlockFile.

@LarryRuane
Contributor Author

Thanks for the suggestion, @promag, done (force-pushed).

@LarryRuane
Contributor Author

I forgot to update the CBufferedFile unit tests, just did that and another force-push.

@LarryRuane LarryRuane changed the title Improve runtime performance of --reindex [WIP] Improve runtime performance of --reindex Oct 1, 2019
@LarryRuane
Contributor Author

I changed this PR to WIP because I thought of a further improvement, which I'll try to push by the end of the week. It will build on the existing two commits, so reviewing those wouldn't be a waste of time. Thanks in advance for any reviews and suggestions.

@DrahtBot
Contributor

DrahtBot commented Oct 3, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

Contributor Author

@LarryRuane LarryRuane left a comment

The main change is that the CBufferedFile memory buffer object can now span multiple files. It's very common that a block is read from disk before its parent, and that parent is found in a later blocks/blk00nnn.dat file. In that case, it's usually possible to find the earlier (child) block in the memory buffer, obviating the need to read it again from disk.

src/init.cpp Outdated
Contributor Author

This is a much larger buffer (640 MB), but it exists only during the first phase of reindexing (when you see Reindexing block... in the debug log); after that, bitcoind memory usage increases for other reasons anyway. During the first phase, memory (VSZ in the ps output) is about 2.4 GB on my system; afterwards (when it's logging UpdateTip: new best...), it's about 3.5 GB. The benefit of a large buffer is that a larger fraction of the out-of-order blocks (which have to be processed later, when their parents are seen) can be found in memory rather than having to be read again from disk. The "hit ratio" with this buffer size is just under 99%. The difference between 160 (the overall buffer size, in units of the maximum block size) and 140 (the rewind limit) determines the size of disk reads (the larger the difference, the larger and more efficient the reads).

src/streams.h Outdated
Comment on lines 819 to 708
Contributor Author

This method is unused and really doesn't make sense: the buffer contents should match the file (or, now, files) contents, and this messes that up. If it's needed in the future, it can be rethought.

Comment on lines 4662 to 4762
Contributor Author

These pim (out-of-order block found in memory) and pom (out-of-order block not in memory) counters let reviewers see the effectiveness of the read caching during testing; they can be removed before merging. Here's a typical instance:

```
2019-10-09T15:28:17Z Loaded 839 blocks from external file in 11729ms pim 726 pom 95
```

This means 821 (726+95) blocks were processed out-of-order, so 18 (839-821) blocks were processed in-order (their parents had already been seen). This is typical, and shows that almost all blocks are processed out-of-order. With this PR, 726 of these out-of-order blocks were recent enough that they were still in the buffer, and 95 had already been pushed out and had to be read again from disk. Overall, 99% of the out-of-order blocks are still in the buffer.

Contributor

I think the pim/pom statistics are useful, but maybe too special-purpose for the main log. You could use a separate BCLog::REINDEX message for that information.

@LarryRuane LarryRuane changed the title [WIP] Improve runtime performance of --reindex Improve runtime performance of --reindex Oct 9, 2019
@LarryRuane
Contributor Author

Testnet performance isn't important, but as a sanity check, I verified that the new code runs faster there too on my laptop. The time between Reindexing block file blk00001.dat... and Reindexing finished reduced from 10m21s to 7m13s.

Contributor

@mzumsande mzumsande left a comment

This looks very interesting, will review the code more closely soon.
I am concept ACK for the first iteration, but would like to understand better whether the additional benefits of the second iteration (multiple files) outweigh the costs (memory):

  • Do you have data from mainnet for the multiple-file buffer implementation? I'm interested in whether there are diminishing returns or another huge speedup compared to the first revision of this PR.
  • While other objects that use a lot of memory can be reduced (see reduce-memory.md) in order to run a full node in low-memory environments, the large size of CBufferedFile is hardcoded, which does not seem ideal. The minimum requirements mentioned at https://bitcoin.org/en/bitcoin-core/features/requirements are also considerably lower than the 2.4 GB used here.

@LarryRuane
Contributor Author

@mzumsande, those are very good thoughts, thank you. I will study this further (and invite others to do the same). One nice thing about this design is that any amount of memory will work and still be an improvement -- it's one of those classic space-time tradeoffs. The idea of not hard-coding the buffer size is intriguing and promising. I don't think we would go back to the single-file version, because even with the buffer reduced to the original size, the multi-file version is an improvement. (That is to say, there is no downside to the latest version here.) All the multi-file commit really changes is that we no longer discard cached data when switching files.

One thing I hope I made clear enough in an earlier comment is that I tried to keep the memory usage during reindexing strictly less than what is needed during normal operation, so that there would be no overall increase in memory demand. We certainly would not want this design to create a spike (high-water mark) in memory usage. But I was very likely mistaken, since I was not aware that memory could be "dialed down" for small environments.

@LarryRuane
Contributor Author

LarryRuane commented Oct 15, 2019

I force-pushed a reorganization of the commits to make reviewing easier. I split the largest commit, speed up reindex by reading out-of-order blocks from memory, into two separate commits:

  1. Commit message: initially deserialize only CBlockHeader, not CBlock. Deserialize only a block header (only 80 bytes) initially, instead of a full block. The block header is all that's needed to tell us whether this block's parent has been seen yet; if not, the block will have to be deserialized later, after its parent is found. Nearly all blocks (89% on my system) are seen before their parents, so this is a big improvement.
  2. Commit message: speed up reindex by reading out-of-order blocks from memory. If we've just processed a block that's the parent of an earlier-seen block, we must now process that earlier block. Instead of always reading that block from disk, see if it happens to still be in the circular disk-read memory buffer (`CBufferedFile blkdat`). This saves a disk read.

We could merge the first commit without the second. The first commit is pretty simple; the second commit is a little more involved.

These two commits leave the buffer size unchanged from master, 8 MB (2 * MAX_BLOCK_SERIALIZED_SIZE), so there is no change in memory usage. But with this buffer size, only 29% of the out-of-order blocks are found in the buffer. If we increase the buffer moderately, to 32 MB (rather than the 640 MB used in an earlier commit that's now been erased by force-push), the "hit rate" increases to 50%. This is probably a good space-time balance.

Here are the block-load times on my system (with all default settings), which is just the first part of reindexing (each row builds on the previous row):

| version | real time (minutes) |
|:--------|--------------------:|
| master | 130 |
| deserialize only header | 90 |
| look for blocks in buffer, 8 MB | 84 |
| look for blocks in buffer, 32 MB | 78 |
| look for blocks in buffer, 640 MB | 72 |

(Take with a grain of salt, because the times varied even on identical runs, and a lot probably depends on the specifics of my system.) Keep in mind that this buffer exists only during the block-load part of reindex.

Contributor

@mzumsande mzumsande left a comment

Concept ACK.
Thanks for the statistics, I agree that 32 MB is a reasonable size.

src/streams.h Outdated
Contributor

Is it necessary to remove the FILE from the constructor? Couldn't you just initialize it with nullptr in the multi-file case and call SetFile() later, so that there need be no syntax changes for the single-file case? It is also easy to create a CBufferedFile object and forget about SetFile(), leading to a memory access violation.

Contributor Author

@mzumsande, good idea, I restored that argument to the constructor.


@maflcko
Member

maflcko commented Oct 18, 2019

Under what circumstances is a -reindex needed these days? Only when disabling -prune? Though, then you'll download from the network anyway.

@leishman
Contributor

@MarcoFalke Building new indices is required if upgrading from pre-0.17 to post-0.17 (https://bitcoin.org/en/release/v0.17.0) with a txindex.

> If your node has a txindex, the txindex db will be migrated the first time you run 0.17.0 or newer, which may take up to a few hours. Your node will not be functional until this migration completes.

I haven't looked into whether this PR speeds up that process, but it is something to consider.

@LarryRuane
Contributor Author

Force-pushed to address review comments (thank you, @stickies-v), ready for review.

Contributor

@stickies-v stickies-v left a comment

re-ACK 451b1f3

I think this latest refactoring strikes a nice balance between optimization, maintenance and readability. Thank you for incorporating my comments, and my apologies to previous reviewers for making you drift away a bit further from the initial proposal. I think this is great! Hopefully we can get some more (re) ACKs on this soon.

@LarryRuane
Contributor Author

Rebased, I'll look into adding a benchmark test.

Contributor

@aureleoules aureleoules left a comment

Code Review ACK 6c8bc9a19cc9f35a93d264b69f63d4c93e2c7da2
LGTM and saw a 25%-30% reduction in the first part of reindexing on testnet.

src/streams.h Outdated
Contributor

nit e1df0eb806d919cecceebcb2c72c3f54b1a529f5

Suggested change:

```diff
-private:
```

there is already a `private:` identifier above

Contributor

@theStack theStack left a comment

Code-review ACK 6c8bc9a19cc9f35a93d264b69f63d4c93e2c7da2

In addition to the performance improvement, this PR also increases the readability of the CBufferedFile::read logic by using more expressive names (special kudos for using std::min for determining the advance value, nice idea!).

Did three -reindex runs for both the master and PR branch (with a patch to shut down after emitting the "Reindexing finished" debug message) each on testnet at block height 2377503 with quite significant speed-up results:

master:

```
real    4m58.987s, user    4m0.337s, sys     0m38.207s
real    4m54.370s, user    4m2.202s, sys     0m36.742s
real    4m54.738s, user    4m2.151s, sys     0m37.362s
```

PR:

```
real    3m33.332s, user    2m42.151s, sys     0m36.165s
real    3m52.371s, user    2m42.028s, sys     0m37.558s
real    3m36.388s, user    2m41.354s, sys     0m36.710s
```

Contributor

nit: could replace these magic bytes

Suggested change:

```diff
-return b.find(b'\xfa\xbf\xb5\xda', start)+4
+return b.find(MAGIC_BYTES["regtest"], start)+4
```

(the MAGIC_BYTES dictionary is defined in test_framework.p2p)

@LarryRuane
Contributor Author

Force-pushed to add a benchmark (@achow101, please remove the Needs Benchmark label); the PR shows about a 16-to-1 improvement on my system.

Also addressed review comments (thanks).
@theStack, @aureleoules, @stickies-v, @vasild, @hebasto, please re-review, thank you!

SkipTo() reads data from the file into the CBufferedFile object
(memory), but, unlike this object's read() method, SkipTo() doesn't
transfer data into a caller's memory buffer. This is useful because
after skipping forward in the stream in this way, the user can, if
needed, rewind the stream (SetPos()) and access the object's memory
buffer including ranges that were skipped over (without needing to
read from the disk file).
When a block is initially read from a blk*.dat file during reindexing,
it can be added to the block index only if all of its ancestor blocks
have been added, which is rare. If the block's ancestors have not been
added, the block must be re-read from disk later when it can be added.

This commit: During the initial block read, deserialize only its header,
rather than the entire block, since this is sufficient to determine
if its parent (and thus all its ancestors) has been added. This is a
performance improvement.
@LarryRuane
Contributor Author

Another force-push (actually 2 of them) to fix linter errors.

Contributor

@aureleoules aureleoules left a comment

ACK db92989 - minor changes and new benchmark since last review

> Force-pushed to add benchmark, the PR shows about 16 to 1 improvement on my system.

The results of this benchmark are even more impressive on my machine (SSD):

Master

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|--------------------:|------:|-----:|-------------------:|-------------------:|------:|-----------------:|-----:|------:|:----------|
| 1,088,323,673.00 | 0.92 | 0.9% | 7,649,033,881.00 | 2,438,838,336.00 | 3.136 | 412,805,077.00 | 0.7% | 12.18 | `LoadExternalBlockFile` |
| 1,078,400,062.00 | 0.93 | 0.2% | 7,649,033,832.00 | 2,417,239,776.00 | 3.164 | 412,805,028.00 | 0.6% | 11.88 | `LoadExternalBlockFile` |

PR (~37x faster)

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|--------------------:|------:|-----:|-------------------:|-------------------:|------:|-----------------:|-----:|------:|:----------|
| 29,175,159.00 | 34.28 | 0.6% | 1,369,544.00 | 1,728,864.00 | 0.792 | 88,238.00 | 1.2% | 0.34 | `LoadExternalBlockFile` |
| 27,193,734.00 | 36.77 | 1.2% | 1,369,543.00 | 1,647,264.00 | 0.831 | 88,237.00 | 1.5% | 0.32 | `LoadExternalBlockFile` |

Contributor

@theStack theStack left a comment

re-ACK db92989

The benchmark yields a ~13.9x speed-up on my machine.

master branch (commit 48a6890):

| ns/op | op/s | err% | total | benchmark |
|--------------------:|------:|-----:|------:|:----------|
| 309,424,533.00 | 3.23 | 0.6% | 3.41 | `LoadExternalBlockFile` |

PR branch (commit db92989):

| ns/op | op/s | err% | total | benchmark |
|--------------------:|------:|-----:|------:|:----------|
| 22,226,060.00 | 44.99 | 3.6% | 0.26 | `LoadExternalBlockFile` |

Contributor

@stickies-v stickies-v left a comment

re-ACK db92989

Did not go super deep on the benchmark test code but looks alright. Other than that, no meaningful changes since my previous ACK.

I saw a bench performance increase of ~24x, very impressive.

2 nits, but would prefer to try and get merged as is - only for when you really need to retouch.

benchmark results

Without improvement (48a6890):

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|      246,969,041.00 |                4.05 |    2.1% |      2.71 | `LoadExternalBlockFile`

With improvement (db92989):

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|       10,413,666.00 |               96.03 |    0.6% |      0.12 | `LoadExternalBlockFile`

```cpp
/**
 * This benchmark measures the performance of deserializing the block (or just
 * its header, beginning with PR 16981).
 */
static void LoadExternalBlockFile(benchmark::Bench& bench)
```
Contributor

nit: for grepping purposes, perhaps useful to name this fn slightly different?

Suggested change:

```diff
-static void LoadExternalBlockFile(benchmark::Bench& bench)
+static void BenchLoadExternalBlockFile(benchmark::Bench& bench)
```

// "wb+" is "binary, O_RDWR | O_CREAT | O_TRUNC".
FILE* file{fsbridge::fopen(blkfile, "wb+")};
// Make the test block file about 128 MB in length.
for (size_t i = 0; i < node::MAX_BLOCKFILE_SIZE / ss.size(); ++i) {
Contributor

nit: a bit more clear without i imo.

Suggested change:

```diff
-for (size_t i = 0; i < node::MAX_BLOCKFILE_SIZE / ss.size(); ++i) {
+for (size_t written = 0; written < node::MAX_BLOCKFILE_SIZE; written += ss.size()) {
```

@andrewtoth
Contributor

andrewtoth commented Oct 28, 2022

ACK db92989

The benchmark shows roughly a 42x speedup over master. Very nice!

Would it make sense to have a fuzz target for CBufferedFile? I see one exists already.

Benchmark results

This Branch (db92989)

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|--------------------:|------:|-----:|-------------------:|-------------------:|------:|-----------------:|-----:|------:|:----------|
| 15,150,607.00 | 66.00 | 2.0% | 1,366,806.00 | 948,612.00 | 1.441 | 87,657.00 | 0.6% | 0.20 | `LoadExternalBlockFile` |

Master (1bad29f) with the benchmark commit (48a6890) cherry-picked

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|--------------------:|------:|-----:|-------------------:|-------------------:|------:|-----------------:|-----:|------:|:----------|
| 629,905,824.00 | 1.59 | 1.8% | 7,542,006,593.00 | 1,353,516,328.00 | 5.572 | 392,803,992.00 | 0.4% | 7.07 | `LoadExternalBlockFile` |

@achow101
Member

ACK db92989

@achow101 achow101 merged commit 5602cc7 into bitcoin:master Nov 16, 2022
@LarryRuane LarryRuane deleted the reindex-speedup branch November 16, 2022 02:37
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Nov 16, 2022
@bitcoin bitcoin locked and limited conversation to collaborators Nov 16, 2023
