validation: Make ReplayBlocks interruptible #30155

mzumsande · 2024-05-22T15:12:21Z

This addresses the problem from #11600 that the Rolling Forward process after a crash (ReplayBlocks()) is uninterruptible. ReplayBlocks can take a long time to finish, and this can be especially annoying to GUI users who are taunted to "press q to shutdown" even though pressing "q" does nothing.

Now, when an interrupt is received during ReplayBlocks(), the intermediate progress is saved:
In addition to writing the updated coins to disk, the flush adjusts the lower head block (saved in DB_HEAD_BLOCKS), so that with the next restart, the replay continues from where it was stopped without losing progress.

I tested this manually on signet: A situation where ReplayBlocks() becomes necessary can be created by syncing with -dbcrashratio=1 and stopping the node after ~70k blocks. Then, after the next restart, the replay process can be interrupted. I then used the dumptxoutset output to check that the final utxo-set hash is unaffected by how many times the process is interrupted.

DrahtBot · 2024-05-22T15:12:24Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30155.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	TheCharlatan, andrewtoth, l0rinc

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#33443 (log: reduce excessive "rolling back/forward" messages during block replay by l0rinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

src/validation.cpp

luke-jr · 2024-05-23T17:17:10Z

src/txdb.cpp

        std::vector<uint256> old_heads = GetHeadBlocks();
        if (old_heads.size() == 2) {
            if (old_heads[0] != hashBlock) {
-                LogPrintLevel(BCLog::COINDB, BCLog::Level::Error, "The coins database detected an inconsistent state, likely due to a previous crash or shutdown. You will need to restart bitcoind with the -reindex-chainstate or -reindex configuration option.\n");


Is it a safe assumption that this cannot happen except an interrupted replay?

Currently we abort with an assert in the next line after the log, so the expectation in master is that this shouldn't happen outside of db corruption. I couldn't find any reports on the internet of users that have run into this error / assert, so it probably doesn't happen very often in practice.
But theoretically speaking, if the coins db gets corrupted, I guess anything can happen depending on the specifics of the corruption?

While working on this PR, I initially went with the different approach of adding an explicit in_replay flag to BatchWrite(), and setting that flag only during the interrupt flush. I switched to the current approach of detecting this implicitly inside BatchWrite() because it seemed more elegant and led to less boilerplate code, but it's still an option.

In commit "validation: Make ReplayBlocks interruptible" (26e78f2)

I switched to the current approach of detecting this implicitly inside BatchWrite() because it seemed more elegant and led to less boilerplate code, but it's still an option.

I would definitely be curious to see the other approach. Not just because it might be better at detecting inconsistent states like Luke is suggesting, but also because it seems like it would make the code more clear.

I think it would be pretty hard to look at the code after this PR (without looking at the PR itself), and understand why old_heads[0] != hashBlock implies that this BatchWrite call is finalizing a replay that is currently being interrupted.

I was also confused by the name of the "unfinished_replay" variable below, because I thought "unfinished replay" was just referring to a replay of an previously unfinished flush. I did not realize it was referring specifically to an interrupted replay of an unfinished flush.

Also as a possible followup, for future maintainability of this code I think it could be a good idea to remove the replay logic entirely from BatchWrite(), since ReplayBlocks() is the only BatchWrite() caller that should need it. Other BatchWrite calls should be able to assume and assert that GetHeadBlocks() is empty, and this is just a normal write, not a replay.

I would definitely be curious to see the other approach. Not just because it might be better at detecting inconsistent states like Luke is suggesting, but also because it seems like it would make the code more clear.

See mzumsande@db8232d - because of the inheritance the bool interrupted_replay needed to be added to quite a few spots, which I didn't love.

I think it would be pretty hard to look at the code after this PR (without looking at the PR itself), and understand why old_heads[0] != hashBlock implies that this BatchWrite call is finalizing a replay that is currently being interrupted.

I'll add more documentation (if I don't switch to the other approach).

I was also confused by the name of the "unfinished_replay" variable below,

Maybe interrupted_replay is better? But the issue remains that it's the replay that is being interrupted (or unfinished), and not the flush.

Also as a possible followup, for future maintainability of this code I think it could be a good idea to remove the replay logic entirely from BatchWrite(), since ReplayBlocks() is the only BatchWrite() caller that should need it. Other BatchWrite calls should be able to assume and assert that GetHeadBlocks() is empty, and this is just a normal write, not a replay.

Seems reasonable for a follow-up, though I haven't looked into details yet.

sedited · 2024-10-15T14:54:35Z

Concept ACK

andrewtoth · 2024-10-19T17:52:27Z

Concept ACK

It appears to me that the size of the cache used in ReplayBlocks is never checked if it grows too large. Wouldn't this mean that if the node was sufficiently far behind, the rolling forward would fill up the cache to levels exceeding dbcache? This would result in a cycle of rolling forward and then OOM crashing, never being able to write any chainstate to disk.

mzumsande · 2025-03-17T15:41:58Z

It appears to me that the size of the cache used in ReplayBlocks is never checked if it grows too large. Wouldn't this mean that if the node was sufficiently far behind, the rolling forward would fill up the cache to levels exceeding dbcache? This would result in a cycle of rolling forward and then OOM crashing, never being able to write any chainstate to disk.

Good point - I think that is true, ReplayBlocks() uses its own cache, which doesn't seem to be subject to any limits. However, the fact that we got into this position in the first place means that the user specified a certain dbcache in their original run that was shutdown uncleanly, and I guess that will be respected by design because we don't connect any blocks that weren't also connected in the first run?!

I guess users don't get the chance to change their preference in between the original run and the replay though, which is not ideal. Maybe the method introduced in this PR could also be used to do regular flushes.

l0rinc · 2025-09-20T06:15:57Z

I have tested this with a recent interrupted reindex, it correctly interrupts and flushes and on next start continues from where we left off:

2025-09-20T06:11:25Z Rolling forward 00000000127100b0bcdde9e9f363ae231c73e885e2d527821ef9ce0457930494 (51141)
2025-09-20T06:11:25Z Rolling forward 0000000013484b73eff09514abc3f916b319eb9f43c55f051be9e760ad2210d0 (51142)
^C2025-09-20T06:11:25Z Flushing intermediate state of replay
2025-09-20T06:11:25Z [error] Unable to replay blocks. You will need to rebuild the database using -reindex-chainstate.
Error: Unable to replay blocks. You will need to rebuild the database using -reindex-chainstate.
2025-09-20T06:11:25Z Shutdown in progress...

and on a next run continues from where we left off:

2025-09-20T06:14:27Z Replaying blocks
2025-09-20T06:14:27Z Rolling forward 0000000002a5a4503b56e432dea4647e99037ac20b9499d27a0782d7c2fc417a (51143)
2025-09-20T06:14:27Z Rolling forward 000000001732f85802afe0ff3680ce45f62c21e03b6924054df5993c3def3c85 (51144)

Needs a rebase, but otherwise Concept ACK

Now, when an interrupt is received during ReplayBlocks the intermediate progress is flushed: In addition to updating the coins on disk, this flush adjusts the head blocks, so that with the next restart, the replay continues from where we stopped.

mzumsande · 2025-09-22T15:43:03Z

Needs a rebase, but otherwise Concept ACK

Rebased for fresh CI (rebase seemed clean, didn't change anything)

mzumsande · 2025-10-29T21:03:01Z

Closing as "up for grabs" - #11600 is already closed, and because we now flush the chainstate every hour replays in general shouldn't take as much time anymore, so there is less need to interrupt/continue them.

That said, it could still be a nice-to-have if someone wants to pick it up (note #30155 (comment) in this case, maybe there is a more elegant way I didn't see).

DrahtBot added the Validation label May 22, 2024

mzumsande mentioned this pull request May 22, 2024

"Rolling forward" at startup can take a long time, and is not interruptible (after unclean shutdown) #11600

Closed

maflcko reviewed May 22, 2024

View reviewed changes

src/validation.cpp Outdated Show resolved Hide resolved

mzumsande force-pushed the 202405_replay_blocks branch from 4bd626e to 26e78f2 Compare May 22, 2024 20:32

luke-jr reviewed May 23, 2024

View reviewed changes

This was referenced Jun 25, 2024

RFC: Instanced logs #30338

Closed

kernel, logging: Pass Logger instances to kernel objects #30342

Draft

logging: Use LogFatal instead LogError for fatal errors #30347

Closed

logging: Replace LogError and LogWarning with LogAlert #30364

Closed

achow101 requested a review from sipa October 15, 2024 14:54

DrahtBot added the CI failed label Mar 17, 2025

mzumsande force-pushed the 202405_replay_blocks branch from 26e78f2 to 4d1342f Compare March 17, 2025 15:06

DrahtBot removed the CI failed label Mar 17, 2025

mzumsande mentioned this pull request Sep 20, 2025

validation: make the "rolling forward" loop interruptible #33442

Closed

mzumsande force-pushed the 202405_replay_blocks branch from 4d1342f to d200cd5 Compare September 22, 2025 15:41

DrahtBot mentioned this pull request Sep 26, 2025

log: reduce excessive "rolling back/forward" messages during block replay #33443

Merged

mzumsande closed this Oct 29, 2025

glozow added the Up for grabs label Oct 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

validation: Make ReplayBlocks interruptible #30155

validation: Make ReplayBlocks interruptible #30155

Uh oh!

mzumsande commented May 22, 2024 •

edited

Loading

Uh oh!

DrahtBot commented May 22, 2024 •

edited

Loading

Uh oh!

Uh oh!

luke-jr May 23, 2024

Uh oh!

mzumsande May 23, 2024 •

edited

Loading

Uh oh!

ryanofsky Jun 6, 2024

Uh oh!

mzumsande Jun 10, 2024

Uh oh!

sedited commented Oct 15, 2024

Uh oh!

andrewtoth commented Oct 19, 2024

Uh oh!

mzumsande commented Mar 17, 2025

Uh oh!

l0rinc commented Sep 20, 2025

Uh oh!

mzumsande commented Sep 22, 2025

Uh oh!

mzumsande commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

validation: Make ReplayBlocks interruptible #30155

validation: Make ReplayBlocks interruptible #30155

Uh oh!

Conversation

mzumsande commented May 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented May 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

Uh oh!

Uh oh!

luke-jr May 23, 2024

Choose a reason for hiding this comment

Uh oh!

mzumsande May 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryanofsky Jun 6, 2024

Choose a reason for hiding this comment

Uh oh!

mzumsande Jun 10, 2024

Choose a reason for hiding this comment

Uh oh!

sedited commented Oct 15, 2024

Uh oh!

andrewtoth commented Oct 19, 2024

Uh oh!

mzumsande commented Mar 17, 2025

Uh oh!

l0rinc commented Sep 20, 2025

Uh oh!

mzumsande commented Sep 22, 2025

Uh oh!

mzumsande commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

mzumsande commented May 22, 2024 •

edited

Loading

DrahtBot commented May 22, 2024 •

edited

Loading

mzumsande May 23, 2024 •

edited

Loading