Description
Bug Report
Setup
CometBFT version: v0.38.x branch at 721ac3c, or v0.38.2 tag.
Have you tried the latest version: yes
ABCI app (name for built-in, URL for self-written if it's publicly available): Unavailable to link our app yet, sorry.
Environment:
- OS (e.g. from /etc/os-release): Arch linux or Ubuntu or Darwin
- Install tools: go build with Go 1.21.5
What happened?
When a node is still in block sync (catch-up) and rapidly syncing blocks, stopping the Node cleanly leaves the blocksync.(*Reactor).poolRoutine goroutine running after blocksync.(*Reactor).OnStop returns, and even after node.(*Node).OnStop returns. This typically results in one of two panics, because poolRoutine keeps operating on databases that have already been closed.
This does not happen after block sync has switched to the consensus reactor.
What did you expect to happen?
OnStop should wait for poolRoutine to return, so the node shuts down cleanly with no panics.
How to reproduce it
Get one node running and synced on a chain with at least a few hundred blocks, which makes the bug easier to catch. It can be the only node, running the only validator, on this test network.
Start a second node as a non-validator and connect it to the validator. It begins blocksync (do not use snapshot sync). After a few seconds, the block rate picks up steadily. While it is processing blocks quickly, interrupt/cancel to trigger (*Node).Stop -> OnStop, which begins stopping services and reactors.
You will usually observe a panic with blocksync.(*Reactor).poolRoutine in the call stack. A couple of attempts may be needed, particularly if blocks are still being processed relatively slowly. There seems to be a short warm-up period before the sync rate becomes steady and fast.
Logs
panic: Failed to process committed block (327:D2EB81F27986A5CCD2E9C1C9EEED4D11B75BD99B2318E0662D8AE42BA9553D61): failed to create new app hash: DB Closed
goroutine 224 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc00b6cc1e0, 0x0)
/home/jon/github/cometbft/cometbft/blocksync/reactor.go:511 +0x18c8
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart in goroutine 220
/home/jon/github/cometbft/cometbft/blocksync/reactor.go:124 +0x6e
(DB Closed)
OR
panic: leveldb: closed
goroutine 269 [running]:
github.com/cometbft/cometbft/state.dbStore.save({{_, _}, {_}}, {{{0xb, 0x0}, {0xc00034a2b0, 0x6}}, {0xc0005ae180, 0x13}, 0x1, ...}, ...)
/home/jon/github/cometbft/cometbft/state/store.go:220 +0x3f6
github.com/cometbft/cometbft/state.dbStore.Save(...)
/home/jon/github/cometbft/cometbft/state/store.go:186
github.com/cometbft/cometbft/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc00034a2b0, 0x6}}, {0xc0005ae180, 0x13}, 0x1, 0x19a, {{0xc004873b60, ...}, ...}, ...}, ...)
/home/jon/github/cometbft/cometbft/state/execution.go:291 +0x11d6
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0170da000, 0x0)
/home/jon/github/cometbft/cometbft/blocksync/reactor.go:508 +0x1457
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart in goroutine 263
/home/jon/github/cometbft/cometbft/blocksync/reactor.go:124 +0x6e
Anything else we need to know
The resolution is almost trivial, but I wanted to put up an issue before a PR as per the Contributing Guidelines.
With the following fix, shutdown waits and avoids any panic from poolRoutine:
I'll put up a PR for the above.