blocksync: stopping a node while block sync (catch-up) is running often causes a panic #1878

@jchappelow

Bug Report

Setup

CometBFT version: v0.38.x branch at 721ac3c, or v0.38.2 tag.

Have you tried the latest version: yes

ABCI app (name for built-in, URL for self-written if it's publicly available): Unavailable to link our app yet, sorry.

Environment:

  • OS (e.g. from /etc/os-release): Arch Linux, Ubuntu, or Darwin
  • Install tools: go build with Go 1.21.5

What happened?

If you cleanly stop a node that is still in block sync (catch-up) and rapidly syncing blocks, the blocksync.(*Reactor).poolRoutine goroutine keeps running after blocksync.(*Reactor).OnStop returns, and even after node.(*Node).OnStop returns. It then continues to operate on databases that have already been closed, which typically results in one of the two panics shown under Logs.

This does not happen after block sync has switched to the consensus reactor.

What did you expect to happen?

Stopping the node should wait for poolRoutine to return and shut down cleanly with no panics.
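
The expected behavior can be illustrated with a minimal, self-contained sketch. It is not CometBFT code: the store and reactor types, newReactor, and the quit channel are hypothetical stand-ins, and using a sync.WaitGroup is only one assumed way to make Stop wait for the loop, not necessarily the fix proposed in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClosed stands in for errors such as "leveldb: closed" in the logs.
var errClosed = errors.New("store closed")

// store is a toy stand-in for a database that rejects writes after Close.
type store struct {
	mu     sync.Mutex
	closed bool
	height int64
}

// Save records a height, or fails if the store has been closed.
func (s *store) Save(h int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return errClosed
	}
	s.height = h
	return nil
}

// Close marks the store as closed; later Save calls return errClosed.
func (s *store) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
}

// reactor is hypothetical; it is not the CometBFT blocksync.Reactor.
type reactor struct {
	db   *store
	quit chan struct{}
	wg   sync.WaitGroup
}

func newReactor(db *store) *reactor {
	return &reactor{db: db, quit: make(chan struct{})}
}

// Start launches the sync loop and registers it with the WaitGroup.
func (r *reactor) Start() {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		r.poolRoutine()
	}()
}

// poolRoutine saves heights until quit is closed and panics on a failed
// save, mimicking the failure mode in the reported traces.
func (r *reactor) poolRoutine() {
	for h := int64(1); ; h++ {
		select {
		case <-r.quit:
			return
		default:
		}
		if err := r.db.Save(h); err != nil {
			panic(fmt.Sprintf("failed to process committed block (%d): %v", h, err))
		}
	}
}

// Stop signals the loop to exit and blocks until it has returned.
func (r *reactor) Stop() {
	close(r.quit)
	r.wg.Wait()
}

func main() {
	db := &store{}
	r := newReactor(db)
	r.Start()
	r.Stop()   // returns only after poolRoutine has exited
	db.Close() // safe: nothing can call Save any more
	fmt.Println("clean shutdown, closed:", db.closed)
}
```

In this sketch, removing the r.wg.Wait() call from Stop would let db.Close() race with a pending Save in poolRoutine, which is the same shape of failure as the panics in the logs.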

How to reproduce it

Get one node running and synced on a chain with at least a few hundred blocks to make the problem easier to catch. It can be the only node, and the only validator, on this test network.

Start a second node as a non-validator and connect it to the validator. It begins blocksync (do not use snapshot sync). After a few seconds, the block rate picks up steadily. When it is processing blocks quickly, interrupt/cancel to trigger (*Node).Stop -> OnStop, which begins stopping services and reactors.

You will usually observe a panic with blocksync.(*Reactor).poolRoutine in the call stack. A couple of attempts may be needed, particularly if blocks are still being processed relatively slowly. There seems to be a short warm-up period before syncing becomes steady and fast.

Logs

panic: Failed to process committed block (327:D2EB81F27986A5CCD2E9C1C9EEED4D11B75BD99B2318E0662D8AE42BA9553D61): failed to create new app hash: DB Closed

goroutine 224 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc00b6cc1e0, 0x0)
	/home/jon/github/cometbft/cometbft/blocksync/reactor.go:511 +0x18c8
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart in goroutine 220
	/home/jon/github/cometbft/cometbft/blocksync/reactor.go:124 +0x6e

(DB Closed)

OR

panic: leveldb: closed

goroutine 269 [running]:
github.com/cometbft/cometbft/state.dbStore.save({{_, _}, {_}}, {{{0xb, 0x0}, {0xc00034a2b0, 0x6}}, {0xc0005ae180, 0x13}, 0x1, ...}, ...)
	/home/jon/github/cometbft/cometbft/state/store.go:220 +0x3f6
github.com/cometbft/cometbft/state.dbStore.Save(...)
	/home/jon/github/cometbft/cometbft/state/store.go:186
github.com/cometbft/cometbft/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc00034a2b0, 0x6}}, {0xc0005ae180, 0x13}, 0x1, 0x19a, {{0xc004873b60, ...}, ...}, ...}, ...)
	/home/jon/github/cometbft/cometbft/state/execution.go:291 +0x11d6
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0170da000, 0x0)
	/home/jon/github/cometbft/cometbft/blocksync/reactor.go:508 +0x1457
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart in goroutine 263
	/home/jon/github/cometbft/cometbft/blocksync/reactor.go:124 +0x6e

Anything else we need to know

The resolution is almost trivial, but I wanted to put up an issue before a PR as per the Contributing Guidelines.

With the following fix, shutdown waits and avoids any panic from poolRoutine:

I'll put up a PR for the above.
