fix: ensure parity-db WAL is drained on SIGTERM shutdown by skylar-simoncelli · Pull Request #1140 · midnightntwrk/midnight-node

skylar-simoncelli · 2026-04-01T02:02:20Z

Overview

Fix silent chain-state truncation that occurs after unclean node shutdowns. On restart, affected nodes report a "Highest known block" thousands of blocks behind the actual chain tip, forcing a slow re-sync from that point via the network.

Root cause: The Substrate database backend (Arc<sc_client_db::Backend>) is not explicitly dropped during shutdown. When the tokio runtime's 60-second timeout fires and aborts still-running tasks (AURA, GRANDPA, BEEFY, network sync), their Arc clones are leaked. This prevents parity-db's Drop impl from ever running, which means kill_logs() — the function that joins background I/O threads and drains the WAL pipeline — never executes. On the next startup, parity-db finds a partially-written WAL, discards all remaining log records due to a sequence validation failure, and the BEST_BLOCK metadata pointer reverts to whatever was last committed to the on-disk table files.

The fix mirrors the existing pattern used for the midnight ledger storage (drop_all_default_storage() at command.rs:277): stash an Arc<Backend> outside the tokio runtime closure, and explicitly drop it after run_node_until_exit returns.

Observed failures (guardnet, 2026-03-31)

We cycled all 15 midnight nodes on guardnet (graceful systemctl stop + terminate) and observed 2 out of 15 nodes restart with truncated chain state:

Node	Type	Pre-stop block	Startup block	Delta
krill	bridge (archive)	#272,185	#187,554	-84,631
mullet	validator (pruned)	#272,574	#187,554	-85,020

All other 13 nodes restarted at their pre-stop block height. On retesting krill (2 more cycles) and mullet (1 more cycle), all passed — the bug is not deterministic per-node.

A colleague observed the same symptom on a separate occasion (Highest known block at #187641), and another observed #15 (near-genesis) after a restart.

Key observations

Both failures landed on exactly fix: error when loading chain spec for non-dev cfg_preset #187,554 — same block on different nodes with different EBS volumes
Not node-type-specific: one bridge (archive pruning), one validator (pruned 256)
Not volume-specific: retests on the same nodes/volumes passed
Nodes download from peers, not local replay: after the truncation, the node syncs at ~17 bps via network (⬇ 19.9kiB/s), confirming the blocks above fix: error when loading chain spec for non-dev cfg_preset #187,554 are genuinely inaccessible in the local ParityDB
ParityDB stats confirmed: Column 0 on the failed node showed Total values: 187,561 — matching the truncation point exactly. The table files only contained metadata up to that block; everything above was only in the WAL

Technical analysis

parity-db's async commit pipeline

parity-db (0.4.13) uses a 4-stage async pipeline:

db.commit(tx) → commit_queue (memory) → log_worker → BufWriter (WAL file)
                                                           ↓
                                    flush_worker (fsync when file > 64MB)
                                                           ↓
                                    commit_worker (enact to table files)
                                                           ↓
                                    cleanup_worker (fsync tables, truncate WAL)

db.commit() returns immediately after pushing to the in-memory queue — the data is NOT on disk. The flush_worker only fsyncs the WAL file when it exceeds MIN_LOG_SIZE_BYTES = 64 MB. This means thousands of commits (including BEST_BLOCK metadata updates) can sit in an unfsynced BufWriter.

The WAL replay failure on startup

When parity-db opens, it calls replay_all_logs() which iterates WAL files and calls enact_logs(validation_mode=true). The critical check (db.rs:591):

if reader.record_id() != self.last_enacted.load(Ordering::Relaxed) + 1 {
    self.log.clear_replay_logs();  // silently discards ALL remaining WAL records
    return Ok(false)
}

If any WAL record has a sequence gap or CRC mismatch (e.g., from a partially-written BufWriter that was never fsynced), all remaining WAL records are silently discarded — even valid ones after the gap. clear_replay_logs() moves them to the cleanup queue where they are truncated to zero.

The shutdown race

Substrate's shutdown sequence on SIGTERM:

SIGTERM → tokio signal resolves → TaskManager dropped → 60s timeout → process exits

The TaskManager drop sends an exit signal to all spawned tasks. Tasks get 60 seconds to finish. After the timeout, tokio aborts remaining tasks, leaking any Arc references they hold.

The Substrate Backend (containing the parity-db handle) is held via Arc<Backend> by multiple spawned tasks (AURA, GRANDPA, BEEFY, network sync). If any task does not exit within 60 seconds, its Arc<Backend> clone is leaked, the Arc refcount never reaches zero, and Db::drop() never runs.

The ledger DB does not have this problem because command.rs:277 already calls drop_all_default_storage() explicitly after the runtime shuts down.

Configuration verification

Both sync_wal and sync_data are at their defaults (true) — neither midnight-node nor Substrate overrides them. The issue is not about fsync settings but about the Drop impl never executing.

🗹 TODO before merging

Ready

📌 Submission Checklist

Changes are backward-compatible (or flagged if breaking)
Pull request description explains why the change is needed
Self-reviewed the diff
I have included a change file, or skipped for this reason:
If the changes introduce a new feature, I have bumped the node minor version
Update documentation (if relevant)
Updated AGENTS.md if build commands, architecture, or workflows changed
No new todos introduced

🧪 Testing Evidence

Tested on guardnet environment (15 EC2 nodes, AWS account 337112168969):

Cycled all 15 nodes with graceful systemctl stop + instance termination
13/15 passed (startup block = pre-stop block), 2/15 failed (startup block = fix: error when loading chain spec for non-dev cfg_preset #187,554)
Retested both failed nodes 2-3 more times after they re-synced — all passed
Root cause confirmed via ParityDB stats, WAL file inspection, and Substrate source code analysis

🔱 Fork Strategy

Node Client Update

Explicitly hold and drop the Substrate database backend outside the tokio runtime, ensuring parity-db's Drop impl runs after all async tasks have been stopped. This mirrors the pattern already used for the midnight ledger storage (drop_all_default_storage). Without this fix, the backend's Arc<Backend> may be leaked inside aborted tokio tasks when the 60-second shutdown timeout fires, preventing parity-db from joining its background I/O threads and flushing the write-ahead log. On next startup, the stale WAL causes parity-db to report a "Highest known block" thousands of blocks behind the actual chain tip, forcing a full re-sync from that point.

* fix: ensure parity-db WAL is drained on SIGTERM shutdown Explicitly hold and drop the Substrate database backend outside the tokio runtime, ensuring parity-db's Drop impl runs after all async tasks have been stopped. This mirrors the pattern already used for the midnight ledger storage (drop_all_default_storage). Without this fix, the backend's Arc<Backend> may be leaked inside aborted tokio tasks when the 60-second shutdown timeout fires, preventing parity-db from joining its background I/O threads and flushing the write-ahead log. On next startup, the stale WAL causes parity-db to report a "Highest known block" thousands of blocks behind the actual chain tip, forcing a full re-sync from that point. * chore: add PR link to change file * chore: cargo fmt * fix: clone backend before MmrGadget move Signed-off-by: Mike Clay <mike.clay@shielded.io>

skylar-simoncelli requested a review from a team as a code owner April 1, 2026 02:02

skylar-simoncelli added the skip-changes-check-jira label Apr 1, 2026

skylar-simoncelli and others added 2 commits March 31, 2026 20:02

chore: add PR link to change file

a81643b

Merge branch 'main' into skylar/fix-paritydb-shutdown-data-loss

f236c3a

gilescope approved these changes Apr 1, 2026

View reviewed changes

skylar-simoncelli and others added 3 commits April 1, 2026 10:21

Merge branch 'main' into skylar/fix-paritydb-shutdown-data-loss

c6ef66c

chore: cargo fmt

e0a8935

fix: clone backend before MmrGadget move

fc6d96d

skylar-simoncelli enabled auto-merge April 1, 2026 17:05

skylar-simoncelli added this pull request to the merge queue Apr 1, 2026

skylar-simoncelli removed this pull request from the merge queue due to a manual request Apr 1, 2026

skylar-simoncelli added this pull request to the merge queue Apr 1, 2026

Merged via the queue into main with commit 13d466c Apr 1, 2026
31 checks passed

skylar-simoncelli deleted the skylar/fix-paritydb-shutdown-data-loss branch April 1, 2026 18:44

skylar-simoncelli mentioned this pull request Apr 1, 2026

fix: coalesce ledger WAL flushes and extend to non-authority nodes #1218

Open

7 tasks

This was referenced Apr 8, 2026

Cross-ref: PM-21032 ParityDb ledger storage backend flush #1180

Closed

[Node] potential stall during shutdown + other actions #1145

Open

gilescope added this to the node-1.0.0 milestone Apr 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: ensure parity-db WAL is drained on SIGTERM shutdown#1140

fix: ensure parity-db WAL is drained on SIGTERM shutdown#1140
skylar-simoncelli merged 6 commits into
mainfrom
skylar/fix-paritydb-shutdown-data-loss

skylar-simoncelli commented Apr 1, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

skylar-simoncelli commented Apr 1, 2026

Overview

Observed failures (guardnet, 2026-03-31)

Key observations

Technical analysis

parity-db's async commit pipeline

The WAL replay failure on startup

The shutdown race

Configuration verification

🗹 TODO before merging

📌 Submission Checklist

🧪 Testing Evidence

🔱 Fork Strategy

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants