Skip to content

fix: ensure parity-db WAL is drained on SIGTERM shutdown#1140

Merged
skylar-simoncelli merged 6 commits into
mainfrom
skylar/fix-paritydb-shutdown-data-loss
Apr 1, 2026
Merged

fix: ensure parity-db WAL is drained on SIGTERM shutdown#1140
skylar-simoncelli merged 6 commits into
mainfrom
skylar/fix-paritydb-shutdown-data-loss

Conversation

@skylar-simoncelli

Copy link
Copy Markdown
Contributor

Overview

Fix silent chain-state truncation that occurs after unclean node shutdowns. On restart, affected nodes report a "Highest known block" thousands of blocks behind the actual chain tip, forcing a slow re-sync from that point via the network.

Root cause: The Substrate database backend (Arc<sc_client_db::Backend>) is not explicitly dropped during shutdown. When the tokio runtime's 60-second timeout fires and aborts still-running tasks (AURA, GRANDPA, BEEFY, network sync), their Arc clones are leaked. This prevents parity-db's Drop impl from ever running, which means kill_logs() — the function that joins background I/O threads and drains the WAL pipeline — never executes. On the next startup, parity-db finds a partially-written WAL, discards all remaining log records due to a sequence validation failure, and the BEST_BLOCK metadata pointer reverts to whatever was last committed to the on-disk table files.

The fix mirrors the existing pattern used for the midnight ledger storage (drop_all_default_storage() at command.rs:277): stash an Arc<Backend> outside the tokio runtime closure, and explicitly drop it after run_node_until_exit returns.

Observed failures (guardnet, 2026-03-31)

We cycled all 15 midnight nodes on guardnet (graceful systemctl stop + terminate) and observed 2 out of 15 nodes restart with truncated chain state:

Node Type Pre-stop block Startup block Delta
krill bridge (archive) #272,185 #187,554 -84,631
mullet validator (pruned) #272,574 #187,554 -85,020

All other 13 nodes restarted at their pre-stop block height. On retesting krill (2 more cycles) and mullet (1 more cycle), all passed — the bug is not deterministic per-node.

A colleague observed the same symptom on a separate occasion (Highest known block at #187641), and another observed #15 (near-genesis) after a restart.

Key observations

  • Both failures landed on exactly fix: error when loading chain spec for non-dev cfg_preset #187,554 — same block on different nodes with different EBS volumes
  • Not node-type-specific: one bridge (archive pruning), one validator (pruned 256)
  • Not volume-specific: retests on the same nodes/volumes passed
  • Nodes download from peers, not local replay: after the truncation, the node syncs at ~17 bps via network (⬇ 19.9kiB/s), confirming the blocks above fix: error when loading chain spec for non-dev cfg_preset #187,554 are genuinely inaccessible in the local ParityDB
  • ParityDB stats confirmed: Column 0 on the failed node showed Total values: 187,561 — matching the truncation point exactly. The table files only contained metadata up to that block; everything above was only in the WAL

Technical analysis

parity-db's async commit pipeline

parity-db (0.4.13) uses a 4-stage async pipeline:

db.commit(tx) → commit_queue (memory) → log_worker → BufWriter (WAL file)
                                                           ↓
                                    flush_worker (fsync when file > 64MB)
                                                           ↓
                                    commit_worker (enact to table files)
                                                           ↓
                                    cleanup_worker (fsync tables, truncate WAL)

db.commit() returns immediately after pushing to the in-memory queue — the data is NOT on disk. The flush_worker only fsyncs the WAL file when it exceeds MIN_LOG_SIZE_BYTES = 64 MB. This means thousands of commits (including BEST_BLOCK metadata updates) can sit in an unfsynced BufWriter.

The WAL replay failure on startup

When parity-db opens, it calls replay_all_logs() which iterates WAL files and calls enact_logs(validation_mode=true). The critical check (db.rs:591):

if reader.record_id() != self.last_enacted.load(Ordering::Relaxed) + 1 {
    self.log.clear_replay_logs();  // silently discards ALL remaining WAL records
    return Ok(false)
}

If any WAL record has a sequence gap or CRC mismatch (e.g., from a partially-written BufWriter that was never fsynced), all remaining WAL records are silently discarded — even valid ones after the gap. clear_replay_logs() moves them to the cleanup queue where they are truncated to zero.

The shutdown race

Substrate's shutdown sequence on SIGTERM:

SIGTERM → tokio signal resolves → TaskManager dropped → 60s timeout → process exits

The TaskManager drop sends an exit signal to all spawned tasks. Tasks get 60 seconds to finish. After the timeout, tokio aborts remaining tasks, leaking any Arc references they hold.

The Substrate Backend (containing the parity-db handle) is held via Arc<Backend> by multiple spawned tasks (AURA, GRANDPA, BEEFY, network sync). If any task does not exit within 60 seconds, its Arc<Backend> clone is leaked, the Arc refcount never reaches zero, and Db::drop() never runs.

The ledger DB does not have this problem because command.rs:277 already calls drop_all_default_storage() explicitly after the runtime shuts down.

Configuration verification

Both sync_wal and sync_data are at their defaults (true) — neither midnight-node nor Substrate overrides them. The issue is not about fsync settings but about the Drop impl never executing.

🗹 TODO before merging

  • Ready

📌 Submission Checklist

  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason:
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • Updated AGENTS.md if build commands, architecture, or workflows changed
  • No new todos introduced

🧪 Testing Evidence

Tested on guardnet environment (15 EC2 nodes, AWS account 337112168969):

  • Cycled all 15 nodes with graceful systemctl stop + instance termination
  • 13/15 passed (startup block = pre-stop block), 2/15 failed (startup block = fix: error when loading chain spec for non-dev cfg_preset #187,554)
  • Retested both failed nodes 2-3 more times after they re-synced — all passed
  • Root cause confirmed via ParityDB stats, WAL file inspection, and Substrate source code analysis

🔱 Fork Strategy

  • Node Client Update

Explicitly hold and drop the Substrate database backend outside the
tokio runtime, ensuring parity-db's Drop impl runs after all async
tasks have been stopped.  This mirrors the pattern already used for
the midnight ledger storage (drop_all_default_storage).

Without this fix, the backend's Arc<Backend> may be leaked inside
aborted tokio tasks when the 60-second shutdown timeout fires,
preventing parity-db from joining its background I/O threads and
flushing the write-ahead log.  On next startup, the stale WAL causes
parity-db to report a "Highest known block" thousands of blocks
behind the actual chain tip, forcing a full re-sync from that point.
@skylar-simoncelli skylar-simoncelli added this pull request to the merge queue Apr 1, 2026
@skylar-simoncelli skylar-simoncelli removed this pull request from the merge queue due to a manual request Apr 1, 2026
@skylar-simoncelli skylar-simoncelli added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 13d466c Apr 1, 2026
31 checks passed
@skylar-simoncelli skylar-simoncelli deleted the skylar/fix-paritydb-shutdown-data-loss branch April 1, 2026 18:44
@gilescope gilescope added this to the node-1.0.0 milestone Apr 10, 2026
m2ux added a commit that referenced this pull request Apr 23, 2026
* fix: ensure parity-db WAL is drained on SIGTERM shutdown

Explicitly hold and drop the Substrate database backend outside the
tokio runtime, ensuring parity-db's Drop impl runs after all async
tasks have been stopped.  This mirrors the pattern already used for
the midnight ledger storage (drop_all_default_storage).

Without this fix, the backend's Arc<Backend> may be leaked inside
aborted tokio tasks when the 60-second shutdown timeout fires,
preventing parity-db from joining its background I/O threads and
flushing the write-ahead log.  On next startup, the stale WAL causes
parity-db to report a "Highest known block" thousands of blocks
behind the actual chain tip, forcing a full re-sync from that point.

* chore: add PR link to change file

* chore: cargo fmt

* fix: clone backend before MmrGadget move
Signed-off-by: Mike Clay <mike.clay@shielded.io>
m2ux added a commit that referenced this pull request Apr 23, 2026
* fix: ensure parity-db WAL is drained on SIGTERM shutdown

Explicitly hold and drop the Substrate database backend outside the
tokio runtime, ensuring parity-db's Drop impl runs after all async
tasks have been stopped.  This mirrors the pattern already used for
the midnight ledger storage (drop_all_default_storage).

Without this fix, the backend's Arc<Backend> may be leaked inside
aborted tokio tasks when the 60-second shutdown timeout fires,
preventing parity-db from joining its background I/O threads and
flushing the write-ahead log.  On next startup, the stale WAL causes
parity-db to report a "Highest known block" thousands of blocks
behind the actual chain tip, forcing a full re-sync from that point.

* chore: add PR link to change file

* chore: cargo fmt

* fix: clone backend before MmrGadget move
Signed-off-by: Mike Clay <mike.clay@shielded.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants