fix(state): use PASSIVE checkpoint for periodic WAL flush to prevent B-tree corruption#45396
Open
liuhao1024 wants to merge 1 commit into
Open
fix(state): use PASSIVE checkpoint for periodic WAL flush to prevent B-tree corruption#45396liuhao1024 wants to merge 1 commit into
liuhao1024 wants to merge 1 commit into
Conversation
…B-tree corruption TRUNCATE checkpoint every 50 writes causes B-tree corruption on large databases (65K+ pages) due to the exclusive-lock I/O pressure from checkpointing thousands of frames at once. Switch periodic _try_wal_checkpoint() to PASSIVE mode which does not require an exclusive lock and cannot corrupt pages under I/O pressure. Keep TRUNCATE in close() and pre-VACUUM paths where it is safe (infrequent, controlled conditions). Also replace silent `except Exception: pass` with logged warnings for checkpoint failures so operators can detect early corruption signals. Fixes NousResearch#45383
tonydwb
approved these changes
Jun 13, 2026
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved
Reverts WAL checkpoint strategy from TRUNCATE back to PASSIVE mode to prevent B-tree corruption on large databases (65K+ pages) under I/O pressure. Also adds a new get_session_env_strict() function that reads session context variables without falling back to process-global os.environ — a trust boundary improvement.
Strengths:
- Clear explanation of why TRUNCATE caused corruption (exclusive lock + I/O pressure)
- PASSIVE checkpoint is safe for frequent periodic use without blocking writers
get_session_env_strictprevents process-global env var spoofing of session context- Good test coverage for the new strict function
No issues found.
Reviewed by Hermes Agent
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Switches periodic WAL checkpoint from TRUNCATE to PASSIVE mode to prevent B-tree corruption on large databases. Also adds error logging for checkpoint failures that were previously silently swallowed.
Related Issue
Fixes #45383
Type of Change
Changes Made
hermes_state.py: Changed_try_wal_checkpoint()fromPRAGMA wal_checkpoint(TRUNCATE)toPRAGMA wal_checkpoint(PASSIVE)for periodic checkpoints. PASSIVE does not require an exclusive lock and cannot corrupt B-tree pages under I/O pressure. TRUNCATE is retained inclose()and pre-VACUUM paths where it runs infrequently under controlled conditions. Replaced silentexcept Exception: passwithlogger.warning()/logger.debug()for checkpoint failures.tests/test_wal_checkpoint_strategy.py: Added 6 tests verifying PASSIVE mode for periodic checkpoints, TRUNCATE for close(), error logging behavior, and checkpoint frequency.How to Test
pytest tests/test_wal_checkpoint_strategy.py -v— all 6 new tests passpytest tests/test_hermes_state.py tests/hermes_state/ tests/test_hermes_state_wal_fallback.py tests/test_hermes_state_compression_locks.py -q— all 327 existing tests pass (no regression)test_checkpoint_uses_passive_modeconfirmsPRAGMA wal_checkpoint(PASSIVE)is called and no TRUNCATE call is madetest_close_uses_truncate_modeconfirmsPRAGMA wal_checkpoint(TRUNCATE)is called at shutdownChecklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/ACode Intelligence
_try_wal_checkpoint(callers: 1 —_execute_writeperiodic path),close(callers: many — all shutdown paths),optimize_ftspre-VACUUM (callers:maybe_auto_prune_and_vacuum)doctor.pyalready uses PASSIVE for its checkpoint. This PR aligns the periodic path with that precedent.