fix(kanban): merge complete_task and recompute_ready into a single write txn#31891
Closed
steveonjava wants to merge 2 commits into
Closed
Conversation
…ite txn The two functions previously ran in separate IMMEDIATE transactions. The inter-txn gap is a window where a WAL auto-checkpoint can partially flush — transferring the tasks-table page to main-db while leaving idx_tasks_status pages in WAL. If the checkpoint is then interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status". Fix: add a `_within_txn=False` kwarg to recompute_ready and to the `_clear_failure_counter` helper; when True they skip their own write_txn wrapper and execute inline. complete_task now invokes both with `_within_txn=True` inside its own write_txn, so the parent's status='done' UPDATE, the `completed` event, the failure-counter reset, and every child status='ready' UPDATE land in a single COMMIT. The checkpoint window closes. Stress reproducer in tests/stress/ asserts exactly ONE BEGIN IMMEDIATE for the merged txn and clean PRAGMA integrity_check across 50 sequential completions and across two concurrent connections. Cross-references: NousResearch#31208 (Bug E hardening — synchronous=FULL + secure_delete + cell_size_check, surfaces drift as loud rather than silent) and NousResearch#30908 (related corruption class triggered by EIO during checkpoint).
Add canonical commit email to AUTHOR_MAP so ci/contributor-check resolves @steveonjava for commits authored under the host gitconfig identity (Stephen Chin <steveonjava@gmail.com>).
26fac48 to
bcc2f37
Compare
11 tasks
This was referenced May 26, 2026
Closed
Contributor
Author
|
Closing for now. Branch is conflicting against significant upstream torn-write hardening that has landed since this PR was opened (6416dd5 secure_delete + cell_size_check + synchronous=FULL, 99c19eb post-commit page_count invariant, e83252d preserve original exception on rollback, c002668 grace period on crashed worker detect). The single-write-txn merge needs to be re-designed against those new invariants rather than mechanically rebased. Will revisit. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
This PR fixes an internal SQLite transaction boundary gap in the kanban scheduler (
kanban_db.py). The fix is a correctness improvement — it eliminates a checkpoint window whereidx_tasks_statuscould lag thetaskstable — and carries no security implications. Index drift of this class (losslessly repairable via REINDEX, non-exfiltrating, not crossing any process or OS boundary) falls under regular bug reporting per the project's security policy (SECURITY.md §3.2).Root Cause
Previously,
complete_task()andrecompute_ready()ran in separate IMMEDIATE transactions. The gap between their COMMITs is a window where a WAL auto-checkpoint can partially flush — transferring the tasks table page to main-db while leavingidx_tasks_statuspages in WAL. If the checkpoint is interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status".Solution
The fix adds a
_within_txnkwarg torecompute_ready()and_clear_failure_counter(). When True, they skip their ownwrite_txnwrapper and execute inline.complete_task()now invokes both with_within_txn=Trueinside its own write_txn, so the parent'sstatus='done'UPDATE, thecompletedevent, the failure-counter reset, and every childstatus='ready'UPDATE land in a single COMMIT. The checkpoint window closes.Related Issue
Fixes #30908 (index corruption after frequent gateway restarts). Cross-references #31208 (Bug E hardening with synchronous=FULL + secure_delete, which surfaces this drift as loud rather than silent) and #31731 (wal_autocheckpoint tuning — complementary optimization that reduces checkpoint frequency).
Type of Change
Changes Made
hermes_cli/kanban_db.py
recompute_ready()to accept_within_txnkwarg; when True, executes statements inline instead of opening a new transaction_clear_failure_counter()to accept_within_txnkwarg for the same purposecomplete_task()to call both functions with_within_txn=Trueinside the same write_txn, making the entire parent completion + child promotion atomictests/hermes_cli/test_kanban_db.py
tests/stress/test_recompute_ready_index_drift.py (new)
How to Test
Run the test suite via the CI hermetic environment:
For quick local verification:
Specific tests for this fix:
Checklist
pytest tests/ -qpassesNotes
Refreshed prior-art snapshot at packaging time. No new conflicting PRs found since intake. PR #31731 (checkpoint-frequency tuning) is complementary; this PR targets the transaction-boundary root cause while #31731 optimizes checkpoint frequency independently.