Skip to content

fix(kanban): merge complete_task and recompute_ready into a single write txn#31891

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-investigate-idx-tasks-status-torn-write
Closed

fix(kanban): merge complete_task and recompute_ready into a single write txn#31891
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-investigate-idx-tasks-status-torn-write

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR fixes an internal SQLite transaction boundary gap in the kanban scheduler (kanban_db.py). The fix is a correctness improvement — it eliminates a checkpoint window where idx_tasks_status could lag the tasks table — and carries no security implications. Index drift of this class (losslessly repairable via REINDEX, non-exfiltrating, not crossing any process or OS boundary) falls under regular bug reporting per the project's security policy (SECURITY.md §3.2).

Root Cause

Previously, complete_task() and recompute_ready() ran in separate IMMEDIATE transactions. The gap between their COMMITs is a window where a WAL auto-checkpoint can partially flush — transferring the tasks table page to main-db while leaving idx_tasks_status pages in WAL. If the checkpoint is interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status".

Solution

The fix adds a _within_txn kwarg to recompute_ready() and _clear_failure_counter(). When True, they skip their own write_txn wrapper and execute inline. complete_task() now invokes both with _within_txn=True inside its own write_txn, so the parent's status='done' UPDATE, the completed event, the failure-counter reset, and every child status='ready' UPDATE land in a single COMMIT. The checkpoint window closes.

Related Issue

Fixes #30908 (index corruption after frequent gateway restarts). Cross-references #31208 (Bug E hardening with synchronous=FULL + secure_delete, which surfaces this drift as loud rather than silent) and #31731 (wal_autocheckpoint tuning — complementary optimization that reduces checkpoint frequency).

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Tests added or updated

Changes Made

  • hermes_cli/kanban_db.py

    • Modified recompute_ready() to accept _within_txn kwarg; when True, executes statements inline instead of opening a new transaction
    • Modified _clear_failure_counter() to accept _within_txn kwarg for the same purpose
    • Updated complete_task() to call both functions with _within_txn=True inside the same write_txn, making the entire parent completion + child promotion atomic
  • tests/hermes_cli/test_kanban_db.py

    • Added unit tests verifying transaction boundary integrity after complete_task
    • Added test verifying exactly one BEGIN IMMEDIATE is used for the merged transaction
    • Added test verifying PRAGMA integrity_check finds no index drift
  • tests/stress/test_recompute_ready_index_drift.py (new)

    • Deterministic reproducer that triggers the index-drift condition reliably without the fix
    • Verifies clean integrity_check across 50 sequential parent completions
    • Verifies stability across two concurrent connections to the same database

How to Test

Run the test suite via the CI hermetic environment:

scripts/run_tests.sh

For quick local verification:

pytest tests/ -v

Specific tests for this fix:

pytest tests/stress/test_recompute_ready_index_drift.py -v
pytest tests/hermes_cli/test_kanban_db.py::test_complete_task_index_integrity_no_drift -v
pytest tests/hermes_cli/test_kanban_db.py::test_recompute_ready_within_single_txn -v
pytest tests/hermes_cli/test_kanban_db.py::test_two_connection_index_stability -v

Checklist

  • This is a bug fix
  • Tests have been added / updated
  • I've read the Contributing Guide
  • Conventional Commits format followed
  • pytest tests/ -q passes
  • Tests added for my changes

Notes

Refreshed prior-art snapshot at packaging time. No new conflicting PRs found since intake. PR #31731 (checkpoint-frequency tuning) is complementary; this PR targets the transaction-boundary root cause while #31731 optimizes checkpoint frequency independently.

@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists labels May 25, 2026
…ite txn

The two functions previously ran in separate IMMEDIATE transactions. The
inter-txn gap is a window where a WAL auto-checkpoint can partially flush —
transferring the tasks-table page to main-db while leaving idx_tasks_status
pages in WAL. If the checkpoint is then interrupted (SIGTERM, EIO, OS
buffer eviction), the index drifts from the table and surfaces on the next
connection as "row N missing from index idx_tasks_status".

Fix: add a `_within_txn=False` kwarg to recompute_ready and to the
`_clear_failure_counter` helper; when True they skip their own write_txn
wrapper and execute inline. complete_task now invokes both with
`_within_txn=True` inside its own write_txn, so the parent's status='done'
UPDATE, the `completed` event, the failure-counter reset, and every child
status='ready' UPDATE land in a single COMMIT. The checkpoint window
closes.

Stress reproducer in tests/stress/ asserts exactly ONE BEGIN IMMEDIATE for
the merged txn and clean PRAGMA integrity_check across 50 sequential
completions and across two concurrent connections.

Cross-references: NousResearch#31208 (Bug E hardening — synchronous=FULL +
secure_delete + cell_size_check, surfaces drift as loud rather than silent)
and NousResearch#30908 (related corruption class triggered by EIO during checkpoint).
Add canonical commit email to AUTHOR_MAP so ci/contributor-check resolves @steveonjava for commits authored under the host gitconfig identity (Stephen Chin <steveonjava@gmail.com>).
@steveonjava

Copy link
Copy Markdown
Contributor Author

Closing for now. Branch is conflicting against significant upstream torn-write hardening that has landed since this PR was opened (6416dd5 secure_delete + cell_size_check + synchronous=FULL, 99c19eb post-commit page_count invariant, e83252d preserve original exception on rollback, c002668 grace period on crashed worker detect). The single-write-txn merge needs to be re-designed against those new invariants rather than mechanically rebased. Will revisit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kanban.db index corruption after frequent gateway restarts — dispatcher disables board permanently

2 participants