fix(kanban): merge complete_task and recompute_ready into a single write txn by steveonjava · Pull Request #31891 · NousResearch/hermes-agent

steveonjava · 2026-05-25T05:46:06Z

What does this PR do?

This PR fixes an internal SQLite transaction boundary gap in the kanban scheduler (kanban_db.py). The fix is a correctness improvement — it eliminates a checkpoint window where idx_tasks_status could lag the tasks table — and carries no security implications. Index drift of this class (losslessly repairable via REINDEX, non-exfiltrating, not crossing any process or OS boundary) falls under regular bug reporting per the project's security policy (SECURITY.md §3.2).

Root Cause

Previously, complete_task() and recompute_ready() ran in separate IMMEDIATE transactions. The gap between their COMMITs is a window where a WAL auto-checkpoint can partially flush — transferring the tasks table page to main-db while leaving idx_tasks_status pages in WAL. If the checkpoint is interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status".

Solution

The fix adds a _within_txn kwarg to recompute_ready() and _clear_failure_counter(). When True, they skip their own write_txn wrapper and execute inline. complete_task() now invokes both with _within_txn=True inside its own write_txn, so the parent's status='done' UPDATE, the completed event, the failure-counter reset, and every child status='ready' UPDATE land in a single COMMIT. The checkpoint window closes.
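The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in kanban_db.py: the `write_txn` context manager, the `_promote_children` helper, the `open_demo_db` fixture, and the single-parent `tasks` schema are all assumptions made for the example, and the failure-counter reset and `completed` event are omitted.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def write_txn(conn):
    # Hypothetical stand-in for the project's write_txn helper.
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def _promote_children(conn):
    # Assumed schema: a child becomes 'ready' once its single parent is 'done'.
    cur = conn.execute(
        "UPDATE tasks SET status = 'ready' "
        "WHERE status = 'todo' "
        "AND parent_id IN (SELECT id FROM tasks WHERE status = 'done')"
    )
    return cur.rowcount


def recompute_ready(conn, *, _within_txn=False):
    if _within_txn:
        # The caller already holds the write transaction: run inline.
        return _promote_children(conn)
    with write_txn(conn):
        return _promote_children(conn)


def complete_task(conn, task_id):
    # Parent completion and child promotion share one BEGIN IMMEDIATE ... COMMIT.
    with write_txn(conn):
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
        return recompute_ready(conn, _within_txn=True)


def open_demo_db():
    # isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, parent_id INTEGER, status TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_tasks_status ON tasks(status)")
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?)",
        [(1, None, "running"), (2, 1, "todo"), (3, 1, "todo")],
    )
    return conn
```

The flag is needed because SQLite rejects a BEGIN issued while a transaction is already open, so the inner function cannot simply open its own transaction when called from complete_task(). Standalone callers keep the default and still get their own transaction.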

Related Issue

Fixes #30908 (index corruption after frequent gateway restarts). Cross-references #31208 (Bug E hardening with synchronous=FULL + secure_delete, which makes this drift fail loudly rather than silently) and #31731 (wal_autocheckpoint tuning, a complementary optimization that reduces checkpoint frequency).

Type of Change

- Bug fix (non-breaking change which fixes an issue)
- Tests added or updated

Changes Made

hermes_cli/kanban_db.py
- Modified recompute_ready() to accept _within_txn kwarg; when True, executes statements inline instead of opening a new transaction
- Modified _clear_failure_counter() to accept _within_txn kwarg for the same purpose
- Updated complete_task() to call both functions with _within_txn=True inside the same write_txn, making the entire parent completion + child promotion atomic
tests/hermes_cli/test_kanban_db.py
- Added unit tests verifying transaction boundary integrity after complete_task
- Added test verifying exactly one BEGIN IMMEDIATE is used for the merged transaction
- Added test verifying PRAGMA integrity_check finds no index drift
tests/stress/test_recompute_ready_index_drift.py (new)
- Deterministic reproducer that triggers the index-drift condition reliably without the fix
- Verifies clean integrity_check across 50 sequential parent completions
- Verifies stability across two concurrent connections to the same database
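The two assertions named in the test list (exactly one BEGIN IMMEDIATE, and a clean integrity_check) could be checked along these lines. This is a sketch under assumptions, not the PR's test code: `count_begin_immediate`, `integrity_ok`, and `merged_completion` are hypothetical names, and the in-memory schema is a simplified stand-in for the real tasks table.

```python
import sqlite3


def count_begin_immediate(conn, operation):
    # Record every SQL statement the connection executes while operation() runs.
    seen = []
    conn.set_trace_callback(seen.append)
    try:
        operation()
    finally:
        conn.set_trace_callback(None)
    return sum(
        1 for sql in seen if sql.strip().upper().startswith("BEGIN IMMEDIATE")
    )


def integrity_ok(conn):
    # PRAGMA integrity_check returns a single row containing 'ok' when
    # no table/index inconsistencies are found.
    return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


# Simplified stand-in schema (assumption, not the real kanban schema).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_tasks_status ON tasks(status)")
conn.executemany("INSERT INTO tasks VALUES (?, ?)", [(1, "running"), (2, "todo")])


def merged_completion():
    # Stand-in for a merged complete_task: both UPDATEs inside one transaction.
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = 1")
    conn.execute("UPDATE tasks SET status = 'ready' WHERE id = 2")
    conn.execute("COMMIT")


begins = count_begin_immediate(conn, merged_completion)
```

In a real test the `operation` argument would be a call into complete_task() against a file-backed WAL database rather than this in-memory stand-in.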

How to Test

Run the test suite via the CI hermetic environment:

scripts/run_tests.sh

For quick local verification:

pytest tests/ -v

Specific tests for this fix:

pytest tests/stress/test_recompute_ready_index_drift.py -v
pytest tests/hermes_cli/test_kanban_db.py::test_complete_task_index_integrity_no_drift -v
pytest tests/hermes_cli/test_kanban_db.py::test_recompute_ready_within_single_txn -v
pytest tests/hermes_cli/test_kanban_db.py::test_two_connection_index_stability -v

Checklist

Notes

Refreshed prior-art snapshot at packaging time. No new conflicting PRs found since intake. PR #31731 (checkpoint-frequency tuning) is complementary; this PR targets the transaction-boundary root cause while #31731 optimizes checkpoint frequency independently.

…ite txn

The two functions previously ran in separate IMMEDIATE transactions. The inter-txn gap is a window where a WAL auto-checkpoint can partially flush, transferring the tasks-table page to main-db while leaving idx_tasks_status pages in WAL. If the checkpoint is then interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status".

Fix: add a `_within_txn=False` kwarg to recompute_ready and to the `_clear_failure_counter` helper; when True they skip their own write_txn wrapper and execute inline. complete_task now invokes both with `_within_txn=True` inside its own write_txn, so the parent's status='done' UPDATE, the `completed` event, the failure-counter reset, and every child status='ready' UPDATE land in a single COMMIT. The checkpoint window closes.

Stress reproducer in tests/stress/ asserts exactly ONE BEGIN IMMEDIATE for the merged txn and clean PRAGMA integrity_check across 50 sequential completions and across two concurrent connections.

Cross-references: NousResearch#31208 (Bug E hardening: synchronous=FULL + secure_delete + cell_size_check, surfaces drift as loud rather than silent) and NousResearch#30908 (related corruption class triggered by EIO during checkpoint).

@steveonjava

Add canonical commit email to AUTHOR_MAP so ci/contributor-check resolves @steveonjava for commits authored under the host gitconfig identity (Stephen Chin <steveonjava@gmail.com>).

steveonjava · 2026-05-28T16:55:11Z

Closing for now. The branch conflicts with significant upstream torn-write hardening that has landed since this PR was opened (6416dd5 secure_delete + cell_size_check + synchronous=FULL, 99c19eb post-commit page_count invariant, e83252d preserve original exception on rollback, c002668 grace period on crashed worker detect). The single-write-txn merge needs to be redesigned against those new invariants rather than mechanically rebased. Will revisit.

alt-glitch added the labels type/bug (Something isn't working), comp/cli (CLI entry point, hermes_cli/, setup wizard), and P2 (Medium: degraded but workaround exists) on May 25, 2026

steveonjava added 2 commits May 25, 2026 00:58

chore(release): map steveonjava@gmail.com to @steveonjava

bcc2f37


steveonjava force-pushed the feat/kanban-investigate-idx-tasks-status-torn-write branch from 26fac48 to bcc2f37 Compare May 25, 2026 07:59

steveonjava mentioned this pull request May 25, 2026

fix(kanban): hoist zombie reaper out of dispatch_once #32301

Closed

11 tasks

steveonjava marked this pull request as ready for review May 25, 2026 23:29

herrschmidt mentioned this pull request May 26, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

steveonjava closed this May 28, 2026

steveonjava wants to merge 2 commits into NousResearch:main from steveonjava:feat/kanban-investigate-idx-tasks-status-torn-write
