
feat: FTS corruption detection, auto-recovery, and repair command #33869

Closed
tuancookiez-hub wants to merge 1 commit into NousResearch:main from tuancookiez-hub:feat/fts-repair

Conversation

@tuancookiez-hub

Contributor

Summary

Adds FTS5 corruption detection and auto-recovery to state.db. When FTS indexes become corrupt (malformed), Hermes now self-heals on startup instead of silently breaking session_search and all FTS-backed features.

Closes #33865.

Background

Related reports: #5563, #30908, #23717, #30445 — all describe SQLite corruption in state.db or kanban.db caused by interrupted WAL checkpoints, concurrent process contention, or force-kills during active transactions.

Changes

hermes_state.py

  • _init_schema(): Now catches sqlite3.DatabaseError (corrupt FTS) in addition to sqlite3.OperationalError (missing FTS). On either, drops and recreates FTS tables, then backfills from messages.

  • _drop_fts() / _drop_fts_trigram(): Static helpers that cleanly drop FTS virtual tables and their triggers. Shared by _init_schema() and rebuild_fts().

  • rebuild_fts(): Public method that drops, recreates, and backfills both FTS indexes. Returns (fts_count, trigram_count). Used by hermes sessions repair and hermes doctor --fix.

  • fts_integrity_check(): Returns a dict comparing each FTS index's rowcount against the messages table rowcount. Detects both corruption (DatabaseError) and index drift (count mismatch).

  • integrity_check(): Wraps PRAGMA integrity_check, returns list of issues.
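
The drop, recreate, and backfill flow described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the one-column messages schema, the trigger name messages_fts_insert, and the free functions (instead of SessionDB methods) are assumptions made for the example, and only a single FTS5 index is shown rather than the FTS plus trigram pair.

```python
import sqlite3


def drop_fts(conn):
    """Drop the FTS table and its sync trigger, if present."""
    conn.execute("DROP TRIGGER IF EXISTS messages_fts_insert")
    conn.execute("DROP TABLE IF EXISTS messages_fts")


def create_fts(conn):
    """Create an empty FTS5 index plus a trigger that keeps it in sync on insert."""
    conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(content)")
    conn.execute(
        "CREATE TRIGGER messages_fts_insert AFTER INSERT ON messages BEGIN "
        "INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content); "
        "END"
    )


def rebuild_fts(conn):
    """Drop, recreate, and backfill the index; return the indexed row count."""
    drop_fts(conn)
    create_fts(conn)
    conn.execute(
        "INSERT INTO messages_fts(rowid, content) SELECT id, content FROM messages"
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM messages_fts").fetchone()[0]


def init_schema(conn):
    """Create base tables, then rebuild the FTS index if it is missing or unreadable."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, content TEXT)"
    )
    try:
        # Probe the index with a cheap read.
        conn.execute("SELECT COUNT(*) FROM messages_fts").fetchone()
    except sqlite3.DatabaseError:
        # sqlite3.OperationalError (e.g. "no such table") is a subclass of
        # sqlite3.DatabaseError, so this one clause covers both cases.
        rebuild_fts(conn)
```

Because OperationalError derives from DatabaseError in Python's sqlite3 module, catching the parent class handles the missing-index and corrupt-index paths together. The sketch assumes the DROP statements themselves still succeed, which holds only when the schema can be parsed.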

hermes_cli/doctor.py

The state.db check now:

  1. Runs PRAGMA integrity_check (catches B-tree corruption, page errors)
  2. Validates FTS rowcount matches messages table (catches index drift)
  3. Reports specific errors instead of generic "has issues"
  4. hermes doctor --fix can auto-rebuild corrupt FTS indexes
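
As a rough illustration of checks 1 and 2, the sketch below wraps PRAGMA integrity_check and compares rowcounts. The function names mirror the ones listed for hermes_state.py, but the signatures, the dict keys other than fts_ok, and the single-index scope are assumptions for this example, not the PR's actual API.

```python
import sqlite3


def integrity_check(conn):
    """Return the problems reported by PRAGMA integrity_check (empty if healthy)."""
    rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    return [] if rows == ["ok"] else rows


def fts_integrity_check(conn, fts_table="messages_fts"):
    """Compare the FTS rowcount with the messages rowcount.

    fts_table must be a trusted identifier; it is interpolated into the SQL.
    """
    result = {"fts_ok": False, "messages": 0, "fts_rows": None, "error": None}
    result["messages"] = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    try:
        result["fts_rows"] = conn.execute(
            f"SELECT COUNT(*) FROM {fts_table}"
        ).fetchone()[0]
    except sqlite3.DatabaseError as exc:
        # Missing or unreadable index: report the specific error text.
        result["error"] = str(exc)
        return result
    # Readable index: healthy only if the counts agree (no drift).
    result["fts_ok"] = result["fts_rows"] == result["messages"]
    return result
```

Returning the error text and both counts is what lets a caller report a specific problem (unreadable index versus drift) rather than a generic failure.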

hermes_cli/main.py

New hermes sessions repair subcommand:

  • Runs integrity check + FTS health validation
  • Auto-rebuilds corrupt FTS indexes
  • --check-only flag for read-only diagnostics
  • Handles both corruption (malformed) and drift (count mismatch)
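
One way the subcommand wiring and the --check-only behaviour could look is sketched below with argparse. The parser layout and the plan_repair() helper are hypothetical; the PR text names only the subcommand and the flag, not how hermes_cli/main.py is structured.

```python
import argparse


def build_parser():
    """Build a parser for: hermes sessions repair [--check-only]."""
    parser = argparse.ArgumentParser(prog="hermes")
    commands = parser.add_subparsers(dest="command", required=True)
    sessions = commands.add_parser("sessions")
    sessions_commands = sessions.add_subparsers(dest="sessions_command", required=True)
    repair = sessions_commands.add_parser("repair", help="check and repair state.db")
    repair.add_argument(
        "--check-only",
        action="store_true",
        help="report problems without modifying the database",
    )
    return parser


def plan_repair(health, check_only):
    """Decide the action from an fts_integrity_check()-style dict (hypothetical helper)."""
    if health["fts_ok"]:
        return "healthy"
    if check_only:
        return "report"  # read-only diagnostics: never write
    return "rebuild"  # corruption or drift: rebuild the index
```

For example, `build_parser().parse_args(["sessions", "repair", "--check-only"])` yields a namespace with `check_only` set to True, and `plan_repair({"fts_ok": False}, True)` returns "report".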

Testing

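python -c "
from hermes_state import SessionDB
import tempfile, os
# Point HERMES_HOME at a throwaway directory, intended to keep the real state.db untouched.
os.environ['HERMES_HOME'] = tempfile.mkdtemp()
db = SessionDB()
# Rebuild both FTS indexes; each returned count should equal the messages rowcount.
fts_count, tri_count = db.rebuild_fts()
msg_count = db.message_count()
assert fts_count == msg_count
assert tri_count == msg_count
# After the rebuild, both indexes should report healthy.
fts = db.fts_integrity_check()
assert fts['fts_ok'] and fts['trigram_ok']
print('All assertions passed')
db.close()
"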

Breaking Changes

None. All changes are additive. Existing behavior preserved for healthy databases.

Checklist

  • Bug fix (crash/data loss prevention)
  • Cross-platform (Windows + Linux + macOS — pure sqlite3, no platform-specific code)
  • No new dependencies
  • Backward compatible
  • Follows existing patterns (v11 migration FTS rebuild, _reconcile_columns() declarative approach)

- hermes_state.py: catch DatabaseError (corrupt FTS) in _init_schema(),
  auto-drop + recreate + backfill FTS on startup
- hermes_state.py: add rebuild_fts(), fts_integrity_check(), integrity_check()
- hermes_cli/doctor.py: PRAGMA integrity_check + FTS rowcount validation,
  --fix can auto-rebuild corrupt FTS
- hermes_cli/main.py: add 'hermes sessions repair' subcommand with
  --check-only support

Fixes NousResearch#5563. Related: NousResearch#30908, NousResearch#23717, NousResearch#30445.
@alt-glitch added the labels type/feature (New feature or request), P2 (Medium: degraded but workaround exists), comp/cli (CLI entry point, hermes_cli/, setup wizard), and comp/agent (Core agent loop, run_agent.py, prompt builder) on May 28, 2026
@alt-glitch

Collaborator

Addresses #33865. Note competing open PRs: #32589 (doctor FTS detection + repair) and #30514 (MATCH-based FTS corruption detection). This PR is broader — covers startup auto-recovery, hermes sessions repair subcommand, and doctor integration — but overlaps significantly with both.

OutThisLife added a commit that referenced this pull request Jun 9, 2026
…s reappear (#43149)

* fix(state.db): recover from malformed sqlite_master so hidden sessions reappear

The corruption class behind "Desktop/Dashboard show no sessions while
hundreds of session files sit on disk" is a malformed sqlite_master — most
often a duplicate object row, e.g. two CREATE VIRTUAL TABLE messages_fts
entries — surfacing as:

    sqlite3.DatabaseError: malformed database schema (messages_fts) -
    table messages_fts already exists

SQLite parses the whole schema while preparing the FIRST statement on a
connection, so on this class every statement fails before it runs: PRAGMA
journal_mode (which is where SessionDB.__init__ actually trips, in
apply_wal_with_fallback, BEFORE _init_schema), PRAGMA integrity_check, and
even DROP TABLE. The only operations that still work are
PRAGMA writable_schema=ON plus direct sqlite_master surgery. A plain
FTS-index rebuild at the _init_schema layer therefore cannot reach or fix
this; the canonical sessions/messages rows are intact — only the derived
schema is broken.

Add a dedicated recovery that operates where the failure actually happens:

- hermes_state.repair_state_db_schema(): backs up the raw file first, then a
  least-destructive ladder — (1) de-duplicate sqlite_master keeping the
  lowest rowid per object (preserves the existing FTS index), escalating to
  (2) drop every messages_fts* schema object + VACUUM and let the next open
  rebuild the FTS index from messages. sessions/messages are never modified.
  Plus is_malformed_db_error() to discriminate this class.
- SessionDB.__init__ auto-heals: on a malformed-schema open error it repairs
  once (process-guarded against loops / concurrent web_server opens) and
  reopens, so Desktop/Dashboard recover on their own instead of silently
  showing "no sessions".
- hermes doctor --fix detects the malformed class and repairs it (reporting
  the recovered session count + backup name).
- hermes sessions repair [--check-only] [--no-backup] runs on the raw file
  path, since SessionDB() itself cannot open a malformed DB.

Supersedes #32589 and #33869: both targeted FTS corruption but gated their
repair behind statements (integrity_check / SELECT / DROP TABLE) that
themselves fail on this class, and neither addressed the apply_wal_with_fallback
open-time failure. Credit preserved via Co-authored-by.

Closes #33865.

Co-authored-by: João Vitor Cunha <145560011+plcunha@users.noreply.github.com>
Co-authored-by: Tuna Dev <273476039+tuancookiez-hub@users.noreply.github.com>

* test(state.db): cover strat-B escalation + unrepairable safe-fail paths

---------

Co-authored-by: João Vitor Cunha <145560011+plcunha@users.noreply.github.com>
Co-authored-by: Tuna Dev <273476039+tuancookiez-hub@users.noreply.github.com>
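
The "repair once, then reopen" behaviour described in the commit message above might be sketched as follows. Only the name is_malformed_db_error() comes from the commit message; its body here, the open_with_repair() wrapper, the injected open_fn and repair_fn callables, and the module-level guard set are all assumptions for illustration.

```python
import sqlite3

# Paths for which a repair has already been attempted in this process.
# Assumption: a plain set; concurrent openers would also need a lock.
_REPAIR_ATTEMPTED = set()


def is_malformed_db_error(exc):
    """True for the malformed-schema class of sqlite3 errors."""
    return (
        isinstance(exc, sqlite3.DatabaseError)
        and "malformed database schema" in str(exc).lower()
    )


def open_with_repair(path, open_fn, repair_fn):
    """Open the database; on a malformed-schema error, repair once and reopen."""
    try:
        return open_fn(path)
    except sqlite3.DatabaseError as exc:
        # Re-raise anything that is not this class, or a repeat failure.
        if not is_malformed_db_error(exc) or path in _REPAIR_ATTEMPTED:
            raise
        _REPAIR_ATTEMPTED.add(path)
        repair_fn(path)  # expected to work on the raw file path
        return open_fn(path)  # a second failure propagates to the caller
```

Recording the path before calling the repair function means a repair that fails, or that does not fix the problem, is never retried in the same process, which avoids an open/repair loop.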
@teknium1

teknium1 commented Jun 9, 2026

Contributor

Closing as superseded by PR #43149.

This PR correctly identified that the FTS recovery has to happen before normal session search paths can work, but the merged fix goes lower-level: it repairs malformed sqlite_master / duplicate messages_fts schema rows before SessionDB() can open, then lets the derived FTS objects rebuild from canonical messages. PR #43149 includes you as a co-author for this repair direction. Thanks for the detailed write-up and implementation work.
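
To make the "duplicate schema rows" idea concrete, the sketch below shows the keep-the-lowest-rowid de-duplication on an ordinary stand-in table. It deliberately does not touch sqlite_master: the table name schema_rows and the helper dedupe_schema_rows() are hypothetical, and the merged fix is described as doing the equivalent surgery on sqlite_master under PRAGMA writable_schema=ON.

```python
import sqlite3

# Keep only the lowest rowid for each (type, name) pair.
DEDUPE_SQL = """
DELETE FROM schema_rows
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM schema_rows GROUP BY type, name
)
"""


def dedupe_schema_rows(conn):
    """Delete duplicate object rows from the stand-in table; return how many were removed."""
    cursor = conn.execute(DEDUPE_SQL)
    conn.commit()
    return cursor.rowcount
```

Keeping the earliest row per object matches the stated goal of preserving the existing FTS index, since the surviving definition is the one the index was originally built under.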

@teknium1 closed this on Jun 9, 2026
itskaism pushed a commit to itskaism/hermes-agent that referenced this pull request Jun 10, 2026
…s reappear (NousResearch#43149)

(cherry picked from commit 218452b)
wachoo pushed a commit to wachoo/hermes-agent that referenced this pull request Jun 10, 2026
…s reappear (NousResearch#43149)

changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…s reappear (NousResearch#43149)



Development

Successfully merging this pull request may close these issues.

state.db FTS corruption goes undetected — no integrity check, no repair path

3 participants