
fix(state.db): recover from malformed sqlite_master so hidden sessions reappear #43149

Merged: OutThisLife merged 2 commits into main from bb/state-db-fts-repair on Jun 9, 2026

Conversation

@OutThisLife

Collaborator

Summary

Fixes the corruption class behind "Desktop/Dashboard show no sessions while hundreds of session files sit on disk". The backend logs:

sqlite3.DatabaseError: malformed database schema (messages_fts) - table messages_fts already exists

This is a malformed sqlite_master (typically a duplicate object row — two CREATE VIRTUAL TABLE messages_fts entries), which is a worse class than a malformed FTS inverted index. SQLite parses the entire schema while preparing the first statement on a connection, so on this class every statement fails before it runs — reproduced:

[PRAGMA journal_mode]       FAIL -> malformed database schema...
[PRAGMA integrity_check]    FAIL -> malformed database schema...
[SELECT FROM sessions]      FAIL -> malformed database schema...
[DROP TABLE messages_fts]   FAIL -> malformed database schema...
[PRAGMA writable_schema=ON] OK     <- only this survives

Crucially the error fires in apply_wal_with_fallback() (hermes_state.py), before _init_schema() runs — so a plain FTS-rebuild at the schema-init layer can neither reach nor fix it. The canonical sessions/messages rows are intact; only the derived schema is broken.
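
The failure mode described above can be reproduced in miniature with Python's standard sqlite3 module. This is a minimal sketch, not the PR's test code: it duplicates the schema row of an ordinary table (`search_index`, a hypothetical stand-in for `messages_fts`, chosen so the example does not depend on FTS5 being compiled in) and then probes each statement on a fresh connection. The helper names `make_malformed_db` and `probe` are illustrative assumptions. It also assumes the SQLite build does not enable defensive mode, which would block direct writes to sqlite_master.

```python
import os
import sqlite3
import tempfile


def make_malformed_db(path):
    """Create a DB whose sqlite_master holds two rows for the same table."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE search_index (body TEXT)")  # stand-in for messages_fts
    conn.execute("INSERT INTO sessions (title) VALUES ('first session')")
    conn.execute("PRAGMA writable_schema=ON")
    # Copy the schema row verbatim, producing a duplicate object definition.
    conn.execute(
        "INSERT INTO sqlite_master (type, name, tbl_name, rootpage, sql) "
        "SELECT type, name, tbl_name, rootpage, sql FROM sqlite_master "
        "WHERE name = 'search_index'"
    )
    conn.close()


def probe(path, statements):
    """Run each statement on its own fresh connection and record the outcome."""
    results = {}
    for sql in statements:
        conn = sqlite3.connect(path)
        try:
            conn.execute(sql).fetchall()
            results[sql] = "OK"
        except sqlite3.DatabaseError as exc:
            results[sql] = f"FAIL -> {exc}"
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    db = os.path.join(tempfile.mkdtemp(), "demo.db")
    make_malformed_db(db)
    checks = [
        "PRAGMA journal_mode",
        "PRAGMA integrity_check",
        "SELECT count(*) FROM sessions",
        "PRAGMA writable_schema=ON",
    ]
    for sql, outcome in probe(db, checks).items():
        print(f"[{sql}] {outcome}")
```

Each probe uses a fresh connection on purpose: once `PRAGMA writable_schema=ON` has run on a connection, SQLite tolerates schema errors there, which would mask the failures.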

Changes

  • hermes_state.repair_state_db_schema() backs up the raw file first, then walks a least-destructive ladder:

    1. de-duplicate sqlite_master, keeping the lowest rowid per object (this preserves the existing FTS index);
    2. if that is not enough, drop every messages_fts* schema object, VACUUM, and let the next open rebuild the FTS index from messages.

    sessions/messages are never modified. Also adds is_malformed_db_error() to identify this corruption class.

  • SessionDB.__init__ auto-heals — on a malformed-schema open error it repairs once (process-guarded against loops / concurrent web_server opens) and reopens, so Desktop/Dashboard recover on their own instead of silently showing "no sessions".

  • hermes doctor --fix — detects the malformed class and repairs it, reporting recovered session count + backup name.

  • hermes sessions repair [--check-only] [--no-backup] — operates on the raw file path, since SessionDB() itself cannot open a malformed DB.
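
To make step 1 of the ladder concrete, here is a minimal sketch of a de-duplication pass. It is an illustration under stated assumptions, not the body of repair_state_db_schema(): the function name `dedup_schema_sketch`, the `backup_path` parameter, and the choice to group by `(type, name)` are all hypothetical. Step 2 (dropping the messages_fts* objects and running VACUUM) is not covered.

```python
import shutil
import sqlite3


def dedup_schema_sketch(db_path, backup_path=None):
    """Delete duplicate sqlite_master rows, keeping the lowest rowid per object.

    Returns the number of schema rows removed. Only sqlite_master is touched;
    table contents are left alone.
    """
    if backup_path is not None:
        # Copy the raw file before any surgery. A real tool would also need
        # to consider -wal / -shm sidecar files; this sketch ignores them.
        shutil.copy2(db_path, backup_path)

    conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    try:
        # With writable_schema on, SQLite tolerates the schema error and
        # allows direct edits to sqlite_master.
        conn.execute("PRAGMA writable_schema=ON")
        cur = conn.execute(
            "DELETE FROM sqlite_master WHERE rowid NOT IN ("
            "  SELECT MIN(rowid) FROM sqlite_master GROUP BY type, name)"
        )
        removed = cur.rowcount
        conn.execute("PRAGMA writable_schema=OFF")
    finally:
        conn.close()

    # Confirm that a fresh connection can parse the schema again.
    # This raises sqlite3.DatabaseError if the schema is still malformed.
    fresh = sqlite3.connect(db_path)
    try:
        fresh.execute("SELECT count(*) FROM sqlite_master").fetchall()
    finally:
        fresh.close()
    return removed
```

Keeping the lowest rowid means the original definition survives and only the later duplicate is discarded, which is why the existing index data remains usable after this step.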

Supersedes

Supersedes #32589 and #33869. Both targeted FTS corruption but gated their repair behind statements (integrity_check / SELECT / DROP TABLE) that themselves fail on this class, and neither addressed the open-time (apply_wal_with_fallback) failure. Credit preserved via Co-authored-by.

Closes #33865.

Test plan

  • New tests/test_state_db_malformed_repair.py (7 tests): documents that every statement fails on this corruption; repair preserves sessions + messages; rebuilt index search works; SessionDB auto-heals on open; auto-heal attempted only once per process; clean-DB repair is a no-op.
  • Regression: tests/hermes_state/, tests/test_hermes_state*.py, tests/hermes_cli/test_doctor*.py, tests/hermes_cli/test_sessions_delete.py, tests/test_lazy_session_regressions.py — all green (400+ tests).
  • Manual E2E against a reproduced malformed DB:
    • hermes sessions repair --check-only → reports malformed
    • hermes sessions repair → backs up, strategy: dedup_schema, "1 sessions recovered"
    • hermes sessions list → recovered session listed
    • hermes doctor --fix → "Repaired state.db schema (N sessions recovered)"

…s reappear

The corruption class behind "Desktop/Dashboard show no sessions while
hundreds of session files sit on disk" is a malformed sqlite_master — most
often a duplicate object row, e.g. two CREATE VIRTUAL TABLE messages_fts
entries — surfacing as:

    sqlite3.DatabaseError: malformed database schema (messages_fts) -
    table messages_fts already exists

SQLite parses the whole schema while preparing the FIRST statement on a
connection, so on this class every statement fails before it runs: PRAGMA
journal_mode (which is where SessionDB.__init__ actually trips, in
apply_wal_with_fallback, BEFORE _init_schema), PRAGMA integrity_check, and
even DROP TABLE. The only operations that still work are
PRAGMA writable_schema=ON plus direct sqlite_master surgery. A plain
FTS-index rebuild at the _init_schema layer therefore cannot reach or fix
this; the canonical sessions/messages rows are intact — only the derived
schema is broken.
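
Because the error surfaces on whatever statement happens to run first, a caller has to recognize it by its message rather than by where it was raised. A minimal sketch of such a classifier follows. The real is_malformed_db_error() may match more or different conditions; the name `is_malformed_db_error_sketch` and the single marker string are assumptions made for illustration.

```python
import sqlite3

# Marker taken from the error text quoted above; matching on it is an
# assumption of this sketch, not a documented contract of the real helper.
_MALFORMED_SCHEMA_MARKER = "malformed database schema"


def is_malformed_db_error_sketch(exc):
    """Return True if exc looks like a malformed-sqlite_master error."""
    if not isinstance(exc, sqlite3.DatabaseError):
        return False
    return _MALFORMED_SCHEMA_MARKER in str(exc).lower()
```

Other sqlite3 errors, such as a locked database, fall through as False, so they are not mistaken for schema corruption.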

Add a dedicated recovery that operates where the failure actually happens:

- hermes_state.repair_state_db_schema(): backs up the raw file first, then a
  least-destructive ladder — (1) de-duplicate sqlite_master keeping the
  lowest rowid per object (preserves the existing FTS index), escalating to
  (2) drop every messages_fts* schema object + VACUUM and let the next open
  rebuild the FTS index from messages. sessions/messages are never modified.
  Plus is_malformed_db_error() to discriminate this class.
- SessionDB.__init__ auto-heals: on a malformed-schema open error it repairs
  once (process-guarded against loops / concurrent web_server opens) and
  reopens, so Desktop/Dashboard recover on their own instead of silently
  showing "no sessions".
- hermes doctor --fix detects the malformed class and repairs it (reporting
  the recovered session count + backup name).
- hermes sessions repair [--check-only] [--no-backup] runs on the raw file
  path, since SessionDB() itself cannot open a malformed DB.
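
The "repairs once, process-guarded" behaviour of the auto-heal can be sketched as follows. This is a hedged illustration, not the SessionDB.__init__ code: `open_with_auto_heal`, its `opener` and `repair` callables, and the module-level guard set are all hypothetical names introduced here.

```python
import sqlite3
import threading

# Paths for which a repair has already been attempted in this process.
_repair_attempted = set()
_repair_lock = threading.Lock()


def _looks_malformed(exc):
    return "malformed database schema" in str(exc).lower()


def open_with_auto_heal(path, opener, repair):
    """Open path via opener(path); on a malformed-schema error, repair at
    most once per process and retry the open a single time.

    opener and repair are caller-supplied callables (assumptions of this
    sketch). Any other error, or a second failure, propagates unchanged.
    """
    try:
        return opener(path)
    except sqlite3.DatabaseError as exc:
        if not _looks_malformed(exc):
            raise
    # The lock serializes concurrent openers; the set prevents repair loops.
    with _repair_lock:
        if path not in _repair_attempted:
            _repair_attempted.add(path)
            repair(path)
    # Single retry: if the DB is still broken, the error reaches the caller.
    return opener(path)
```

A second thread that hits the same error while a repair is in progress waits on the lock, skips the repair, and simply retries its open.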

Supersedes #32589 and #33869: both targeted FTS corruption but gated their
repair behind statements (integrity_check / SELECT / DROP TABLE) that
themselves fail on this class, and neither addressed the apply_wal_with_fallback
open-time failure. Credit preserved via Co-authored-by.

Closes #33865.

Co-authored-by: João Vitor Cunha <145560011+plcunha@users.noreply.github.com>
Co-authored-by: Tuna Dev <273476039+tuancookiez-hub@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

🔎 Lint report: bb/state-db-fts-repair vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10620 on HEAD, 10614 on base (🆕 +6)

🆕 New issues (23):

Rule                    Count
unresolved-attribute    21
invalid-argument-type   1
unresolved-import       1
First entries
tests/gateway/test_telegram_topic_mode.py:1333: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/test_hermes_state.py:3584: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
hermes_state.py:1169: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
tests/hermes_cli/test_web_server.py:662: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
hermes_cli/web_server.py:8254: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/gateway/test_telegram_topic_mode.py:1100: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
hermes_state.py:821: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `Connection`, found `None | Connection`
hermes_cli/main.py:11272: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
hermes_state.py:825: [unresolved-attribute] unresolved-attribute: Attribute `rollback` is not defined on `None` in union `None | Connection`
hermes_state.py:998: [unresolved-attribute] unresolved-attribute: Attribute `cursor` is not defined on `None` in union `None | Connection`
tests/hermes_state/test_resolve_resume_session_id.py:30: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/hermes_state/test_resolve_resume_session_id.py:34: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
tests/test_state_db_malformed_repair.py:19: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/hermes_cli/test_resolve_last_session.py:148: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
hermes_cli/cli_agent_setup_mixin.py:504: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/tools/test_session_search.py:459: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/agent/test_compression_concurrent_fork.py:85: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/hermes_cli/test_resolve_last_session.py:152: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
hermes_state.py:4562: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/tools/test_session_search.py:509: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
hermes_cli/cli_agent_setup_mixin.py:509: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `None | Connection`
tests/hermes_cli/test_web_server.py:659: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`
tests/test_state_db_malformed_repair.py:117: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `None | Connection`

✅ Fixed issues (20):

Rule                    Count
unresolved-attribute    19
invalid-argument-type   1
First entries
tests/hermes_state/test_resolve_resume_session_id.py:34: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
hermes_state.py:591: [unresolved-attribute] unresolved-attribute: Attribute `rollback` is not defined on `None` in union `Connection | None`
tests/hermes_cli/test_resolve_last_session.py:148: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
hermes_state.py:587: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `Connection`, found `Connection | None`
hermes_cli/cli_agent_setup_mixin.py:504: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/tools/test_session_search.py:459: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/agent/test_compression_concurrent_fork.py:85: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
hermes_state.py:4328: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/hermes_cli/test_web_server.py:659: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/hermes_cli/test_resolve_last_session.py:152: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
tests/gateway/test_telegram_topic_mode.py:1333: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/test_hermes_state.py:3584: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/tools/test_session_search.py:509: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
hermes_cli/web_server.py:8254: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
hermes_cli/cli_agent_setup_mixin.py:509: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
hermes_state.py:935: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
tests/hermes_cli/test_web_server.py:662: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
tests/hermes_state/test_resolve_resume_session_id.py:30: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/gateway/test_telegram_topic_mode.py:1100: [unresolved-attribute] unresolved-attribute: Attribute `commit` is not defined on `None` in union `Connection | None`
hermes_state.py:764: [unresolved-attribute] unresolved-attribute: Attribute `cursor` is not defined on `None` in union `Connection | None`

Unchanged: 5541 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@OutThisLife OutThisLife merged commit 218452b into main Jun 9, 2026
23 checks passed
@OutThisLife OutThisLife deleted the bb/state-db-fts-repair branch June 9, 2026 23:49
itskaism pushed a commit to itskaism/hermes-agent that referenced this pull request Jun 10, 2026
…s reappear (NousResearch#43149)

* fix(state.db): recover from malformed sqlite_master so hidden sessions reappear

* test(state.db): cover strat-B escalation + unrepairable safe-fail paths

---------

Co-authored-by: João Vitor Cunha <145560011+plcunha@users.noreply.github.com>
Co-authored-by: Tuna Dev <273476039+tuancookiez-hub@users.noreply.github.com>
(cherry picked from commit 218452b)
wachoo pushed a commit to wachoo/hermes-agent that referenced this pull request Jun 10, 2026
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

state.db FTS corruption goes undetected — no integrity check, no repair path

1 participant