fix: gate startup backfills with migration state #380

Merged
jalehman merged 1 commit into main from carnap-be042a6b-impl-startup-backfill-gate on Apr 11, 2026

@jalehman

Contributor

What

This PR adds an additive startup backfill state table to the SQLite migration path. The expensive summary and tool-call backfills stop rerunning after a successful startup, and they still retry cleanly if startup fails before the backfill version is marked complete.

Why

Lossless-claw was rerunning heavy startup backfills on every launch. That added avoidable startup work, and the migration path still lacked an explicit completion signal for partial-upgrade recovery.

Changes

  • Add lcm_migration_state for versioned backfill completion
  • Gate startup backfills by step name and version
  • Wrap backfills plus state writes in savepoints
  • Add repeat-startup and retry-safety migration tests
  • Add patch changeset for the runtime fix
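
The gating pattern listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only the table name lcm_migration_state comes from the description. The Db interface, the column names, the savepoint name, and the runBackfillOnce helper are all hypothetical stand-ins for whatever SQLite wrapper the project actually uses.

```typescript
// Hypothetical minimal database interface (assumption: the real wrapper differs).
interface Db {
  exec(sql: string): void;
  get(sql: string, params: unknown[]): Record<string, unknown> | undefined;
  run(sql: string, params: unknown[]): void;
}

// Assumed schema: only the table name is taken from the PR description.
const CREATE_STATE_TABLE = `
  CREATE TABLE IF NOT EXISTS lcm_migration_state (
    step_name TEXT NOT NULL,
    algorithm_version INTEGER NOT NULL,
    completed_at TEXT NOT NULL DEFAULT (datetime('now')),
    PRIMARY KEY (step_name, algorithm_version)
  )`;

// Runs `backfill` at most once per (stepName, version).
// Returns true if the backfill ran, false if it was skipped.
function runBackfillOnce(
  db: Db,
  stepName: string,
  version: number,
  backfill: (db: Db) => void,
): boolean {
  db.exec(CREATE_STATE_TABLE);

  const done = db.get(
    "SELECT 1 AS done FROM lcm_migration_state WHERE step_name = ? AND algorithm_version = ?",
    [stepName, version],
  );
  if (done !== undefined) return false; // this exact version already completed

  // The backfill writes and the completion marker share one savepoint,
  // so they either both persist or both roll back.
  db.exec("SAVEPOINT lcm_backfill");
  try {
    backfill(db);
    db.run(
      "INSERT INTO lcm_migration_state (step_name, algorithm_version) VALUES (?, ?)",
      [stepName, version],
    );
    db.exec("RELEASE SAVEPOINT lcm_backfill");
    return true;
  } catch (err) {
    // ROLLBACK TO undoes the partial writes but leaves the savepoint on the
    // stack, so it is released afterwards before rethrowing.
    db.exec("ROLLBACK TO SAVEPOINT lcm_backfill");
    db.exec("RELEASE SAVEPOINT lcm_backfill");
    throw err;
  }
}
```

Because the completion row is written inside the same savepoint as the backfill, a failure partway through leaves no marker behind, and the next startup sees the step as not yet done and reruns it. Bumping the version number for a step would make the lookup miss and trigger a fresh run under the new algorithm.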

Testing

  • npm test -- test/migration.test.ts
  • Expect test/migration.test.ts to pass

Add versioned startup backfill state so the expensive summary and tool-call
repairs only run once per algorithm version. Keep retry safety by wrapping each
versioned backfill and its completion marker in a savepoint so a failed startup
rolls back partial backfill writes and reruns cleanly on the next launch.

Regeneration-Prompt: |
  Implement the startup backfill gating change in lossless-claw without using
  PRAGMA user_version or column-existence guesses as the completion signal.
  Add an additive SQLite table keyed by backfill step name and algorithm
  version, and only skip a backfill after that exact version completes.
  Preserve partial-upgrade safety by making the backfill work and state write
  succeed or roll back together, then cover first-run state creation, repeat
  startup skipping, and retry-after-failure behavior in migration tests. This
  runtime change affects package behavior, so include a patch changeset.
@rmarr

rmarr commented Apr 11, 2026


Confirming this fixes a real-world crash loop documented in #383.

Environment: lossless-claw v0.8.0, OpenClaw 2026.4.9, macOS arm64, ~22K messages / 510 summaries / 176MB database.

Symptoms: Every prompt triggered a 1–2 minute hang followed by database is locked → health check timeout → gateway restart. The unconditional backfillSummaryDepths and backfillSummaryMetadata DAG walks were blocking the event loop past the 10s health check threshold on every startup cycle.

Validation: A simpler version of this fix (early-return guards checking COUNT(*) WHERE earliest_at IS NULL and COUNT(*) WHERE kind = 'condensed' AND depth = 0) immediately resolved the crash loop — chat.send latency dropped from timeout to ~52ms. The lcm_migration_state table approach in this PR is a cleaner long-term solution with proper retry safety.
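
For comparison, the simpler early-return guards described in this comment might look something like the sketch below. The two WHERE clauses are taken from the comment; everything else is an assumption, including the table name (passed in as a parameter because the comment does not name it), the CountDb interface, the needsSummaryBackfill helper, and the mapping of each guard to a particular backfill.

```typescript
// Hypothetical interface: returns the single number produced by a COUNT(*) query.
interface CountDb {
  count(sql: string): number;
}

// Decides whether each summary backfill still has work to do.
// `table` must be a trusted identifier; it is interpolated into the SQL.
function needsSummaryBackfill(
  db: CountDb,
  table: string,
): { metadata: boolean; depths: boolean } {
  // Assumption: rows missing earliest_at still need the metadata backfill.
  const missingMetadata = db.count(
    `SELECT COUNT(*) FROM ${table} WHERE earliest_at IS NULL`,
  );
  // Assumption: condensed rows still at depth 0 need the depth backfill.
  const unsetDepths = db.count(
    `SELECT COUNT(*) FROM ${table} WHERE kind = 'condensed' AND depth = 0`,
  );
  return { metadata: missingMetadata > 0, depths: unsetDepths > 0 };
}
```

Guards like these infer completion from the data itself, so they skip the DAG walks once nothing is left to repair, but they cannot distinguish a finished backfill from one that was interrupted after fixing every matching row, nor force a rerun when the algorithm changes. That is the gap the versioned state table is meant to close.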

Looking forward to this shipping so I can drop the local patch. 🦞


2 participants