
Refactor runtime state into SQLite #78595

Open
steipete wants to merge 597 commits into main from exp-vfs

@steipete steipete commented May 6, 2026

Summary

This PR is the database-first runtime/state refactor for #78096. It moves OpenClaw away from scattered JSON, JSONL, sidecar SQLite, lock-file, pruning, and truncation-repair paths toward a typed SQLite storage model with one shared control-plane database and one data-plane database per agent.

Current rule of the road:

  • openclaw.json, plugin manifests, Git checkouts, explicit credential source files, and real user workspaces remain file-backed configuration or user content.
  • Runtime data, caches, ledgers, auth profiles, session rows, transcript events, plugin state, scheduler state, task state, and agent scratch/artifact state live in SQLite.
  • Legacy files are migration inputs, explicit debug/export output, or real workspace files. They are not compatibility stores that normal runtime code keeps reading and writing.
  • Runtime code must not write session JSON, transcript JSONL, auth-profile JSON, plugin-cache JSON, or file-backed ACPX/session stores anymore.

Refs #78096

Current State

Updated May 10, 2026.

The PR branch has the main architecture in place:

  • global state DB for gateway/control-plane data
  • per-agent DB for session/transcript/VFS/artifact data
  • generated Kysely types from SQL schemas
  • native node:sqlite runtime access
  • doctor/migrate as the legacy-file import boundary
  • SQLite session rows and transcript events
  • SQLite auth profiles and model catalog/cache state
  • SQLite plugin state/blob stores
  • SQLite ACPX session state
  • SQLite VFS, tool artifact, and run artifact stores
  • worker-prepared run plumbing with serializable storage boundaries

The active cleanup pass is now focused on deleting/refactoring stale file-era assumptions from tests and runtime seams. Recent cleanup removed or rewired tests that still pretended active sessions/transcripts/auth lived in JSON/JSONL files, and tightened docs so transcript JSONL is migration-only input, never a runtime locator or bridge.

This is not merge-ready until the latest cleanup is pushed and the current Crabbox/Testbox validation is green.

Reviewer Mental Model

Review this as a storage-boundary refactor, not as a single session bug fix.

  • Global database = control plane. ~/.openclaw/state/openclaw.sqlite owns gateway-wide registries, plugin state/blob rows, cron/task/flow state, queues, sandbox/browser/device/pairing data, migration bookkeeping, auth-profile KV rows, and shared caches.
  • Per-agent database = data plane. ~/.openclaw/agents/<agentId>/agent/openclaw-agent.sqlite owns sessions, transcripts, ACP/subagent rows, VFS rows, tool/run artifacts, and agent-local cache data for that one workspace.
  • Configuration remains files. openclaw.json, plugin manifests, Git checkouts, and real workspace files stay file-backed by design. Credential source files remain only where the user explicitly configured a file-backed secret source; auth profile runtime state itself is SQLite-backed.
  • Doctor/migrate is the compatibility boundary. Legacy JSON/JSONL/sidecar SQLite files are imported once, verified, recorded, and removed after successful migration. Runtime code should not silently import, repair, prune, or rewrite legacy session/cache/auth files while handling normal gateway traffic.
  • Kysely is the typed query layer. SQL files are the schema source of truth, generated Kysely types come from a real temporary SQLite database, and runtime uses OpenClaw's native node:sqlite dialect rather than a second SQLite runtime driver.
  • Worker/VFS readiness is part of the shape. Agent workers get serializable storage boundaries and open their own global/per-agent DB handles; gateway-owned streaming, cancellation, and delivery stay in the parent process.

Database Layout

Global State Database

~/.openclaw/state/openclaw.sqlite

Owns data that must be visible across the gateway and agents:

  • agent registry and per-agent database discovery
  • auth-profile secrets/state KV rows and SQLite-backed refresh coordination
  • plugin runtime state, plugin blobs, setup/migration bookkeeping, and installed-plugin indexes
  • device, node, pairing, sandbox, browser, and delivery queue state
  • cron job definitions, cron runtime state, and cron run history
  • task and Task Flow registries
  • commitments and small scoped key-value state
  • migration/backup metadata and shared cache rows

The global DB is intentionally not where large agent-local transcripts, VFS rows, or artifacts accumulate.

Per-Agent Database

~/.openclaw/agents/<agentId>/agent/openclaw-agent.sqlite

Owns state that belongs to one agent/workspace:

  • session entries and route metadata
  • transcript events, snapshots, and checkpoint metadata
  • ACP/subagent run metadata
  • SQLite-backed VFS rows for agent scratch/workspace data
  • tool artifacts and run artifacts
  • agent-local runtime/cache rows that can later become searchable or indexable

The global database registers where each agent database lives. The agent database owns the write-heavy per-agent lane, so transcripts and artifacts do not become a global gateway bottleneck.
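
As a rough illustration of the two-database layout above, the paths can be derived from a single OpenClaw home directory. This is a minimal sketch: `globalDbPath`, `agentDbPath`, and the agent-id pattern are hypothetical names and constraints chosen for the example, not helpers confirmed by this PR.

```typescript
// Hypothetical helpers; the PR text does not show the real resolver names.
// Assumed constraint for illustration only: agent ids are simple path-safe tokens.
const AGENT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;

// Control plane: one shared database under <home>/state.
function globalDbPath(openclawHome: string): string {
  return `${openclawHome}/state/openclaw.sqlite`;
}

// Data plane: one database per agent under <home>/agents/<agentId>/agent.
function agentDbPath(openclawHome: string, agentId: string): string {
  if (!AGENT_ID_PATTERN.test(agentId)) {
    throw new Error(`invalid agent id: ${JSON.stringify(agentId)}`);
  }
  return `${openclawHome}/agents/${agentId}/agent/openclaw-agent.sqlite`;
}
```

Rejecting ids that are not path-safe keeps a malformed id from resolving to a database outside the agents directory.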

Schema And Access

The database schemas are SQL-first and generated into TypeScript from real temporary SQLite databases:

  • src/state/openclaw-state-schema.sql -> src/state/openclaw-state-db.generated.d.ts and src/state/openclaw-state-schema.generated.ts
  • src/state/openclaw-agent-schema.sql -> src/state/openclaw-agent-db.generated.d.ts and src/state/openclaw-agent-schema.generated.ts
  • owner-local schemas, such as proxy capture, keep their own SQL and generated files near the owning store

Runtime code uses Kysely over OpenClaw's native node:sqlite dialect. better-sqlite3 is used only by kysely-codegen for dev-time introspection; it is not a runtime driver. Runtime stores use typed Kysely queries, transactions, unique keys, conflict handling, and explicit row patches instead of whole-file mutation.
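
The difference between explicit row patches and whole-file mutation can be sketched without any database dependency. The class below is an in-memory stand-in for a per-agent session table, not the PR's Kysely store; `SessionTable`, `SessionRow`, and the column names are assumptions made for the example.

```typescript
// In-memory stand-in for a per-agent session table; not the real store.
interface SessionRow {
  sessionKey: string;
  title: string;
  updatedAt: number;
}

class SessionTable {
  private rows = new Map<string, SessionRow>();

  // Insert the row, or patch only the supplied columns if the key already exists.
  // Other rows are never touched, unlike a whole-file rewrite.
  upsert(
    sessionKey: string,
    patch: Partial<Omit<SessionRow, "sessionKey">>,
    now: number,
  ): SessionRow {
    const existing = this.rows.get(sessionKey);
    const next: SessionRow = {
      sessionKey,
      title: patch.title ?? existing?.title ?? "",
      updatedAt: patch.updatedAt ?? now,
    };
    this.rows.set(sessionKey, next);
    return next;
  }

  get(sessionKey: string): SessionRow | undefined {
    return this.rows.get(sessionKey);
  }
}
```

In a real SQLite store the same shape would usually be an insert with conflict handling on the unique key, executed inside a transaction.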

Refactor Scope

This branch moves or removes the file-era runtime paths across the codebase:

  • Session indexes move from sessions.json to per-agent session_entries rows.
  • Transcript runtime reads/writes move from JSONL tail/read/append paths to SQLite transcript tables.
  • Runtime session identity becomes { agentId, sessionKey }, not a legacy storePath or transcript locator.
  • Auth profiles move from auth-profiles.json / auth-state.json into SQLite-backed state, with doctor-owned import for legacy files.
  • Gateway session history, reset, compaction, route updates, ACP metadata, heartbeat-isolated runs, fast abort, subagent spawning, TUI session state, and auto-reply session updates use SQLite row helpers.
  • Channel/plugin runtime APIs carry session identity and SQLite-backed metadata instead of requiring callers to pass session-store file paths.
  • Plugin SDK surfaces expose database-backed session and plugin-state helpers; legacy path helpers are narrowed to migration/export/debug boundaries.
  • Memory search session indexing now uses transcript terminology end to end; host exports list/build session transcript helpers, targeted sync queues sessionTranscripts, and QMD/builtin indexers no longer expose session-file helper names.
  • ACPX service state uses a SQLite/plugin-state-backed ACP session store instead of the upstream file-backed runtime store.
  • Plugin/runtime ledgers, installed-plugin indexes, task/flow state, cron state, commitments, delivery queues, sandbox/browser registries, pairing/device state, model catalog/cache state, and small plugin caches move into SQLite-backed stores.
  • Extension-owned caches and state have explicit migration hooks where ownership belongs to the plugin, including Discord, Matrix, Microsoft Teams, Telegram, QQBot, Feishu, Nostr, iMessage, and related channel/plugin surfaces.
  • SQLite VFS, tool artifact, and run artifact stores give agents a database-backed workspace/scratch option.
  • Worker-prepared agent runs use serializable storage boundaries so Node workers can open their own database-backed VFS/cache/artifact stores while gateway-owned streaming, cancellation, and reply operations stay in the parent process.

Removed File-Era Machinery

SQLite replaces these old runtime concepts:

  • session file locks and lock doctor lanes
  • whole-store sessions.json rewrite queues
  • runtime JSON/JSONL import fallbacks
  • startup session repair and pruning side effects
  • transcript truncation repair and JSONL backup writes
  • disk-budget cleanup based on transcript file existence
  • transcript locators as a runtime bridge
  • path-shaped session APIs in gateway/channel/plugin hot paths
  • test-only sessions.json row-store shims that let new runtime tests keep pretending session state was a file
  • upstream ACPX file session stores; ACPX runtime session records now go through SQLite-backed plugin state
  • auth-profile file reads/writes in normal runtime paths; legacy auth files are doctor migration inputs
  • ad hoc plugin/channel cache JSON files where the data is runtime state

Migration Model

openclaw doctor --fix and openclaw migrate are the migration boundary.

Migration builds the global and per-agent SQLite databases from legacy inputs, verifies imported rows, records migration runs, and removes old source files only after successful import. Runtime code should not silently import, repair, prune, or rewrite legacy session/cache/auth files while handling normal gateway traffic.

Migration inputs include:

  • legacy sessions.json -> per-agent session rows
  • legacy transcript *.jsonl -> per-agent transcript events/snapshots
  • legacy auth-profiles.json, auth-state.json, auth.json, and OAuth credential files -> SQLite auth-profile state where applicable
  • legacy cron jobs.json, jobs-state.json, and run JSONL files -> shared cron tables
  • legacy task and Task Flow sidecar stores -> shared task/flow tables
  • legacy plugin-state sidecars and plugin JSON caches -> shared plugin state/blob tables or plugin-owned migration hooks
  • legacy sandbox/browser registry JSON -> shared registry tables
  • legacy channel/plugin caches -> plugin-owned SQLite state through setup/doctor migration hooks

Future state changes should add schema migrations and typed stores instead of adding new sidecar files.
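
The ordering described in this section (import, verify, record, and only then remove the source) can be sketched as a small driver. All names here (`LegacyImportStep`, `runLegacyImport`) are hypothetical; the sketch shows only the ordering guarantee, not the PR's migration code.

```typescript
// Hypothetical shape of one legacy import step; every name is illustrative.
interface LegacyImportStep<T> {
  read(): T; // parse the legacy file; throws on I/O or parse errors
  importRows(data: T): number; // write rows into SQLite, return the count written
  verify(expectedRows: number): boolean; // re-read the rows and confirm the count
  recordRun(rows: number): void; // persist migration bookkeeping
  removeSource(): void; // delete the legacy file
}

type ImportOutcome = { ok: true; rows: number } | { ok: false; reason: string };

function runLegacyImport<T>(step: LegacyImportStep<T>): ImportOutcome {
  try {
    const data = step.read();
    const rows = step.importRows(data);
    if (!step.verify(rows)) {
      return { ok: false, reason: "verification failed" };
    }
    step.recordRun(rows);
    // Only reached after read, import, verify, and record all succeeded.
    step.removeSource();
    return { ok: true, rows };
  } catch (err) {
    // Any throw before removeSource leaves the legacy file in place for a retry.
    return { ok: false, reason: `import failed: ${String(err)}` };
  }
}
```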

Backup, Export, And Vacuum

Backups should be database-first archives:

  • include a compact snapshot of state/openclaw.sqlite
  • include every agents/<agentId>/agent/openclaw-agent.sqlite
  • include configuration, explicit credential source files, plugin manifests, and requested workspace/artifact exports
  • vacuum/compact database snapshots into one archive so restore has one obvious database-first path

Session export remains a support/debug/workspace-export feature, not a second canonical runtime state system.

VFS, Workers, And PI Boundary

Agent workspace state is designed for disk, hybrid, or SQLite-backed VFS operation. The per-agent database owns VFS rows, tool artifacts, run artifacts, and scoped caches so worker-prepared runs can receive a serializable storage boundary.

Worker execution stays experimental behind settings while the storage boundary settles. The intended shape is one worker per active run first; pooling can come later after lifecycle and database contention are proven.
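
A serializable storage boundary can be pictured as a plain-data descriptor that the parent hands to the worker, which then opens its own database handles. The sketch below is an assumption about what such a descriptor might contain; `RunStorageBoundary` and its fields are hypothetical, and the JSON round trip is a deliberately conservative check rather than the PR's actual validation.

```typescript
// Hypothetical descriptor; field names are illustrative, not the PR's contract.
type FilesystemMode = "disk" | "hybrid" | "sqlite";

interface RunStorageBoundary {
  agentId: string;
  sessionKey: string;
  globalDbPath: string;
  agentDbPath: string;
  filesystemMode: FilesystemMode;
}

// A boundary is treated as worker-safe if every field is a string that survives
// a JSON round trip unchanged. That rules out functions, class instances, and
// open database handles, which must be created inside the worker instead.
function isSerializableBoundary(value: RunStorageBoundary): boolean {
  const original = value as unknown as Record<string, unknown>;
  const roundTripped = JSON.parse(JSON.stringify(value)) as Record<string, unknown>;
  const keys = Object.keys(original);
  return (
    keys.length === Object.keys(roundTripped).length &&
    keys.every(
      (key) => typeof original[key] === "string" && roundTripped[key] === original[key],
    )
  );
}
```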

This also continues internalizing PI behind OpenClaw-owned facades. Session state, transcript storage, filesystem behavior, worker preparation, runtime accounting, and provider/runtime contracts are represented through OpenClaw-owned stores and contracts instead of letting PI define the durable layout.

Review Guide

For each former file-backed state owner, ask: where does this data live now?

Acceptable homes are:

  • global SQLite database
  • per-agent SQLite database
  • explicit configuration/secret file
  • real user workspace file
  • migration-only legacy input
  • debug/export-only output
  • temporary scratch selected by the agent filesystem mode

Anything else is probably leftover file-era state and should be deleted, migrated, or renamed until the boundary is clear.

Latest Validation

Current Crabbox/Blacksmith Testbox run after the latest rebase/cleanup:

Recent local/remote validation during the cleanup pass:

  • pnpm check:database-first-legacy-stores
  • pnpm db:kysely:check
  • pnpm lint:kysely
  • pnpm tsgo:core
  • pnpm tsgo:extensions
  • git diff --check
  • pnpm test src/channels/plugins/bundled.shape-guard.test.ts
  • pnpm test extensions/matrix/src/migration-config.test.ts extensions/matrix/src/matrix/sdk/idb-persistence.test.ts
  • focused session/transcript/gateway/auth/OAuth/doctor/model-auth/secrets/Matrix/WhatsApp/reset Vitest shards

Recent Testbox findings were stale assumptions rather than new product design issues: imessage setup metadata missing a legacy migration discovery hint, Matrix tests leaking SQLite plugin state across cases, Matrix IndexedDB snapshot tests using async env-scoped state around SQLite, and channel/gateway fixtures that still asserted pre-SQLite file/session paths. Those are being cleaned up as part of the current pass before this PR is considered ready.

@steipete steipete requested a review from a team as a code owner May 6, 2026 19:12
@openclaw-barnacle openclaw-barnacle Bot added the docs, channel: discord, channel: matrix, channel: mattermost, channel: slack, channel: telegram, channel: whatsapp-web, app: web-ui, gateway, cli, security, scripts, commands, agents, channel: feishu, extensions: anthropic, extensions: openai, extensions: minimax, extensions: cloudflare-ai-gateway, extensions: kimi-coding, extensions: kilocode, extensions: codex, extensions: lmstudio, size: XL, and maintainer labels May 6, 2026

clawsweeper Bot commented May 6, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR moves runtime/session/auth/plugin state from JSON, JSONL, sidecar files, and related repair paths into global and per-agent SQLite stores across gateway, CLI, macOS, SDK, docs, and tests.

Reproducibility: yes, for the blocking classes, by source inspection: legacy transcript import replaces all session rows, and several doctor importers overwrite live SQLite state without merge or recency checks. I did not run a live populated-install upgrade in this read-only review.

Real behavior proof
Needs stronger real behavior proof before merge: Older discussion includes populated-install evidence for parts of the migration, but latest head still lacks sufficient redacted terminal/log/live output or screenshot/video proof of the after-fix real upgrade path; after updating the PR body, ClawSweeper should re-review automatically or a maintainer can comment @clawsweeper re-review.

Next step before merge
Protected labels, security-sensitive migration findings, and missing latest-head real behavior proof make this a maintainer/security sequencing item rather than a narrow automated repair lane.

Security
Needs attention: The diff moves durable auth, pairing, approvals, and plugin state into SQLite while legacy import paths can still overwrite live authorization or runtime state.

Review findings

  • [P1] Preserve Node 22 compatibility or split the runtime bump — package.json:1823
  • [P1] Merge legacy transcript imports instead of replacing rows — src/commands/doctor/state-migrations.ts:304-309
  • [P1] Skip stale exec approvals when SQLite state exists — src/commands/doctor/legacy/exec-approvals.ts:44-50

Review details

Best possible solution:

Keep the database-first storage direction under maintainer/security review, make legacy imports merge-safe and source-preserving, settle the Node support decision, and require latest-head real upgrade proof before merge.

Do we have a high-confidence way to reproduce the issue?

Yes, for the blocking classes by source inspection: legacy transcript import replaces all session rows, and several doctor importers overwrite live SQLite state without merge or recency checks. I did not run a live populated-install upgrade in this read-only review.

Is this the best way to solve the issue?

No, not as-is. The architectural direction is plausible, but the current patch needs merge-safe migration semantics, a deliberate Node support decision, and latest-head real behavior proof before it is the maintainable solution.

Full review comments:

  • [P1] Preserve Node 22 compatibility or split the runtime bump — package.json:1823
    Current main declares Node >=22.16.0, but this PR raises the package engine to Node >=24.0.0. Existing supported installs can fail before they can run the SQLite migration unless maintainers explicitly approve and sequence a release-breaking runtime bump.
    Confidence: 0.88
  • [P1] Merge legacy transcript imports instead of replacing rows — src/commands/doctor/state-migrations.ts:304-309
    Each legacy transcript file is imported through replaceSqliteSessionTranscriptEvents, which deletes all existing rows for that session before inserting that file. Real installs can have multiple legacy files for one session, so earlier files or newer SQLite rows can be lost before the source JSONL is removed.
    Confidence: 0.9
  • [P1] Skip stale exec approvals when SQLite state exists — src/commands/doctor/legacy/exec-approvals.ts:44-50
    This importer writes the legacy approvals snapshot into the live SQLite KV row and removes the file without checking for existing state. A delayed doctor run can roll back newer approval policy already written through runtime paths.
    Confidence: 0.88
  • [P1] Preserve current device auth during legacy import — src/commands/doctor/legacy/device-auth-store.ts:39-41
    The legacy device-auth importer replaces the live SQLite auth snapshot and then removes the source file. If tokens were created, rotated, or revoked after upgrade but before doctor runs, this can restore stale tokens and make the rollback durable.
    Confidence: 0.86
  • [P1] Preserve existing plugin state on migration conflict — src/plugin-sdk/migration-runtime.ts:55-60
    The migration helper resolves plugin-state conflicts by replacing the existing row with the imported value. Because plugin_state_entries is live runtime state, delayed sidecar migration can overwrite newer plugin data instead of preserving or safely merging it.
    Confidence: 0.84
  • [P1] Merge channel pairing imports with live state — src/commands/doctor/legacy/channel-pairing.ts:231-247
    The channel-pairing importer replaces request and allowlist entries from legacy files and then removes the files. If SQLite already contains newer approvals or revocations, delayed doctor migration can re-add revoked senders or drop newly approved ones.
    Confidence: 0.82
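
One way to read the "merge instead of replace" finding for transcripts is an append-only merge keyed by event id. This is a minimal sketch that assumes every transcript event carries a stable `id`; `TranscriptEvent` and `mergeTranscriptEvents` are hypothetical names, not the PR's helpers.

```typescript
// Assumed event shape for illustration; the real transcript schema is not shown here.
interface TranscriptEvent {
  id: string;
  parentId: string | null;
  payload: string;
}

// Merge imported events into the rows already stored for one session.
// Existing rows win on id collisions, and nothing already stored is deleted.
function mergeTranscriptEvents(
  existing: readonly TranscriptEvent[],
  imported: readonly TranscriptEvent[],
): TranscriptEvent[] {
  const seen = new Set(existing.map((event) => event.id));
  const merged = [...existing];
  for (const event of imported) {
    if (seen.has(event.id)) continue;
    seen.add(event.id);
    merged.push(event);
  }
  return merged;
}
```

Applied once per legacy file, this keeps rows from earlier files and rows already written by the runtime, which is the property the replace-based import lacks.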

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [high] Legacy exec approvals can restore stale policy — src/commands/doctor/legacy/exec-approvals.ts:44
    importLegacyExecApprovalsFileToSqlite writes a legacy approvals snapshot over exec.approvals/current without checking for newer SQLite state, so delayed doctor runs can reintroduce stale exec decisions.
    Confidence: 0.88
  • [high] Legacy device auth can restore revoked tokens — src/commands/doctor/legacy/device-auth-store.ts:39
    importLegacyDeviceAuthFileToSqlite replaces the SQLite device-auth snapshot and removes the source file, which can roll token creation, rotation, or revocation state backward after runtime has already updated it.
    Confidence: 0.86
  • [high] Channel pairing import can roll back access state — src/commands/doctor/legacy/channel-pairing.ts:231
    The channel pairing importer replaces request and allowlist arrays from legacy files, so delayed doctor runs can re-add revoked senders or drop newly approved ones from the live SQLite pairing store.
    Confidence: 0.82
  • [medium] Plugin migration conflicts clobber live plugin state — src/plugin-sdk/migration-runtime.ts:55
    The plugin-state migration upsert uses imported values on conflict instead of preserving existing live rows, allowing stale sidecar data to overwrite current plugin state.
    Confidence: 0.84
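
The concerns above share one remedy: an import that never replaces a live row. The sketch below models "insert, but do nothing on conflict" with a plain `Map` standing in for a SQLite key-value table; `importLegacyValue` is a hypothetical helper, and a real fix might prefer a recency comparison or a field-level merge instead.

```typescript
type ImportDecision = "imported" | "kept-live";

// Models an insert that does nothing when the key already exists.
// `live` stands in for a SQLite key-value table written by the runtime.
function importLegacyValue<V>(live: Map<string, V>, key: string, legacy: V): ImportDecision {
  if (live.has(key)) {
    return "kept-live"; // newer runtime state wins over the legacy snapshot
  }
  live.set(key, legacy);
  return "imported";
}
```

For example, a delayed doctor run that finds an existing `exec.approvals/current` row would report `kept-live` and leave the approvals policy untouched.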

What I checked:

  • Protected labels: The provided GitHub context shows this open PR carries both security and maintainer labels, so cleanup automation should not close or auto-merge it without explicit maintainer handling. (eb5d4f6b0bc6)
  • Current main supports Node 22: Current main still declares Node >=22.16.0, while the PR raises the package engine to Node >=24.0.0. (package.json:1804, 115049753d59)
  • PR raises runtime engine: At PR head, package.json requires Node >=24.0.0, which is a release/support decision rather than an incidental storage refactor detail. (package.json:1823, eb5d4f6b0bc6)
  • Transcript import replaces rows: importLegacyTranscriptFileToSqlite calls replaceSqliteSessionTranscriptEvents, so multiple legacy files for one session or delayed import after SQLite writes can delete prior rows for that session. (src/commands/doctor/state-migrations.ts:304, eb5d4f6b0bc6)
  • Exec approvals import overwrites live KV: The legacy exec approvals importer writes the file snapshot into exec.approvals/current and removes the file without checking for existing SQLite state. (src/commands/doctor/legacy/exec-approvals.ts:44, eb5d4f6b0bc6)
  • Device auth import overwrites live auth snapshot: The legacy device-auth importer writes the parsed legacy token store into SQLite and then removes the source file, with no merge or recency check. (src/commands/doctor/legacy/device-auth-store.ts:39, eb5d4f6b0bc6)

Likely related people:

  • Peter Steinberger: The available current-main history and blame for sampled session/plugin migration files point to Peter Steinberger, and the provided PR commit list shows the SQLite refactor branch commits are authored by steipete. (role: recent area contributor and PR branch author; confidence: medium; commits: 0636bbb12455, 9444b2ad9b54, eb5d4f6b0bc6; files: src/config/sessions/transcript-append.ts, src/plugin-sdk/migration-runtime.ts, src/commands/doctor/state-migrations.ts)
  • jalehman: The PR discussion includes a detailed hardening prompt and compatibility review for the SQLite runtime boundary, including transaction, backup, migration, and companion transcript reader concerns. (role: reviewer / adjacent storage-hardening contributor; confidence: medium; files: src/state/openclaw-state-db.ts, src/config/sessions/transcript-store.sqlite.ts, src/commands/doctor/state-migrations.ts)
  • 100yenadmin: The PR discussion maps companion-seam and correctness follow-up work to separate OpenClaw PRs and issues, making this person useful for routing the downstream compatibility lane rather than the security blockers. (role: follow-up implementer / routing contributor; confidence: medium; files: src/gateway/session-transcript-readers.ts, src/wizard/setup.migration-import.ts, src/auto-reply/reply/export-html/template.js)

Remaining risk / open question:

  • Protected security and maintainer labels require explicit maintainer handling before merge or close.
  • Delayed migration paths can overwrite newer SQLite authorization, pairing, plugin, and transcript state with stale legacy files.
  • The Node 24 engine bump may be intentional, but it needs explicit release/support sequencing because current main still supports Node 22.
  • The PR is very large, so the sampled blockers may not cover every remaining stale file-era migration path.
  • Latest-head real upgrade proof is still missing or insufficient for a branch that migrates durable runtime state.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 115049753d59.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c536838794

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (!isSqliteSessionStoreBackendEnabled(env)) {
  return null;
}
const agentId = resolveAgentIdFromSessionStorePath(storePath) ?? DEFAULT_AGENT_ID;

[P1] Derive SQLite agent scope from configured store path

When session.store uses a non-canonical path (for example a custom template not shaped like .../agents/<id>/sessions/sessions.json), this code silently falls back to DEFAULT_AGENT_ID. After this commit moved runtime session reads/writes to SQLite, that fallback causes all such stores to be loaded/saved under the main SQLite partition, so non-main agent data can be mixed into the wrong store scope and agent-specific queries can return incorrect sessions.
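
A sketch of the alternative to a silent default: return an explicit "unresolved" result when the store path is not canonical and no agent id is otherwise known, so the caller has to decide. `resolveAgentScope` and the regular expression are hypothetical; the canonical layout is taken from the path shape quoted in this comment.

```typescript
// Hypothetical parser; the canonical layout comes from the review comment above.
const CANONICAL_STORE = /(?:^|\/)agents\/([^/]+)\/sessions\/sessions\.json$/;

type AgentScope =
  | { kind: "resolved"; agentId: string }
  | { kind: "unresolved"; storePath: string };

function resolveAgentScope(storePath: string, configuredAgentId?: string): AgentScope {
  // Normalize Windows separators before matching.
  const match = CANONICAL_STORE.exec(storePath.replace(/\\/g, "/"));
  const fromPath = match?.[1];
  if (fromPath) return { kind: "resolved", agentId: fromPath };
  if (configuredAgentId) return { kind: "resolved", agentId: configuredAgentId };
  // No silent fallback to a default agent: the caller must handle this case.
  return { kind: "unresolved", storePath };
}
```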

Comment on lines 765 to +767
const sessionsTargetDir = path.join(stateDir, "agents", targetAgentId, "sessions");
const sessionsTargetStorePath = path.join(sessionsTargetDir, "sessions.json");
const legacySessionEntries = safeReadDir(sessionsLegacyDir);
const hasTargetJsonSessionStore = includeSessions && fileExists(sessionsTargetStorePath);

[P1] Detect legacy session JSON using session.store override

Doctor legacy migration detection is hardcoded to $STATE_DIR/agents/<defaultAgent>/sessions/sessions.json, so installs that previously persisted session metadata at a custom session.store path are skipped during openclaw doctor --fix. Because this commit removed runtime JSON session-store fallback, those users can lose visible session history after upgrading until they manually import from the custom path.
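
One possible shape for the detection fix is to build a list of candidate paths that includes any configured override alongside the canonical location. This is a sketch only: `legacySessionStoreCandidates` is a hypothetical helper, and it assumes the override has already been expanded to a concrete path (template expansion is out of scope here).

```typescript
// Hypothetical helper: list every place a legacy sessions.json might live.
// Assumes `sessionStoreOverride` is already an expanded, concrete path.
function legacySessionStoreCandidates(
  stateDir: string,
  agentId: string,
  sessionStoreOverride?: string,
): string[] {
  const canonical = `${stateDir}/agents/${agentId}/sessions/sessions.json`;
  const candidates = [canonical];
  if (sessionStoreOverride && sessionStoreOverride !== canonical) {
    candidates.push(sessionStoreOverride);
  }
  return candidates;
}
```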

@openclaw-barnacle openclaw-barnacle Bot added the extensions: memory-core label May 6, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c567352a42

Comment on lines +35 to +43
async function withTranscriptAppendQueue<T>(queueKey: string, fn: () => Promise<T>): Promise<T> {
  const previous = transcriptAppendQueues.get(queueKey) ?? Promise.resolve();
  let releaseCurrent!: () => void;
  const current = new Promise<void>((resolve) => {
    releaseCurrent = resolve;
  });
  const tail = previous.catch(() => undefined).then(() => current);
  transcriptAppendQueues.set(queueKey, tail);
  await previous.catch(() => undefined);

[P1] Restore cross-process locking for transcript appends

appendSessionTranscriptMessage now serializes writes only through the in-memory transcriptAppendQueues map, so concurrent writers from different Node processes are no longer coordinated. When a gateway process and another process (e.g., CLI/worker) append to the same agentId/sessionId at the same time, both can read the same pre-append event list and then insert independently, which can duplicate idempotent messages and produce conflicting parentId chains in SQLite transcripts.
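
One way to make the race visible at the store rather than in a per-process queue is a conditional append: the writer states which tail it saw, and the store rejects the write if the tail has moved or the id already exists. The class below is an in-memory model of that check, with hypothetical names (`TranscriptChain`, `appendIfTail`). In SQLite, the check and the insert would have to run inside one write transaction, backed by a unique constraint on the event id.

```typescript
interface ChainEvent {
  id: string;
  parentId: string | null;
}

type AppendResult =
  | { ok: true; event: ChainEvent }
  | { ok: false; currentTailId: string | null };

// In-memory model of a transcript with a single parentId chain.
class TranscriptChain {
  private ids = new Set<string>();
  private tail: string | null = null;

  tailId(): string | null {
    return this.tail;
  }

  // Append only if the caller's view of the tail is still current and the id is unused.
  appendIfTail(expectedTailId: string | null, id: string): AppendResult {
    if (this.tail !== expectedTailId || this.ids.has(id)) {
      return { ok: false, currentTailId: this.tail };
    }
    const event: ChainEvent = { id, parentId: expectedTailId };
    this.ids.add(id);
    this.tail = id;
    return { ok: true, event };
  }
}
```

A writer that loses the race gets the current tail back and can retry against it, so the chain never forks and a replayed idempotent message is rejected instead of duplicated.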


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb5d4f6b0b

Comment on lines +35 to +39
writePairingStateRecord({
  baseDir,
  subdir: "devices",
  key: "bootstrap",
  value: parsed as DeviceBootstrapState,

[P1] Skip stale device-bootstrap import when SQLite state exists

This migration always writes the legacy devices/bootstrap.json snapshot into pairing.devices/bootstrap and then removes the source file. Bootstrap tokens are live runtime state (issued/revoked/cleared via persistState in src/infra/device-bootstrap.ts), so a delayed openclaw doctor --fix can overwrite newer token/profile data with stale file-era content and make that rollback durable after deletion.

Comment on lines +73 to +75
writeSubagentRegistryRunsSnapshot(runs, env);
try {
  fs.unlinkSync(pathname);

[P1] Avoid overwriting live subagent runs from legacy snapshot

The importer unconditionally writes legacy subagents/runs.json records into SQLite and then unlinks the file. Subagent rows are updated during normal lifecycle persistence (persistSubagentRunsToState/saveSubagentRegistryToState), so if doctor runs later, matching run_id rows can be rolled back to stale metadata (cleanup/outcome/announce fields) and the rollback becomes permanent once the legacy file is deleted.

if (record && (await importLegacyManagedImageRecord(record, stateDir))) {
  records += 1;
}
await fs.rm(recordPath, { force: true }).catch(() => {});

[P2] Keep managed-image record files when a row fails import

Each legacy managed-image record file is removed even when parsing fails or importLegacyManagedImageRecord returns false (for example, malformed JSON or missing source image). That permanently discards skipped record metadata, so operators cannot repair the bad rows and rerun migration to recover those attachments.

  .map((entry) => entry.name);
let imported = 0;
for (const fileName of files) {
  const raw = await runsRoot.readText(fileName).catch(() => "");

[P2] Do not delete cron log files after a read failure

When reading a legacy cron run-log file fails, the code coerces content to "", imports zero entries, and still deletes the file. In transient I/O/permission error scenarios this silently loses the only recoverable log data, preventing a retry after fixing the underlying read issue.
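
The underlying problem is that `.catch(() => "")` makes a failed read indistinguishable from an empty file. A minimal sketch of keeping the two cases apart follows; `CronLogDeps`, `tryRead`, and `migrateCronLog` are hypothetical and use synchronous calls for brevity.

```typescript
// A read either succeeds with text or fails with an error; never coerce to "".
type ReadOutcome =
  | { ok: true; text: string }
  | { ok: false; error: unknown };

function tryRead(read: () => string): ReadOutcome {
  try {
    return { ok: true, text: read() };
  } catch (error) {
    return { ok: false, error };
  }
}

// Hypothetical dependency seam standing in for the real run-log reader/importer.
interface CronLogDeps {
  read(fileName: string): string;
  importEntries(lines: string[]): number;
  remove(fileName: string): void;
}

function migrateCronLog(
  fileName: string,
  deps: CronLogDeps,
): { imported: number; deleted: boolean } {
  const outcome = tryRead(() => deps.read(fileName));
  if (!outcome.ok) {
    // Transient I/O or permission error: keep the file so a later run can retry.
    return { imported: 0, deleted: false };
  }
  const lines = outcome.text.split("\n").filter((line) => line.trim() !== "");
  const imported = deps.importEntries(lines);
  deps.remove(fileName);
  return { imported, deleted: true };
}
```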

@100yenadmin
Contributor

100yenadmin commented May 10, 2026

@steipete @jalehman One maintainer-readable, agent-runnable map for the companion-fit work around #78595.

Architecture decision

The right long-term shape is:

  • OpenClaw SQLite remains the canonical operational runtime store
  • Lossless Claw / LCM remains a separate downstream companion DB
  • OpenClaw exposes a few small generic read/discovery/projection seams
  • LCM adapts to those seams instead of pushing its own summary/context/GC tables into OpenClaw core

That keeps the boundary clean:

  • OpenClaw owns runtime truth, session continuity policy, and transcript replay semantics
  • LCM owns richer downstream identity, message_parts, summaries, recall, and maintenance behavior

Why this is better for OpenClaw, not just for LCM

  • future first-party memory/search/export/audit features get stable read seams instead of private one-off glue
  • reset/rotation/multi-entity continuity stays core-owned and consistent
  • advanced consumers do not have to scan blobs or reimplement private active-branch logic
  • the database-first runtime stays canonical instead of pulling JSONL/file identity back into core

Current stack map

flowchart LR
    A["#78595 database-first runtime"] --> B["#79971 runtime-truth correctness"]
    A --> C["#79970 selection seam"]
    A --> D["#79972 replay seam"]
    B --> E["#79905 typed projection seam"]
    C --> E
    D --> E
    D --> F["LCM #646 SQLite frontier"]
    C --> G["LCM #642 identity mapping"]
    D --> H["LCM #643 message_parts reconstruction"]
    H --> I["LCM #644 rotate/checkpoint/GC remap"]

What is already implemented and why it exists

1. #79971 — runtime-truth correctness follow-up

PR: #79971

Why it exists:

  • OpenClaw runtime truth needed to be correct before any companion layering mattered
  • stale JSONL-era assumptions were still leaking into setup, doctor, export, and lightweight transcript readers

What it does:

  • SQLite-aware onboarding freshness
  • SQLite-as-truth doctor behavior
  • canonical session header export
  • safer rotated checkpoint cleanup
  • active-branch-safe recent reader behavior

Direct review follow-up/proof:

2. #79970 — canonical session selection seam

PR: #79970

Why it exists:

  • sessionId selection policy and ambiguity handling should not be reimplemented by every caller
  • OpenClaw should own canonical selection semantics without absorbing LCM’s richer lineage storage model

What it does:

  • exposes SessionIdMatchSelection
  • adds richer run-key selection
  • preserves active run-context semantics

Direct review follow-up/proof:

3. #79972 — canonical replay seam

PR: #79972

Why it exists:

  • file offsets / mtime are not a durable replay contract in a database-first runtime
  • consumers need a typed frontier/delta seam over canonical SQLite transcript ownership

What it does:

  • exposes transcript frontier/delta helpers
  • keeps the seam additive
  • restores compatibility exports on the public subpath
  • keeps raw DB handles off the public seam
  • forces reset on same-millisecond rewrites via a monotonic write floor
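
To illustrate why a monotonic write floor matters for the replay seam: if rewrites are stamped with a millisecond clock alone, two rewrites in the same millisecond carry the same stamp and a consumer cannot tell that its frontier is stale. The sketch below is an assumption-laden illustration, not the helpers from #79972; `Frontier`, `WriteFloor`, and `needsReset` are hypothetical names.

```typescript
// Hypothetical frontier shape a consumer might persist between reads.
interface Frontier {
  rewriteStamp: number; // changes only when the transcript is rewritten
  eventCount: number; // grows on append
}

// Issues strictly increasing stamps even when the clock repeats or goes backwards.
class WriteFloor {
  private last = 0;

  next(nowMs: number): number {
    this.last = Math.max(nowMs, this.last + 1);
    return this.last;
  }
}

// A consumer must discard its delta position and re-read from the start when
// the transcript was rewritten or shrank since the frontier was recorded.
function needsReset(consumer: Frontier, current: Frontier): boolean {
  return (
    consumer.rewriteStamp !== current.rewriteStamp ||
    current.eventCount < consumer.eventCount
  );
}

const floor = new WriteFloor();
const first = floor.next(1000); // 1000
const second = floor.next(1000); // 1001, same millisecond but still distinct
```

With the floor in place, `needsReset` sees different stamps for `first` and `second`, so the second rewrite forces a reset instead of being read as "no change".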

Direct review follow-up/proof:

4. lossless-claw#646 — first downstream proof consumer

PR: Martian-Engineering/lossless-claw#646

Why it exists:

  • it proves the architecture is executable, not just theoretical
  • LCM can now bootstrap/resume from a SQLite frontier while keeping JSONL fallback for older hosts

What remains

Remaining upstream OpenClaw seam

  • #79905 — typed transcript projections/helpers plus rebuild contract

Why this is still needed:

  • LCM should not have to duplicate active-branch traversal and raw event archaeology forever
  • OpenClaw already has the right internal raw materials, but the reusable typed seam is still missing

Remaining downstream LCM work

  • #642 - family + segment + runtime-session
  • #643 - message_parts and tool/result reconstruction
  • #644 — rotate/checkpoint/GC remap

Confidence read

After the latest implementation + re-verification loop:

  • the live review findings on #79970, #79971, and #79972 are addressed and pushed
  • each PR body now includes direct runtime proof, not only unit tests
  • I’m above 95% confident there is no additional hidden upstream OpenClaw core/schema parity survivor beyond #79905

That is not the same as saying the whole stack is merged or review-clean yet.
It means the remaining work now looks like known work, not hidden architecture debt.

Agent-ready routing

If Peter or another agent is going to keep moving this refactor, this is the clean routing:

Keep separate

Recommended order

  1. Keep the direct correctness/security review lane on #78595 separate.
  2. Carry forward and merge-harden #79971.
  3. Carry forward and merge-harden #79970.
  4. Carry forward and merge-harden #79972.
  5. Implement #79905.
  6. Continue downstream LCM work in order: #642 -> #643 -> #644.

What not to do

  • do not pull JSONL/file-locator identity back into runtime APIs
  • do not absorb LCM summary/context/GC tables into OpenClaw core
  • do not blend the #78595 correctness/security lane into the seam-feature lane

If useful, I can turn this same map into one shorter “send this to Peter” checklist comment next, but this is the full maintainer/agent handoff version.

@100yenadmin
Contributor

Final implementation map for the companion-fit / Lossless Claw migration stack.

Architecture decision remains:

  • OpenClaw SQLite stays the canonical operational runtime store
  • Lossless Claw stays a downstream companion DB that ingests and projects from OpenClaw's canonical state

Implemented OpenClaw slices:

Implemented LCM slices:

Why some of the later implementation PRs live in the forks:

  • GitHub would allow cross-repo PRs into the upstream repos when the base is main, but the create API rejected fork-only stacked baseRefNames.
  • Publishing these as clean stacked fork PRs preserves the exact diff each agent should work from.
  • That is better than opening inflated upstream PRs that silently re-include predecessor slices.

Recommended execution/review order:

  1. #79971
  2. #79970
  3. #79972
  4. 100yenadmin/openclaw#1
  5. lossless-claw#646
  6. 100yenadmin/lossless-claw#3
  7. 100yenadmin/lossless-claw#1
  8. 100yenadmin/lossless-claw#2

For Peter / agent handoff:

  • use the PR bodies first, because they now contain the why, boundaries, diagrams, validation commands, and stack placement
  • use the upstream umbrella issues and #78595 for routing
  • do not collapse OpenClaw into an LCM-shaped core schema
  • do not treat LCM as the raw transcript owner once SQLite is canonical

Labels

agents Agent runtime and tooling app: macos App: macos app: web-ui App: web-ui channel: discord Channel integration: discord channel: feishu Channel integration: feishu channel: googlechat Channel integration: googlechat channel: imessage Channel integration: imessage channel: irc channel: line Channel integration: line channel: matrix Channel integration: matrix channel: mattermost Channel integration: mattermost channel: msteams Channel integration: msteams channel: nextcloud-talk Channel integration: nextcloud-talk channel: nostr Channel integration: nostr channel: qa-channel Channel integration: qa-channel channel: qqbot channel: signal Channel integration: signal channel: slack Channel integration: slack channel: synology-chat channel: telegram Channel integration: telegram channel: tlon Channel integration: tlon channel: twitch Channel integration: twitch channel: voice-call Channel integration: voice-call channel: whatsapp-web Channel integration: whatsapp-web channel: zalo Channel integration: zalo channel: zalouser Channel integration: zalouser cli CLI command changes commands Command implementations docker Docker and sandbox tooling docs Improvements or additions to documentation extensions: acpx extensions: anthropic extensions: cloudflare-ai-gateway extensions: codex extensions: device-pair extensions: diagnostics-otel Extension: diagnostics-otel extensions: kilocode extensions: kimi-coding extensions: llm-task Extension: llm-task extensions: lmstudio extensions: memory-core Extension: memory-core extensions: memory-wiki extensions: minimax extensions: openai extensions: phone-control extensions: qa-lab extensions: tts-local-cli gateway Gateway runtime maintainer Maintainer-authored PR plugin: azure-speech Azure Speech plugin plugin: file-transfer plugin: google-meet plugin: migrate-claude plugin: migrate-hermes scripts Repository scripts security Security documentation size: XL

6 participants