Skip to content

feat(plugin-sdk): add persistent keyed store helper#70626

Closed
amknight wants to merge 8 commits intomainfrom
ak/persistent-keyed-store-sdk
Closed

feat(plugin-sdk): add persistent keyed store helper#70626
amknight wants to merge 8 commits intomainfrom
ak/persistent-keyed-store-sdk

Conversation

@amknight
Copy link
Copy Markdown
Member

@amknight amknight commented Apr 23, 2026

Summary

  • Problem: core and plugins do not have one shared SDK primitive for small restart-safe keyed registries, so each consumer would need to re-implement file layout, locking, TTL cleanup, and corrupt-file handling.
  • Why it matters: we want to be able to safely restart the gateway and maintain necessary lifecycle state
  • What changed: added openclaw/plugin-sdk/persistent-keyed-store with register / lookup / consume / entries / clear, TTL pruning, maxEntries eviction, deterministic enumeration, corrupt/higher-version quarantine, and same-process + cross-process serialization.
  • What changed: added focused unit coverage for restart survival, TTL/default TTL, consume/clear, deterministic ordering, quarantine behavior, namespace validation, overwrite behavior, and concurrent writes.
  • What did NOT change (scope boundary): no consumer migrations, no lifecycle harness changes, and no inbound dedupe changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

User-visible / Behavior Changes

  • Plugins and core can now import openclaw/plugin-sdk/persistent-keyed-store.
  • The new helper persists small keyed registries under the lifecycle state directory with bounded growth, deterministic enumeration, and quarantine-on-corruption behavior.

Diagram (if applicable)

Before:
[consumer] -> [custom per-feature persistence / no persistence]

After:
[consumer] -> [persistent-keyed-store]
           -> [same-process serialized access]
           -> [withFileLock(store.json)]
           -> [stateDir/lifecycle/<namespace>/store.json]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 + pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): temp plugin-sdk state dirs created by test harness

Human Verification (required)

  • Verified scenarios:
    • pnpm test src/plugin-sdk/persistent-keyed-store.test.ts
    • pnpm tsgo
    • pnpm lint:core
    • pnpm plugin-sdk:check-exports
    • pnpm plugin-sdk:api:check
    • pnpm build
    • pnpm check:architecture
  • Edge cases checked:
    • corrupt JSON quarantine
    • higher-version quarantine
    • invalid namespace rejection
    • same-process concurrent register serialization across two instances
    • TTL/default TTL expiry and maxEntries pruning
  • What you did not verify:
    • no consumer integration was wired in this PR
    • no restart lifecycle scenario harness work yet

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: future consumers may need semantics beyond the v1 helper contract.
    • Mitigation: the helper intentionally ships with zero consumer migrations, narrow API surface, and contract tests covering TTL/quarantine/ordering/concurrency semantics first.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation scripts Repository scripts size: L maintainer Maintainer-authored PR labels Apr 23, 2026
@amknight amknight force-pushed the ak/persistent-keyed-store-sdk branch from a2ea03c to ce9d5c3 Compare April 24, 2026 01:53
@amknight
Copy link
Copy Markdown
Member Author

amknight commented Apr 24, 2026

Migration note after checking the prior issues/PRs:

I dug into Slack first. This PR looks like the first generic plugin-sdk keyed-store proposal, but several channel PRs independently solve the same durable-keyed-state problem.

Slack is the cleanest first dogfood candidate:

That shape maps well onto this helper: namespace like slack.thread-participation, id ${accountId}:${channelId}:${threadTs}, record { agentId, repliedAt }, with the Slack TTL/cap policy passed through store options. The main migration detail is that today recordSlackThreadParticipation is synchronous, so we would either make that path async or intentionally fire-and-forget the durable write while keeping the existing in-memory cache as the hot path. Also keep the storage migration separate from the policy semantics in #64277 / thread.requireExplicitMention; persistence should not decide whether prior participation counts as a mention.

Other good migration candidates:

  • Discord component registry, Discord: persist component registry to disk across gateway restarts #66241. This independently adds versioned JSON persistence, TTL pruning, caps, validation, atomic writes, and consume-on-resolve behavior. It should map well to two keyed-store namespaces (discord.components and discord.modals) using register, lookup, and consume. The important constraint is to persist only the safe public registry shape and preserve the current peek-vs-consume behavior.
  • MS Teams sent-message cache. extensions/msteams/src/sent-message-cache.ts keeps a 24h process-global Map under Symbol.for("openclaw.msteamsSentMessages"); extensions/msteams/src/monitor-handler/message-handler.ts records bot-sent message IDs and later treats replies to those IDs as reply_to_bot. Restart drops that state, unlike Telegram's analogous sent-message cache in extensions/telegram/src/sent-message-cache.ts, which is already file-backed. This should map cleanly to a keyed-store namespace like msteams.sent-messages keyed by ${conversationId}:${messageId} with the same TTL.
  • Matrix approval reaction targets. extensions/matrix/src/approval-reactions.ts stores roomId:eventId -> { approvalId, allowedDecisions } in an in-memory Map; extensions/matrix/src/approval-handler.runtime.ts registers the target and extensions/matrix/src/matrix/monitor/reaction-events.ts resolves reactions against it. A gateway restart leaves the Matrix approval prompt visible, but reactions can no longer resolve. This is close to the same class as Discord buttons, though the stored record should stay narrow and only cover the reaction target metadata needed to resolve a pending approval.
  • BlueBubbles inbound dedupe, fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66816 and BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426. This is strong evidence for the shared primitive: BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426 fixed the same class of in-process read/modify/write lost-update race that this store now handles with a canonicalized same-process queue. I would not directly rewrite BlueBubbles onto PersistentKeyedStore yet, because its current persistent-dedupe API has claim/commit/release and same-process memory semantics that a plain lookup/register pair would not preserve atomically. Better follow-up: share this store's path normalization, file-lock, and atomic JSON hardening, or rebuild persistent-dedupe on top of a keyed-store-level atomic claim/check operation if we add one.
  • Feishu persistent dedupe, fix(feishu): persist dedup cache across gateway restarts via warmup #31605. Similar to BlueBubbles: restart-safe dedupe plus warmup is the same durable-keyed-state need, but direct migration should wait until we can preserve warmup and memory-fast-path behavior without weakening dedupe atomicity.

Suggested order after #70626 lands: migrate Slack thread participation first, then Discord component registry, then MS Teams sent-message cache / Matrix approval reaction targets, then decide whether persistent-dedupe should share this helper internally rather than migrating BlueBubbles/Feishu call sites directly.

@amknight amknight force-pushed the ak/persistent-keyed-store-sdk branch from d10e286 to 6b62abf Compare April 24, 2026 02:55
@amknight amknight force-pushed the ak/persistent-keyed-store-sdk branch 3 times, most recently from 5a120a3 to ebe44e6 Compare April 27, 2026 11:23
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Apr 27, 2026
Three concrete fixes for clawsweeper bot's review on PR openclaw#33845:

1. **Stop leaking test-only helpers through the public API barrel.**
   extensions/slack/api.ts previously did 'export * from
   ./src/sent-thread-cache.js', which re-exported the underscored
   test-only helpers (_flushPersist, _resetForTests) as part of the
   slack extension's public API. Same-process consumers could call
   _resetForTests('/some/path') to redirect the persist path and
   trigger arbitrary writes/renames/unlinks. Replace the wildcard
   with an explicit list of public exports
   (recordSlackThreadParticipation, hasSlackThreadParticipation,
   clearSlackThreadParticipationCache).

2. **Reinsert on update so Map insertion order reflects recency.**
   recordSlackThreadParticipation called Map.set on existing keys,
   which updates the value but keeps the key in its original
   insertion-order position. The serialise-time MAX_ENTRIES cap sorts
   by timestamp so this didn't affect on-disk output, but any future
   iteration-order-based eviction would silently use the wrong
   recency. Switch to delete-then-set so insertion order matches
   actual recency.

3. **Cap in-memory map size on insert, not just on serialise.**
   The pre-existing cap was applied only at serialisation time so
   busy workspaces accumulating thousands of unique threads between
   debounced persists could grow the in-memory Map unboundedly.
   Add insertion-time eviction: once persistState.entries.size
   exceeds MAX_ENTRIES, evict the oldest entry by Map insertion
   order (which now correctly reflects recency thanks to fix #2).

New tests:
- 'reinserts on update so Map insertion order reflects recency':
  records A, then B, then re-records A. Verifies on-disk insertion
  order is [B, A] (re-recorded A moves to most-recent position).
- 'caps in-memory map size on insert (not just on serialise)':
  records 5005 unique threads, verifies the persisted file has
  exactly 5000 entries with the oldest 5 evicted.

Tests: 17/17 pass (was 15; +2 new).

Not addressed in this commit (acknowledged in the PR-comment
follow-up): clawsweeper also recommended migrating from the
hand-rolled persistence to src/plugin-sdk/persistent-dedupe.ts
(or a future persistent-keyed-store from PR openclaw#70626) to inherit
file locking, atomic writes, bounded/quarantined parsing, and
the shared storage contract. That refactor is non-trivial because
the SDK helper is async while the current public API is sync;
maintainer guidance is needed on whether to make the API async
or wrap with sync semantics. Tracked as a follow-up.
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Apr 27, 2026
opus-4-7 independent review found three issues missed by earlier
review rounds:

P1 — _persistPathOverride was still module-local
  Despite the Codex thread reply claiming the override moved into the
  shared persistState global, it was still a module-local 'let'. Tests
  passed by accident: in-memory dedupe shared via persistState.entries,
  but disk side effects from re-imported module copies leaked into
  STATE_DIR instead of the per-test temp dir. Moved into
  PersistState.persistPathOverride on the global singleton.

P1 — CWE-59 symlink-follow on tmpPath (Aisle medium #2, unaddressed)
  tmpPath was ${filePath}.${process.pid}.tmp + default 'w' flag. A
  predictable name combined with a truncating O_CREAT would let an
  attacker who can write to the state dir pre-create a symlink at the
  tmp path pointing to e.g. /etc/passwd, and our write would follow it.
  Now uses randomUUID() for the name AND flag 'wx' (O_CREAT | O_EXCL)
  so any existing entry — file or symlink — fails the write safely.

P2 — hydration ignored MAX_ENTRIES (Aisle medium #1)
  The 4 MB file-size cap permits ~250K entries through, blowing past
  the 5K in-memory MAX_ENTRIES on first load. Now collects valid
  candidates first, sorts DESC by ts, and slices to MAX_ENTRIES before
  inserting into the dedupe cache. Bounds memory regardless of file
  size up to MAX_PERSIST_FILE_BYTES.

Tests: 26/26 pass (3 new tests covering each fix).

Architectural note (per opus review): Land standalone. Don't migrate
to plugin-sdk/persistent-dedupe — its hasRecent reads JSON on every
miss, adding ~2-10ms to every Slack message in a thread the bot
hasn't replied to (the dominant path for hasSlackThreadParticipation
on the message-dispatch hot path). Warm-cache + debounced-write is
correct here. Don't block on openclaw#70626.
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Apr 27, 2026
gpt-5.5 independent review (running in parallel, no shared context with
opus) confirmed both opus P1s and opus P2-A. It also surfaced three
additional findings worth landing in this PR:

P2-B — Duplicated persist logic between schedulePersist callback and
  _flushPersist. Maintenance trap: any future hardening had to be
  mirrored in two places. Extracted into a shared persistToDisk()
  function called from both. Single source of truth for prune/cap/
  serialize/atomic-write.

P2-C / P3-C — Future timestamps from a crafted persist file (or a
  backwards system-clock jump) survive forever. The previous guard
  was 'now - ts < TTL_MS' which trivially holds for ts > now.
  Added 'ts > 0 && ts <= now' to the hydration filter so:
   * future-near (now + 60s)        -> rejected
   * future-far (MAX_SAFE_INTEGER)  -> rejected
   * zero / negative                -> rejected
   * monotonic-clock-skew tolerance -> small backwards jumps still
     reject the affected entries on next hydrate; corrected on next
     write cycle. Trade-off vs. monotonic-clock complexity is
     intentional.

P2-D — mkdirSync called with no mode flag -> directory inherits umask
  (often 0o755 — group/other readable). Now passes mode 0o700 to
  match the SDK helper and complement the 0o600 file mode. Multi-tenant
  hosts can no longer enumerate or symlink-squat the persist directory.

P3-B — Iteration-order inversion after over-capacity slice. When >
  MAX_ENTRIES candidates were sorted DESC + sliced, the resulting
  Map insertion order was ts-DESC, but the eviction loop in
  recordSlackThreadParticipation assumes Map insertion order = ts-ASC
  (oldest first). Without this fix, the first post-hydration eviction
  drops the NEWEST entry instead of the oldest. Added an ASC re-sort
  on both hydration and serialize paths so insertion order always
  matches recency.

Tests: 29/29 pass (4 new tests covering each finding).

P3-A (TOCTOU stat->read race) deferred to follow-up: low-impact given
the 4 MB file cap already bounds the read, and a fix requires
fs.openSync + fstatSync + readSync changes that don't align with the
current stat-then-read structure. Worth its own PR.

Per gpt-5.5 architectural call: keep this PR standalone; converge
with plugin-sdk/persistent-keyed-store (openclaw#70626) as a follow-up after
this lands. The sync->async migration is straightforward (3 call sites,
all in async contexts) but should be coordinated with maintainers.
@amknight amknight force-pushed the ak/persistent-keyed-store-sdk branch from ebe44e6 to 36281ff Compare April 29, 2026 03:50
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve.

Keep this PR open. Current main does not yet expose openclaw/plugin-sdk/persistent-keyed-store, and the proposed helper is still an open draft PR. It is also MEMBER-authored and carries the protected maintainer label, so this cleanup workflow should not close it even if it looked closeable. The PR discussion usefully maps follow-up consumers such as Slack thread participation, Discord component registries, MS Teams sent-message caches, Matrix approval reaction targets, and dedupe-related stores.

Best possible solution:

Keep this PR open for maintainer review. If accepted, land the narrow generic SDK subpath with package export, entrypoint metadata, API baseline, docs/changelog, focused storage/locking tests, and a storage-security review; then migrate the linked channel-specific registries in separate owner-scoped PRs.

What I checked:

  • Current main export map lacks the new subpath: The current package export map includes ./plugin-sdk/json-store, ./plugin-sdk/persistent-dedupe, and ./plugin-sdk/keyed-async-queue, but no ./plugin-sdk/persistent-keyed-store entry. (package.json:928, 7ddd815e469e)
  • Current SDK entrypoint list lacks the helper: The generated SDK entrypoint list contains json-store, persistent-dedupe, and keyed-async-queue, but not persistent-keyed-store, so the proposed public SDK import is not implemented on current main. (scripts/lib/plugin-sdk-entrypoints.json:216, 7ddd815e469e)
  • Current docs do not document the new helper: The runtime/storage SDK subpath catalog documents plugin-sdk/json-store, plugin-sdk/file-lock, and plugin-sdk/persistent-dedupe; the proposed plugin-sdk/persistent-keyed-store row is absent on current main. Public docs: docs/plugins/sdk-subpaths.md. (docs/plugins/sdk-subpaths.md:212, 7ddd815e469e)
  • No current source/test/docs symbol match: A repository search for persistent-keyed-store, PersistentKeyedStore, createPersistentKeyedStore, persistent keyed, and keyed store in package/source/scripts/docs/changelog returned no matches on current main. (7ddd815e469e)
  • Closest current helper is narrower than the PR goal: Current createPersistentDedupe provides claim/check/warmup-style disk-backed dedupe with an in-process file write queue, but it does not provide a generic namespaced keyed registry with register / lookup / consume / entries / clear. (src/plugin-sdk/persistent-dedupe.ts:149, 7ddd815e469e)
  • Protected routing context: The provided GitHub context shows this PR is open, draft, unmerged, authored by a MEMBER, and labeled maintainer; the review policy says such items stay open for explicit maintainer judgment. (7ddd815e469e)

Likely related people:

  • Vincent Koc: Local blame on the current main snapshot attributes the central plugin SDK storage/export/doc lines inspected here to commit df9d26eb, so this is a current-main routing signal for the touched area, though the local history is shallow/grafted and not enough to prove original authorship. (role: recent maintainer; confidence: low; commits: df9d26eb4335; files: src/plugin-sdk/persistent-dedupe.ts, src/plugin-sdk/file-lock.ts, package.json)
  • omarshahine: Provided related merged PRs fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66816 and BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426 added and then hardened persistent channel dedupe behavior, including the lost-update queue concern that the proposed helper generalizes for small restart-safe registries. (role: adjacent owner; confidence: medium; commits: cbeb3ca13025, 39e3cf1df559; files: src/plugin-sdk/persistent-dedupe.ts, extensions/bluebubbles/src/inbound-dedupe.ts)
  • Sid-Qin: Provided related merged PR fix(feishu): persist dedup cache across gateway restarts via warmup #31605 added persistent dedupe warmup and Feishu restart-survival behavior, which is one of the existing disk-backed lifecycle-state patterns adjacent to this SDK helper. (role: introduced adjacent behavior; confidence: medium; commits: bb95a6601dfc; files: src/plugin-sdk/persistent-dedupe.ts, extensions/feishu/src/dedup.ts, extensions/feishu/src/monitor.account.ts)

Remaining risk / open question:

  • This is a public SDK/storage contract change, so API shape, generated baselines, docs, and backward compatibility need maintainer review before merge.
  • The proposed helper writes plugin-provided JSON under lifecycle state; review namespace/path validation, permissions, lock behavior, and corrupt/higher-version quarantine before landing.
  • The PR intentionally does not migrate consumers; Slack, Discord, MS Teams, Matrix, and dedupe follow-ups should remain separate and preserve owner-specific semantics.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7ddd815e469e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation maintainer Maintainer-authored PR scripts Repository scripts size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant