fix(memory): retry reindex on socket errors#76311
Conversation
|
Codex review: passed. Summary Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Next step before merge Security Review detailsBest possible solution: Land this narrowed classifier/test/changelog fix after exact-head gates pass, while keeping broader reindex resilience tracked in the related follow-ups. Do we have a high-confidence way to reproduce the issue? Yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Is this the best way to solve the issue? Yes for the narrowed scope. Extending the existing classifier with focused coverage is the smallest maintainable fix, while split/resume behavior belongs in separate work. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 31ed93ff58db. |
bbd6976 to
7c5c7cc
Compare
9246efa to
6828ff4
Compare
b101558 to
f0574ce
Compare
|
Thanks for the review — both findings actioned in
AC rerun:
Please re-review when convenient. |
c930f53 to
1f7b38a
Compare
|
@clawsweeper re-review |
|
🦞🦞 I asked ClawSweeper to review this item again. |
|
@clawsweeper automerge |
|
🦞🦞 Source: What merged:
Automerge notes:
The automerge loop is complete. Automerge progress:
|
…#44166) Broaden the embedding reindex retry classifier to include transient socket-layer errors ('fetch failed', 'ECONNRESET', 'socket hang up', 'UND_ERR_*', 'closed') so memory reindex survives provider network hiccups instead of aborting mid-run. Previously the retry classifier only matched rate-limit/429/5xx/ cloudflare/token-quota wording; transport-layer failures got treated as fatal and aborted the whole reindex. Narrowed per clawsweeper review (gpt-5.5): this PR now ships retry-only. The archived .jsonl.reset/.deleted transcript indexing regression (buildSessionEntry empty-content + session-hit visibility stem filter) needs its own rebase onto the upstream session-files.ts API shift and will land as a separate PR. The skipped archive regression guard has been removed to keep scope matching the diff. Changes: * extensions/memory-core/src/memory/manager-embedding-policy.ts Add socket-layer error classes to the embedding retry whitelist. * extensions/memory-core/src/memory/manager-embedding-policy.test.ts New cases asserting socket errors are retried. * CHANGELOG.md Unreleased Fixes entry for the retry broadening. Related openclaw#56815, openclaw#44166.
1f7b38a to
b4618c4
Compare
…ive hits during session visibility filtering
Two coupled regressions together made post-reset memory_search return
empty on archived session content, even when the live session was
visible to the requester.
1. `buildSessionEntry` short-circuits every archive file (`.reset`,
`.deleted`, `.bak`) to `{ content: '', lineMap: [] }` via
`shouldSkipTranscriptFileForDreaming`. That is correct for opaque
backups and compaction checkpoints, but it also zeros out the two
usage-counted archive kinds (`.jsonl.reset.<iso>` and
`.jsonl.deleted.<iso>`) that are meant to be searchable as the
rotated copy of a real session. No chunks are ever produced from
them, so `memory_search` has nothing to return.
2. `extractTranscriptStemFromSessionsMemoryHit` recognises only
`.jsonl` and `.md` suffixes. Even if archive content is indexed,
hits land on `sessions/<stem>.jsonl.reset.<iso>` paths that the
stem extractor resolves to `null`, so
`filterMemorySearchHitsBySessionVisibility` drops them before the
guard check runs.
Fix
- `shouldSkipTranscriptFileForDreaming` now distinguishes
usage-counted archives from opaque ones: compaction checkpoints and
non-usage-counted archives (`.bak`, legacy store backups) stay
skipped; `.jsonl.reset.<iso>` and `.jsonl.deleted.<iso>` fall through
to the normal JSONL reader and produce searchable content,
`lineMap`, and timestamps like any other session transcript.
- `extractTranscriptStemFromSessionsMemoryHit` learns the archive
suffix shape (`\.jsonl\.(?:reset|deleted)\.<iso>$`) and maps those
hit paths back to the same live `<stem>`. The existing visibility
guard (`resolveTranscriptStemToSessionKeys` →
`createSessionVisibilityGuard`) is then authoritative and gates the
archive hit against the same session entry as a live `.jsonl` hit.
Security / scope
- No change to the authoritative visibility guard. An archive hit for
a session the requester cannot see is still dropped, just further
downstream than before (at the guard instead of silently at stem
extraction / empty-content). `.bak` and compaction checkpoints still
produce no searchable content.
Tests
- `packages/memory-host-sdk/src/host/session-files.test.ts` now
asserts that reset/deleted archives produce real content and
`lineMap`, while `.bak` and compaction checkpoints stay empty.
- `src/plugin-sdk/session-transcript-hit.test.ts` covers both
archive suffixes with and without the `sessions/` prefix, plus a
negative case rejecting unrelated `.jsonl.backup.` shaped suffixes.
Refs openclaw#56131. Follow-up to openclaw#76311 (socket retry).
AI-assisted: Prepared with Claude (Opus 4.7). Validated locally with
`pnpm check` (typecheck/lint/guards EXIT 0) and the exact vitest AC
set from the earlier Codex review (session-transcript-hit,
session-files, session-search-visibility, memory/index) all green.
… delta threshold for usage-counted archives so new reset/deleted artifacts index incrementally
Two coupled gaps together meant post-reset `.jsonl.reset.<iso>` /
`.jsonl.deleted.<iso>` archives never entered `chunks` on the
incremental memory sync path, only on a manual
`openclaw memory index --force` full reindex.
1. `archiveFileOnDisk` renamed the live `.jsonl` onto disk but never
emitted a `sessionTranscriptUpdate` event, so the memory sync's
`ensureSessionListener` never learned the archived path existed.
Every other in-process session-file mutation already goes through
`emitSessionTranscriptUpdate` (appendMessage, compaction,
tool-result rewrite, chat transcript inject, command execution,
pi-embedded tool-result truncation, `config/sessions/transcript`
append/truncate plumbing, session tool-result guard); archive was
the sole emit gap.
2. Even with the emit in place, `processSessionDeltaBatch` would
forward the event to `updateSessionDelta` and gate it behind the
`deltaBytes` / `deltaMessages` thresholds (defaults
`agents.defaults.memorySearch.sync.sessions.deltaBytes: 100000` and
`deltaMessages: 50`). Those thresholds are designed for live
transcripts accumulating appended messages; a one-shot archive
rename below the threshold would never be marked dirty and never
trigger a session-delta sync, so the archive would still not
reindex.
Fix
- `src/gateway/session-transcript-files.fs.ts::archiveFileOnDisk`
emits `emitSessionTranscriptUpdate({ sessionFile: archived })` after
`fs.renameSync`. Same pattern as the eight existing emit points.
- `extensions/memory-core/src/memory/manager-sync-ops.ts::`
`processSessionDeltaBatch` now short-circuits the delta accounting
for usage-counted archive artifacts (classified by
`isSessionArchiveArtifactName` + `isUsageCountedSessionTranscriptFileName`),
marking the archive path dirty directly and triggering a session-delta
sync regardless of size. `.jsonl.bak.<iso>` is explicitly NOT treated
as usage-counted and therefore does not bypass \u2014 it stays opaque to
the indexer the same way `buildSessionEntry` already skips it.
- `packages/memory-host-sdk/src/engine-qmd.ts` re-exports the two
classification helpers so memory-core can consume them via the
existing plugin-sdk boundary.
Scope
- No change to indexing policy for live transcripts, the chokidar
watch set, force-reindex behavior, visibility filtering, the
downstream `onSessionTranscriptUpdate` contract, or the archive
`.bak`/compaction-checkpoint skip list. The authoritative visibility
guard remains authoritative.
Tests
- `src/gateway/session-transcript-files.fs.archive-events.test.ts`
asserts archive emit for reset / deleted / bak.
- `extensions/memory-core/src/memory/manager-sync-ops.archive-delta-bypass.test.ts`
locks in the classification invariants the bypass depends on
(reset+deleted bypass, bak explicitly does not, live `.jsonl`
explicitly does not, compaction checkpoint does not), plus an
end-to-end verification that `emitSessionTranscriptUpdate` delivers
the archived path verbatim through the transcript-events bus.
Refs openclaw#56131. Follow-up to openclaw#76311 (socket retry, merged) and openclaw#76452
(archive content indexing + visibility stem). Together with those two
PRs this closes the archive-path memory indexing story: archive files
are read (`openclaw#76452`), mapped back to the live transcript stem
(`openclaw#76452`), the archive event is emitted (this PR), and the memory
sync incremental path now actually reindexes the archive instead of
dropping the event at the delta gate (this PR).
AI-assisted: Prepared with Claude (Opus 4.7). Validated locally with
`pnpm check` EXIT 0 and the two new test files (6 + 8 passed across
all three gateway test environments + extension-memory environment).
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c4. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c4 Review: openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c4. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c4 Review: openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1 Review: openclaw/openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1 Review: openclaw/openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c45324ad00e15ce8488f0c3aff9bf6d0ce1 Review: openclaw/openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c4. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c4 Review: openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c4. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c4 Review: openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Summary: - The PR broadens memory-core's embedding retry classifier for socket/network errors, adds focused classifier tests, and adds an Unreleased changelog fix. - Reproducibility: yes. Current main's retry classifier rejects the socket/fetch samples, and the reindex embedding path delegates to that classifier; a read-only regex probe confirmed the PR accepts the target samples. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(memory): retry reindex on socket errors Validation: - ClawSweeper review passed for head b4618c4. - Required merge gates passed before the squash merge. Prepared head SHA: b4618c4 Review: openclaw#76311 (comment) Co-authored-by: buyitsydney <buyitsydney@users.noreply.github.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Narrowed scope per clawsweeper review.
This PR now ships retry-only: broaden the embedding reindex retry classifier in
extensions/memory-core/src/memory/manager-embedding-policy.tsto treat transient socket-layer errors (fetch failed,ECONNRESET,socket hang up,UND_ERR_*,closed) as retryable instead of fatal.Why
During reindex, transient transport-layer failures from the embedding provider currently abort the whole reindex mid-run because the retry classifier only matches rate-limit/429/5xx/cloudflare/token-quota wording. Users hit this as spurious reindex failures on flaky links.
What changed
extensions/memory-core/src/memory/manager-embedding-policy.ts— extend the socket-error whitelist inisRetryableMemoryEmbeddingError.extensions/memory-core/src/memory/manager-embedding-policy.test.ts— new cases asserting socket-class errors are retried.packages/memory-host-sdk/src/host/session-files.test.ts— unchanged regression coverage for existing archive-path dreaming-skip behavior.CHANGELOG.md— Unreleased Fixes entry.Dropped from this PR
The archived
.jsonl.reset.*/.jsonl.deleted.*transcript indexing fix (buildSessionEntryempty-content +session-transcript-hit.tsstem filter) was split out — it needs to rebase onto the upstreambuildSessionEntryopts-object signature and will ship as a separate narrow PR. The skipped archive regression guard has been removed so the test surface matches the actual fix.Validation
pnpm test extensions/memory-core/src/memory/manager-embedding-policy.test.ts— greenpnpm test packages/memory-host-sdk/src/host/session-files.test.ts— green (archive guard removed)b101558ce662; rerunning onf0574ce305.Related #56815, #44166. (No longer closes #56131 — that fix lands in a follow-up PR.)
AI-assisted: prepared with Claude Opus 4.7.