fix #88009: [Feature]: batched memory embedding should batch over files#89138
Conversation
|
Codex review: needs real behavior proof before merge. Reviewed June 9, 2026, 6:38 AM ET / 10:38 UTC. Summary PR surface: Source +658, Tests +658, Docs +10. Total +1326 across 13 files. Reproducibility: Do we have a high-confidence way to reproduce the issue? Source inspection and reporter logs show current main batches provider work at file boundaries, but I did not run the CLI path in this read-only review. Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security Review findings
Review detailsBest possible solution: Land the source-wide opt-in approach only after the provider contract, docs, and current-head proof all describe the real staging and upload-cap behavior that maintainers want to support. Do we have a high-confidence way to reproduce the issue? Do we have a high-confidence way to reproduce the issue? Source inspection and reporter logs show current main batches provider work at file boundaries, but I did not run the CLI path in this read-only review. Is this the best way to solve the issue? Is this the best way to solve the issue? The explicit provider opt-in is a plausible and well-scoped architecture, but the merge-ready solution needs the source-wide contract and current-head proof aligned with the new session staging and upload-cap behavior. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 2f02bbcbb3ef. Label changesLabel justifications:
Evidence reviewedPR surface: Source +658, Tests +658, Docs +10. Total +1326 across 13 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
@hartmark This is the focused respin for #88009. Could you try this branch against your OpenAI-compatible batch provider when you have a chance? The expected verbose signal is that dirty memory files and sessions are prepared together and submitted as one source-wide provider batch until the file/request caps require a split, for example: This version also keeps oversized-file chunks in the same provider batch, reports OpenAI I also ran the reporter path |
|
Cool, l'll take a look as soon as possible. Thanks for implementing this annoying blocker. |
|
Still no full reindex. memory status says batch is disabled, that's strange Running full reindex. The output hangs on this output for several minutes and then prints out: Then when running memory status yields this response: Note the message that the index metadata is missing and indexing is paused until rebuilt |
|
@clawsweeper re-review Current head The new proof covers the reporter's latest failure mode directly: two consecutive forced reindexes on the same memory source both report The Real behavior proof workflow is passing again: https://github.com/openclaw/openclaw/actions/runs/27048357559/job/79838839784 |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@mushuiyu886 I can do the --force reindex and openclaw memory search foo works now. Only thing now is that --force should do a complete rerun of the embeddings, or is it just that the index is rebuilt using the previously saved embeddings? |
|
@hartmark Yes — During the safe reindex path, the existing embedding cache is carried forward, so unchanged chunks with the same provider/model/cache key and content hash can reuse previously saved embeddings. Changed, missing, or cache-mismatched chunks are embedded again. So your current result is expected: |
Cool, let's hope we get this merged soon |
|
@clawsweeper re-review |
|
@jalehman When will this PR be merged? I noticed that many new commits have been added. Is there anyone else who needs to be involved in the merge process? |
|
Merged via squash.
Thanks @mushuiyu886! |
Summary
Fixes #88009.
Rebase / CI unblocker status
mainnow carries the five hard Knip allowlist entries inscripts/deadcode-unused-files.allowlist.mjswith an explanatory SQLite scaffold comment.main, this PR no longer changesscripts/deadcode-unused-files.allowlist.mjsrelative to the base branch.Real behavior proof
/media/vdc/code/ai/aispace/openclaw-worktrees/issue-88009, isolatedOPENCLAW_HOMEandOPENCLAW_STATE_DIRunder/media/vdc/code/ai/aispace/openclaw-issue-88009-evidence/real-behavior-proof, local OpenAI-compatible batch endpoint on127.0.0.1, provideropenai, modeltext-embedding-3-small,remote.batch.enabled=true,remote.batch.wait=true,remote.batch.timeoutMinutes=1440,fallback=none.openclaw memory index --force --verbosepath submitted one OpenAI-compatible provider batch job for 1328 total chunks from 833 memory files and 495 session transcripts. The CLI log showssource=memory+sessions,reason=end, no inline embedding requests, progress counts during polling, retry of a 429 status check, and increasing waits from 20ms to 40ms to 80ms. The reporter'sgateway:watch:rawstartup path also reachedhttp server listeningandgateway readyunder an isolated local gateway config, then shut down cleanly after the bounded watch proof timeout. The byte-cap follow-up confirms shared embedding batch grouping splits by serialized JSONL UTF-8 bytes, the OpenAI-compatible upload path sends multiple under-cap JSONL files through the configured fetch implementation, and custom-id embedding order is preserved. The current-head safe-force follow-up also verifies the reporter's latest failure mode directly: two consecutivememory index --forceruns on the same memory source keep the rebuilt SQLite index clean, preserve index metadata, and do not leave vector recall paused pending a rebuild./v1/files,/v1/batches,/v1/batches/:id, and/v1/files/:id/contentbehavior needed for this indexing path. The byte-cap follow-up did not hit a hosted provider's real payload cap; it exercised the production upload runner with a local fake fetch implementation and smallmaxJsonlBytesoverride. The current-head safe-force follow-up used the local source CLI with a provider-none/FTS-only memory index to isolate the metadata identity path; it did not exercise hosted embeddings for that specific metadata regression.extensions/memory-core/src/memory/index.test.ts,extensions/openai/embedding-batch.test.ts,extensions/openai/memory-embedding-adapter.test.ts, andsrc/plugins/contracts/memory-embedding-provider.contract.test.ts.Regression Test Plan
git diff --checkwere run for the changed memory-core, OpenAI, provider runtime, and provider contract files.gateway:watch:rawwas run with isolated local gateway state to verify the startup path reachesgateway readywithout the prior dynamic-import/doctor-repair launch failure.Root Cause
batchEmbedindependently, so a workload with many small files produced many small provider batch jobs. The OpenAI batch poll loop also treated transient status failures as terminal and used a fixed poll interval, which made long-running provider jobs noisy and brittle.