Skip to content

fix(memory-core): stream embedding cache seed during reindex#73067

Closed
parkertoddbrooks wants to merge 1 commit into
openclaw:mainfrom
wipcomputer:kody/upstream-memory-core-seed-stream
Closed

fix(memory-core): stream embedding cache seed during reindex#73067
parkertoddbrooks wants to merge 1 commit into
openclaw:mainfrom
wipcomputer:kody/upstream-memory-core-seed-stream

Conversation

@parkertoddbrooks

Copy link
Copy Markdown
Contributor

Summary

  • stream embedding_cache rows during safe memory-core reindex instead of materializing the full table with .all()
  • yield to the event loop every 1000 seeded rows so gateway readiness probes stay responsive during large cache copies
  • add a regression test that fails if the seed-copy query falls back to .all()

Why

Large local embedding_cache tables can exceed the V8 heap when seedEmbeddingCache() materializes every row before copying into the temp reindex database. Streaming fixes the heap spike, and cooperative yielding prevents long reindex passes from blocking gateway health/readiness checks.

Tests

  • pnpm tsgo
  • pnpm test extensions/memory-core/src/memory/index.test.ts
  • pnpm test extensions/memory-core/src/memory/manager.sync-errors-do-not-crash.test.ts
  • pnpm test extensions/memory-core/src/memory

@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Switches seedEmbeddingCache from .all() to .iterate() to avoid materialising the full embedding_cache table into the V8 heap during a safe reindex, and adds a setImmediate yield every 1 000 rows so gateway health probes remain responsive. A regression test verifies the streaming path is taken. No blocking issues — two minor style observations are noted inline.

Confidence Score: 4/5

Safe to merge; the streaming change is correct and the transaction/error-handling paths are preserved.

Only P2 findings: a harmless empty-transaction regression from removing the early-return guard, and a test fragility note about relying on an implicit identity between manager.db and sourceDb. No logic errors or data-safety issues.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/memory-core/src/memory/manager-sync-ops.ts
Line: 329-337

Comment:
**Empty-cache path now issues an unnecessary BEGIN/COMMIT**

Removing the `if (!rows.length) { return; }` guard means `this.db.exec("BEGIN")` and `this.db.exec("COMMIT")` are always executed, even when `embedding_cache` is empty. The old guard was there precisely to skip the transaction in that case. The behaviour is functionally harmless (an empty transaction completes fine), but it is a silent regression from the original intent and adds an unnecessary write-ahead log entry on every reindex for users who have the cache empty.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/memory-core/src/memory/index.test.ts
Line: 462-475

Comment:
**Spy targets the wrong database instance**

`db` is captured as `manager.db` (the original database) _before_ the sync call. During `safeReindex`, the manager temporarily sets `this.db = tempDb`, then calls `seedEmbeddingCache(originalDb)`. Inside `seedEmbeddingCache`, the SELECT goes through `sourceDb.prepare(...)` (`sourceDb === originalDb === db`) — so the spy _does_ intercept it. However, `this.db.prepare(INSERT ...)` executes against `tempDb`, which has no spy. If the manager ever reassigns `this.db` to a different object before the spy is installed (e.g., via a concurrent reindex or hot-reload path), the spy would silently fail to intercept the SELECT and the test would pass vacuously without detecting a regression back to `.all()`. A comment explicitly asserting `sourceDb === db` would make this invariant clear and prevent silent test drift.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(memory-core): stream embedding cache..." | Re-trigger Greptile

Comment on lines 329 to 337
@@ -338,6 +335,9 @@ export abstract class MemoryManagerSyncOps {
updated_at=excluded.updated_at`,
);
this.db.exec("BEGIN");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty-cache path now issues an unnecessary BEGIN/COMMIT

Removing the if (!rows.length) { return; } guard means this.db.exec("BEGIN") and this.db.exec("COMMIT") are always executed, even when embedding_cache is empty. The old guard was there precisely to skip the transaction in that case. The behaviour is functionally harmless (an empty transaction completes fine), but it is a silent regression from the original intent and adds an unnecessary write-ahead log entry on every reindex for users who have the cache empty.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/memory-core/src/memory/manager-sync-ops.ts
Line: 329-337

Comment:
**Empty-cache path now issues an unnecessary BEGIN/COMMIT**

Removing the `if (!rows.length) { return; }` guard means `this.db.exec("BEGIN")` and `this.db.exec("COMMIT")` are always executed, even when `embedding_cache` is empty. The old guard was there precisely to skip the transaction in that case. The behaviour is functionally harmless (an empty transaction completes fine), but it is a silent regression from the original intent and adds an unnecessary write-ahead log entry on every reindex for users who have the cache empty.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +462 to +475
const db = (
manager as unknown as {
db: {
prepare: (sql: string) => unknown;
};
}
).db;
const originalPrepare = db.prepare.bind(db);
const cachedRows = (
originalPrepare(
"SELECT provider, model, provider_key, hash, embedding, dims, updated_at FROM embedding_cache",
) as StatementWithAll
).all();
expect(cachedRows.length).toBeGreaterThan(0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Spy targets the wrong database instance

db is captured as manager.db (the original database) before the sync call. During safeReindex, the manager temporarily sets this.db = tempDb, then calls seedEmbeddingCache(originalDb). Inside seedEmbeddingCache, the SELECT goes through sourceDb.prepare(...) (sourceDb === originalDb === db) — so the spy does intercept it. However, this.db.prepare(INSERT ...) executes against tempDb, which has no spy. If the manager ever reassigns this.db to a different object before the spy is installed (e.g., via a concurrent reindex or hot-reload path), the spy would silently fail to intercept the SELECT and the test would pass vacuously without detecting a regression back to .all(). A comment explicitly asserting sourceDb === db would make this invariant clear and prevent silent test drift.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/memory-core/src/memory/index.test.ts
Line: 462-475

Comment:
**Spy targets the wrong database instance**

`db` is captured as `manager.db` (the original database) _before_ the sync call. During `safeReindex`, the manager temporarily sets `this.db = tempDb`, then calls `seedEmbeddingCache(originalDb)`. Inside `seedEmbeddingCache`, the SELECT goes through `sourceDb.prepare(...)` (`sourceDb === originalDb === db`) — so the spy _does_ intercept it. However, `this.db.prepare(INSERT ...)` executes against `tempDb`, which has no spy. If the manager ever reassigns `this.db` to a different object before the spy is installed (e.g., via a concurrent reindex or hot-reload path), the spy would silently fail to intercept the SELECT and the test would pass vacuously without detecting a regression back to `.all()`. A comment explicitly asserting `sourceDb === db` would make this invariant clear and prevent silent test drift.

How can I resolve this? If you propose a fix, please make it concise.

steipete added a commit that referenced this pull request Apr 28, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes #73067

Thanks @parkertoddbrooks.
@steipete

Copy link
Copy Markdown
Contributor

Landed this via maintainer PR #73118 because GitHub rejected maintainer pushes to this fork branch despite maintainer edits being enabled.

The changelog keeps the #73067 reference and credits @parkertoddbrooks.

Thanks @parkertoddbrooks!

@steipete steipete closed this Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
- stream safe-reindex embedding-cache seeding with SQLite iterate()
- avoid no-op empty-cache transactions and keep regression coverage explicit
- supersedes openclaw#73067

Thanks @parkertoddbrooks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants