Skip to content

fix: embed stale chunks by page identity#626

Closed
ArshyaAI wants to merge 2 commits into
garrytan:masterfrom
ArshyaAI:fix/embed-stale-source-scoping
Closed

fix: embed stale chunks by page identity#626
ArshyaAI wants to merge 2 commits into
garrytan:masterfrom
ArshyaAI:fix/embed-stale-source-scoping

Conversation

@ArshyaAI

@ArshyaAI ArshyaAI commented May 4, 2026

Copy link
Copy Markdown

Summary

  • Fixes gbrain embed --stale for brains with duplicate slugs across multiple sources.
  • listStaleChunks now returns page_id/source_id, and the stale embed path groups and upserts by page identity instead of ambiguous slug.
  • Adds a regression test that reproduces duplicate-slug stale chunks without calling ambiguous slug read/write paths.

Why
pages enforces uniqueness on source_id plus slug, not global slug. The stale embed fast path grouped rows by slug; when the same slug exists in multiple sources it can merge duplicate chunk_index rows and fail with Postgres ON CONFLICT DO UPDATE command cannot affect row a second time, leaving stale chunks unembedded.

Verification

  • bun test test/embed.serial.test.ts --timeout 30000
  • bun run typecheck

View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

@ArshyaAI ArshyaAI force-pushed the fix/embed-stale-source-scoping branch from 322a6eb to 116d780 Compare May 5, 2026 00:34
@ArshyaAI

ArshyaAI commented May 5, 2026

Copy link
Copy Markdown
Author

Rebased onto upstream master 9c2dc4cd544cd8013e0eee7a6ffb8536d3c2f13a / 0.26.8 and force-with-lease pushed. New head: 116d78098457c1c83af389e48033162fd60ad270.

Targeted verification after rebase:

  • bun test test/embed.serial.test.ts --timeout 30000 -> 12 pass, 0 fail
  • bun run typecheck -> pass

Production/AStack evidence before this PR was opened: stale embedding on duplicate slugs left chunks unembedded because the stale path grouped by global slug while the schema is unique by (source_id, slug). Runtime cherry-pick embedded the missing chunks and embed --stale --dry-run then returned 0 stale found.

@ArshyaAI ArshyaAI force-pushed the fix/embed-stale-source-scoping branch from 116d780 to 2ebb917 Compare May 5, 2026 05:38
@ArshyaAI

ArshyaAI commented May 5, 2026

Copy link
Copy Markdown
Author

Rebased onto upstream master ee9ceb327a39b0c705ee945c6cfe821de11d34ed / 0.27.0 and force-with-lease pushed. New head: 2ebb917cf703e73994e8fe9d599b9f7b49ed7994.\n\nTargeted verification after rebase:\n- bun test test/embed.serial.test.ts --timeout 30000 -> 12 pass, 0 fail\n- bun run typecheck -> pass\n\nControl check: upstream master still groups stale chunks by slug in embed --stale and listStaleChunks() still omits page_id/source_id, so this PR remains needed for duplicate-slug/source-scoped stale embedding correctness and for removing the downstream runtime cherry-pick.

@Qodo-Free-For-OSS

Copy link
Copy Markdown

Hi, In embedAllStale, when a stale row has no page_id, the code groups by slug only, which can still merge chunks from different sources that share a slug and recreate duplicate chunk_index rows in a single upsert group.

Severity: action required | Category: correctness

How to fix: Key fallback by source+slug

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

embedAllStale() groups stale rows by page_id when present, but falls back to grouping by slug alone when page_id is absent. Since slugs are only unique per source_id, this fallback can merge rows from different sources and re-trigger the duplicate (page_id, chunk_index) upsert conflict the PR is fixing.

Issue Context

StaleChunkRow now includes optional source_id, so even without page_id you can still safely scope a group key as (source_id, slug).

Fix Focus Areas

  • src/commands/embed.ts[363-371]
    • Change the fallback key from slug:${row.slug} to something source-scoped, e.g. src:${row.source_id ?? 'default'}:slug:${row.slug} (or a structured tuple).
    • Consider asserting/logging if both page_id and source_id are missing, since that row is inherently ambiguous.

Spotted by Qodo code review - free for open-source projects.

@ArshyaAI

ArshyaAI commented May 5, 2026

Copy link
Copy Markdown
Author

Live downstream evidence from AStack/OpenClaw dogfood cut on 2026-05-05:

  • Consumed in custom branch ArshyaAI/gbrain@de11d4c858b1a880931c008fbee3ceef92a8329a together with PRs fix: align resolver routing fixtures #619 and fix: return clean auth failures for invalid MCP bearer tokens #620.
  • Direct GBrain canary passed and covered explicit embed job plus global stale embed job.
  • Runtime health evidence after the cut: page_count=13975, chunk_count=36097, embedded_count=36097, missing_embeddings=0, embed_coverage=1.
  • Runtime gbrain doctor --json: status=healthy, health_score=100; OpenClaw-mediated, Remote MCP, local Codex, and local Claude Code canaries passed.

This PR remains one of the upstream-clean blockers for AStack FULL PASS; current dogfood status is intentionally custom/non-upstream-clean until merged and consumed.

@ArshyaAI

ArshyaAI commented May 6, 2026

Copy link
Copy Markdown
Author

Addressed the Qodo fallback grouping review. New head on fix/embed-stale-source-scoping: 5edd15a.

Change:

  • When page_id is unavailable, embedAllStale() now groups stale rows by source_id + slug instead of slug alone.
  • Added regression coverage for duplicate stale rows with the same slug from different sources and no page_id; they now produce separate groups/embed calls instead of one ambiguous slug group.

Verification:

  • bun test test/embed.serial.test.ts --timeout 30000 -> 13 pass, 0 fail
  • bun install --frozen-lockfile --ignore-scripts && bun run typecheck -> pass
  • git diff --check -> pass

@ArshyaAI

ArshyaAI commented May 6, 2026

Copy link
Copy Markdown
Author

2026-05-06 maintainer merge packet / downstream readiness refresh after addressing Qodo:

  • New PR head: 5edd15acf4d6dc8192702a277341bd18725ccedb.
  • Qodo fallback grouping concern is addressed: when page_id is unavailable, embedAllStale() groups fallback rows by source_id + slug, not slug alone.
  • Added regression coverage for duplicate stale rows with the same slug from different sources and no page_id.
  • Verification: bun test test/embed.serial.test.ts --timeout 30000 -> 13 pass, 0 fail; bun install --frozen-lockfile --ignore-scripts && bun run typecheck -> pass; git diff --check -> pass.
  • Fresh AStack gates: direct GBrain, Remote MCP/OAuth, local Codex, local Claude Code, runtime doctor, runtime risk logs, shell quota guard, and scheduler burn-in all pass on the custom cut.
  • Remaining FULL PASS blockers are now upstream-clean only: merge/consume fix: align resolver routing fixtures #619, fix: return clean auth failures for invalid MCP bearer tokens #620, fix: embed stale chunks by page identity #626, then move runtime/Docker pins from custom cut back to upstream master.

Runtime note: live AStack de11d4... still has the older fallback line, but the approved target's active Postgres/PGLite engines already return page_id and source_id from listStaleChunks(); this follow-up is for upstream completeness/future stale-row producers, not an emergency redeploy trigger.

@garrytan

garrytan commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

@garrytan garrytan closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants