fix: embed stale chunks by page identity #626
322a6eb to 116d780
|
Rebased onto upstream master. Targeted verification after rebase:
Production/AStack evidence before this PR was opened: stale embedding on duplicate slugs left chunks unembedded because the stale path grouped by global slug while the schema is unique by |
116d780 to 2ebb917
|
Rebased onto upstream master |
|
Hi, in embedAllStale, when a stale row has no page_id, the code groups by slug only. That can still merge chunks from different sources that share a slug and recreate duplicate chunk_index rows in a single upsert group.
Severity: action required | Category: correctness
How to fix: Key fallback by source+slug
Agent prompt to fix - you can give this to your LLM of choice:
Spotted by Qodo code review - free for open-source projects. |
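The fallback keying suggested in this review can be sketched as follows. This is a minimal illustration, not the code in this PR: the `StaleChunkRow` shape, the field types, and the helper names `staleGroupKey` and `groupStaleRows` are hypothetical; only `page_id`, `source_id`, `slug`, and `chunk_index` come from the discussion.

```typescript
// Hypothetical row shape for illustration; the field types are assumptions.
interface StaleChunkRow {
  page_id: number | null;
  source_id: string;
  slug: string;
  chunk_index: number;
}

// Prefer page identity. When page_id is missing, fall back to
// source_id + slug, never to slug alone.
function staleGroupKey(row: StaleChunkRow): string {
  if (row.page_id !== null) {
    return `page:${row.page_id}`;
  }
  // JSON.stringify keeps the two parts unambiguous even if either
  // value contains a character that a plain separator would collide with.
  return `fallback:${JSON.stringify([row.source_id, row.slug])}`;
}

// Bucket stale rows so that each bucket belongs to exactly one page.
function groupStaleRows(rows: StaleChunkRow[]): Map<string, StaleChunkRow[]> {
  const groups = new Map<string, StaleChunkRow[]>();
  for (const row of rows) {
    const key = staleGroupKey(row);
    const group = groups.get(key);
    if (group) {
      group.push(row);
    } else {
      groups.set(key, [row]);
    }
  }
  return groups;
}
```

With this keying, two rows that share a slug but come from different sources land in different groups even when neither has a page_id, so a single upsert group should not see the same chunk_index twice.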
|
Live downstream evidence from AStack/OpenClaw dogfood cut on 2026-05-05:
This PR remains one of the upstream-clean blockers for AStack FULL PASS; current dogfood status is intentionally custom/non-upstream-clean until merged and consumed. |
|
Addressed the Qodo fallback grouping review. New head on Change:
Verification:
|
|
2026-05-06 maintainer merge packet / downstream readiness refresh after addressing Qodo:
Runtime note: live AStack |
|
Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on. We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs). Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏 |
Summary
Why
pages enforces uniqueness on source_id plus slug, not on slug globally. The stale embed fast path grouped rows by slug alone; when the same slug exists in multiple sources, a single group can contain duplicate chunk_index rows, and the upsert fails with the Postgres error "ON CONFLICT DO UPDATE command cannot affect row a second time", leaving stale chunks unembedded.
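The collision described above can be reproduced without a database. The sketch below is illustrative only: `ChunkRef`, `groupsWithDuplicateChunkIndex`, `bySlug`, and `bySourceAndSlug` are hypothetical names, and the check approximates the condition Postgres rejects (the same conflict key proposed twice in one statement) by looking for a repeated chunk_index inside a group.

```typescript
// Hypothetical minimal shape: just the fields the grouping depends on.
interface ChunkRef {
  source_id: string;
  slug: string;
  chunk_index: number;
}

// Return the keys of every group in which some chunk_index occurs more
// than once, i.e. groups whose single upsert would touch a row twice.
function groupsWithDuplicateChunkIndex(
  rows: ChunkRef[],
  keyOf: (row: ChunkRef) => string
): string[] {
  const seen = new Map<string, Set<number>>();
  const offending = new Set<string>();
  for (const row of rows) {
    const key = keyOf(row);
    let indexes = seen.get(key);
    if (!indexes) {
      indexes = new Set<number>();
      seen.set(key, indexes);
    }
    if (indexes.has(row.chunk_index)) {
      offending.add(key);
    }
    indexes.add(row.chunk_index);
  }
  return Array.from(offending);
}

// Old behaviour: group by slug alone.
const bySlug = (row: ChunkRef): string => row.slug;

// Grouping that matches the uniqueness rule: source_id plus slug.
const bySourceAndSlug = (row: ChunkRef): string =>
  JSON.stringify([row.source_id, row.slug]);
```

Given two sources that each hold a page with slug "readme" and a chunk at index 0, `bySlug` reports one offending group while `bySourceAndSlug` reports none.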
Verification