✨ feat(kb): extend BM25 search to file-backed documents by Innei · Pull Request #15247 · lobehub/lobehub

Innei · 2026-05-26T17:03:20Z

💻 Change Type

✨ feat

🔗 Related Issue

N/A

🔀 Description of Change

searchKnowledgeBaseDocuments (the BM25 leg of the knowledge-base hybrid retrieval) previously hard-filtered on documents.fileType === 'custom/document', so only inline KB pages were searchable. Parsed PDFs and other file-backed documents already had their content sitting in the documents table — and were already covered by documents_bm25_idx — but the query never matched them. They could only be retrieved via vector search on chunks.

This PR extends BM25 to cover file-backed documents by running two scoped ParadeDB queries in parallel and merging by score in JS:

- inline path: `documents.knowledge_base_id IN (...)`
- file-backed path: `INNER JOIN knowledge_base_files ON file_id` with `kbf.knowledge_base_id IN (...)`

A single OR-ed predicate trips ParadeDB's Unsupported query shape because paradedb.score() requires a conjunctive tantivy scan; splitting into two queries keeps each branch eligible for index-scan scoring. Folder rows are excluded (no content). The de-duped, score-sorted result set is then sliced to the original limit.
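The merge step described above could look roughly like the sketch below. It is not the PR's actual code: the `Bm25Hit` shape, the `mergeHits` and `searchBoth` names, and the two query callbacks are hypothetical stand-ins, and it assumes de-duplication keeps the higher-scoring row per document id.

```typescript
// Hypothetical hit shape; only `fileId` is named in the PR description.
interface Bm25Hit {
  id: string;
  title: string;
  score: number;
  fileId?: string;
}

// Combine both result sets, keep the best-scoring row per document id,
// sort by descending score, then cut to the caller's limit.
function mergeHits(inline: Bm25Hit[], fileBacked: Bm25Hit[], limit: number): Bm25Hit[] {
  const best = new Map<string, Bm25Hit>();
  for (const hit of inline.concat(fileBacked)) {
    const seen = best.get(hit.id);
    if (!seen || hit.score > seen.score) best.set(hit.id, hit);
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// The callbacks stand in for the two scoped ParadeDB queries (assumed names).
async function searchBoth(
  runInline: () => Promise<Bm25Hit[]>,
  runFileBacked: () => Promise<Bm25Hit[]>,
  limit: number,
): Promise<Bm25Hit[]> {
  // Run both branches concurrently so each stays a simple conjunctive query.
  const [inline, fileBacked] = await Promise.all([runInline(), runFileBacked()]);
  return mergeHits(inline, fileBacked, limit);
}
```

Sorting after the merge matters because each branch is only ordered within itself; slicing last keeps the original `limit` contract regardless of how the hits split between the two paths.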

Hits now carry an optional `fileId`; downstream:

- `KnowledgeBaseDocumentResult` (builtin-tool-knowledge-base): mirror type updated
- `formatSearchResults`: exposes `fileId="..."` on the `<document>` XML tag so the agent can call `readKnowledge` with either `docs_*` or `file_*` ids
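As an illustration of the formatter change, here is a minimal sketch of emitting the optional attribute. The `DocumentHit` shape, the `formatDocumentTag` helper, and the `id` / `title` attributes are assumptions made for this example; the real `formatSearchResults` in `packages/prompts` may structure its output differently.

```typescript
// Hypothetical hit shape for illustration only.
interface DocumentHit {
  id: string;
  title: string;
  content: string;
  fileId?: string;
}

// Escape characters that would break an XML attribute or text node.
const escapeXml = (value: string): string =>
  value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// Build one <document> tag; fileId is only emitted when the hit has one.
function formatDocumentTag(hit: DocumentHit): string {
  const attrs = [`id="${escapeXml(hit.id)}"`, `title="${escapeXml(hit.title)}"`];
  if (hit.fileId) attrs.push(`fileId="${escapeXml(hit.fileId)}"`);
  return `<document ${attrs.join(' ')}>${escapeXml(hit.content)}</document>`;
}
```

Omitting the attribute entirely for inline pages, rather than emitting an empty `fileId=""`, keeps the output for existing inline hits unchanged.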

No schema migration — the existing documents_bm25_idx already covers content/title/slug across all file types.

🧪 How to Test

Tested locally
Added/updated tests
No tests needed

Test plan:

- `packages/database`: added a `searchKnowledgeBaseDocuments > file-backed documents (PDF / parsed files)` sub-suite covering a PDF hit via the `knowledge_base_files` join, mixed inline + file-backed hits in the same call, folder exclusion, and cross-user isolation. 58/58 tests pass against ParadeDB (`DATABASE_TEST_URL=postgresql://lobe:lobe123@localhost:13333/lobechat_test TEST_SERVER_DB=1 bunx vitest run --config vitest.config.server.mts src/repositories/search/index.test.ts`)
- `packages/prompts`: added a `fileId` attribute assertion to the `formatSearchResults` tests (14/14 pass)
- `src/server/services/knowledgeBase`: existing consumer suite (13/13) still passes
- ESLint clean on all modified files; no new TS errors

📝 Additional Information

BM25 hits are document-grained, vector hits are chunk-grained — this PR doesn't change that asymmetry. Chunk-level BM25 (index on chunks.text) is a separate, larger follow-up if finer relevance is needed.
The fileId field is optional and additive; existing consumers that don't read it continue to work unchanged.
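To make the "optional and additive" point concrete, the sketch below shows an older consumer that ignores `fileId` alongside a newer one that uses it. All names here (`LegacyHit`, `Hit`, `listTitles`, `pickReadableId`) are hypothetical, and preferring `fileId` over the document id when both exist is an assumption of this example, not something the PR states.

```typescript
// What an existing consumer already relied on (assumed fields).
interface LegacyHit {
  id: string;
  title: string;
}

// The extended hit: same fields plus the optional fileId.
interface Hit extends LegacyHit {
  fileId?: string;
}

// An older consumer typed against LegacyHit keeps compiling and behaving the same.
function listTitles(hits: LegacyHit[]): string[] {
  return hits.map((hit) => hit.title);
}

// A newer consumer can choose which id to pass on (assumption: prefer fileId).
function pickReadableId(hit: Hit): string {
  return hit.fileId ?? hit.id;
}
```

Because `Hit` is structurally a superset of `LegacyHit`, an array of new hits can be passed to `listTitles` without any change on the consumer side.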

`searchKnowledgeBaseDocuments` only matched inline `custom/document` pages, so parsed PDFs and other file-backed documents never surfaced via the BM25 path — vector search was the sole way to retrieve them. Run two scoped ParadeDB queries in parallel (inline via `documents.knowledge_base_id`, file-backed via a `knowledge_base_files` join) and merge by score in JS. A single OR-ed predicate trips ParadeDB's `Unsupported query shape` because `paradedb.score()` requires a conjunctive tantivy scan. Folder rows are excluded; hits now carry an optional `fileId` so the agent can read with either `docs_*` or `file_*` ids. The XML formatter exposes the new attribute downstream.

vercel · 2026-05-26T17:03:26Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| lobehub | Ready | Preview, Comment | May 26, 2026 5:08pm |

sourcery-ai

Sorry @Innei, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

codecov · 2026-05-26T17:08:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.81%. Comparing base (72d3404) to head (a646fb3).

Additional details and impacted files

@@            Coverage Diff             @@
##           canary   #15247      +/-   ##
==========================================
- Coverage   70.81%   70.81%   -0.01%     
==========================================
  Files        3154     3154              
  Lines      315154   315155       +1     
  Branches    34357    34358       +1     
==========================================
- Hits       223188   223187       -1     
- Misses      91797    91799       +2     
  Partials      169      169

| Flag | Coverage Δ | |
| --- | --- | --- |
| app | 61.66% <ø> (-0.01%) | ⬇️ |
| database | 92.22% <ø> (ø) | |
| packages/agent-runtime | 80.48% <ø> (ø) | |
| packages/builtin-tool-lobe-agent | 19.87% <ø> (ø) | |
| packages/context-engine | 84.13% <ø> (ø) | |
| packages/conversation-flow | 91.28% <ø> (ø) | |
| packages/file-loaders | 87.89% <ø> (ø) | |
| packages/memory-user-memory | 74.99% <ø> (ø) | |
| packages/model-bank | 99.99% <ø> (ø) | |
| packages/model-runtime | 83.90% <ø> (ø) | |
| packages/prompts | 72.68% <100.00%> (+<0.01%) | ⬆️ |
| packages/python-interpreter | 92.90% <ø> (ø) | |
| packages/ssrf-safe-fetch | 0.00% <ø> (ø) | |
| packages/types | 35.07% <ø> (ø) | |
| packages/utils | 88.47% <ø> (ø) | |
| packages/web-crawler | 88.08% <ø> (ø) | |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ | |
| --- | --- | --- |
| Store | 67.68% <ø> (ø) | |
| Services | 54.64% <ø> (ø) | |
| Server | 72.22% <ø> (-0.01%) | ⬇️ |
| Libs | 56.97% <ø> (ø) | |
| Utils | 86.06% <ø> (ø) | |


@hezhijie0327

# 🚀 LobeHub Release (20260604)

**Release Date:** June 4, 2026
**Since v2.2.1:** 88 merged PRs · 11 contributors

> This week brings Execution Devices out of the lab (run agents and Claude Code on any configured local or remote machine), alongside Claude Opus 4.8, token-usage analytics, and Page sharing.

---

## ✨ Highlights

- **Execution Devices**: Pick where an agent runs. Desktop and CLI devices auto-register with a stable machine ID, route through the gateway by channel, and surface a device switcher in the chat input. Run remote Claude Code on a configured device, with a recent-directory picker you can drag to reorder. (#15300, #15315, #15322, #15343, #15351, #15371)
- **Claude Opus 4.8**: Day-one support for Anthropic's latest model. (#15314)
- **Token-usage analytics**: A new token-usage mode on the activity heatmap, backed by a denormalized topic usage/cost rollup so totals stay accurate without recomputing from messages. (#15365, #15417, #15425)
- **Page sharing**: Share a Page through a dedicated document share flow, plus new Workspace and Agent share tables. (#15309, #15439)
- **Self-iteration agents**: Agent Signal's execAgent migration lands a server-runtime bridge, async memory writer, and a registered self-iteration tool package, with a CLI trigger command for testing. (#15360, #15364, #15392)
- **Knowledge search**: BM25 search now extends to file-backed documents, and the portal ships an editable CodeMirror viewer for local files with document highlighting. (#15247, #15298)

---

## 🏗️ Core Agent & Architecture

### Agent Signal & Runtime

- **execAgent migration**: Server-runtime bridge, completion projection, async memory writer, and removal of the legacy `executeSelfIteration` path. (#15392)
- Registered the self-iteration builtin tool package and restored the three mode-specific self-iteration agent slugs. (#15202, #15364)
- Added a CLI trigger command with a golden-snapshot fixture for Agent Signal. (#15360)
- **Skill priority**: Agent Builder now emits a skill-priority instruction with matching server runtime. (#15409)
- Retry empty LLM completions instead of silently finishing the turn. (#15355)
- Classify topic/agent/session foreign-key violations as `ConversationParentMissing` for clearer recovery. (#15408)
- Persist canonical nested usage/performance on assistant messages, and re-link orphan tool messages at the raw bucket write boundary. (#15359, #15438)
- Guard `createAgent` against LLM double-encoded array fields. (#15381)

---

## 🖥️ Execution Devices & Gateway

- Auto-register desktop and CLI devices with a stable machine ID, and add the `@lobechat/device-identity` package. (#15300, #15321)
- New Devices settings page behind the Execution Device Switcher lab, with a device switcher shown for all agents in the chat input. (#15315, #15371)
- `connectionId` + channel routing across the gateway client and device list; preset the local device on the first LLM request for the 本机 (local machine) target. (#15322, #15435)
- Run remote Claude Code on a configured device, with drag-to-reorder recent-directory management and client renders for device tool results. (#15343, #15351, #15437)
- Preserve content and state across gateway tool calls, and prevent duplicate streaming from stale reconnects. (#15114, #15354)

---

## 🖥️ CLI & Desktop

- Preserve content/state for connect local file and shell tools; render the `runCommand` tool result card. (#15441, #15442)
- New `lh topic view` command; CLI now auto-registers its device on login, matching desktop. (#15340, #15377)
- Resolve CLI tools from the shell `PATH`, and clarify local command session handling. (#15368, #15389)
- Relocate visual-ref helpers to `@lobechat/const` to fix a renderer crash; upload `.blockmap` files to S3 for differential updates. (#15326, #15369)
- Fix a market OAuth expiry that triggered the wrong re-login modal, and kill dev child processes on parent shutdown. (#15246, #15290)

---

## 🗂️ Pages, Library & Knowledge

- Document share flow with business slot stubs, plus Workspace and Agent share tables. (#15309, #15439)
- Export Agent profiles as Markdown, preserving an empty agent prompt on export. (#15312, #15316)
- Editable CodeMirror viewer for local files with document highlighting; BM25 search extended to file-backed documents. (#15247, #15298)
- Default new Agent-doc files to `.md` and preserve IME composition; refresh folder data on slug switch and dedupe breadcrumb fetches. (#15335, #15427)

---

## 💬 Chat & User Experience

- Group-by-status mode for the Topic sidebar; dropped the legacy session→agentId compatibility path from Topic queries. (#15366, #15378)
- Restore editor focus after the file picker closes, and close the skill dropdown before navigating to settings. (#15391, #15394)
- Strip markdown tokens from fallback Topic titles; keep an open ActionBar popup when hovering another message. (#15303, #15372)
- Stabilize home starter loading and stop transliterating model names in the home starter; show artifact source while streaming. (#15310, #15324, #15386)
- Group the sidebar spacer with recents and agents. (#15373)

---

## 📊 Analytics, Tasks & Notifications

- Token-usage mode on the activity heatmap, backed by a denormalized topic usage/cost rollup. (#15365, #15417, #15425)
- Push: new `PushChannel`, receipt cron, and `pushToken` tRPC API. (#15233)
- Tasks now support file and image attachments. (#15141)

---

## 🧩 Models & Providers

- Support Claude Opus 4.8 and configurable model routing with starters. (#15314, #15384)
- MiniMax M3: new model entry and an Anthropic video runtime. (#15380, #15403)
- Add `intern-s2-preview` with `thinking_mode`, and `step-3.7-flash` support. (#15308, #15317)
- Block disabling the official provider; fix default provider setup in business mode. (#15379, #15382)

---

## 🎨 UI & Modals

- Migrate modals to `@lobehub/ui/base-ui` (LOBE-9711 + eval batch), including the create-custom-model and feedback/changelog modals. (#15401, #15416)
- Restructure confirmModal title and content across deletion flows; polish the service-model form and migrate its Switch to base-ui. (#15426, #15440)
- Wrap the BlueBubbles bridge config into a connection card; update `@lobehub/ui` to v5.15.5. (#15325, #15342)

---

## 🔒 Reliability

- Replace hardcoded `session_context` values with template variables in credentials. (#15352)
- Point `CHANGELOG_URL` to `/changelog`. (#15428)

---

## 👥 Contributors

Huge thanks to **11 contributors** who shipped **88 merged PRs** this cycle.

@hezhijie0327 · @qybaihe · @sxjeru · @arvinxx · @Innei · @tjx666 · @lijian · @sudongyuer · @cy948 · @rivertwilight · @AmAzing129

Plus @lobehubbot and renovate[bot] for maintenance.

---

**Full Changelog**: v2.2.1...release/weekly-20260604

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 26, 2026

sourcery-ai Bot reviewed May 26, 2026

dosubot Bot added the feature:knowledge-base knowledge base / RAG / file chunk label May 26, 2026

vercel Bot deployed to Preview May 26, 2026 17:08 View deployment

Innei merged commit 65113ca into canary May 28, 2026
47 of 48 checks passed

Innei deleted the feat/kb-bm25-files branch May 28, 2026 17:01

arvinxx mentioned this pull request Jun 3, 2026

🚀 release: 20260604 #15447

Merged
