✨ feat(kb): extend BM25 search to file-backed documents #15247

Merged
Innei merged 1 commit into canary from feat/kb-bm25-files
May 28, 2026

@Innei Innei commented May 26, 2026

💻 Change Type

  • ✨ feat

🔗 Related Issue

N/A

🔀 Description of Change

searchKnowledgeBaseDocuments (the BM25 leg of the knowledge-base hybrid retrieval) previously hard-filtered on documents.fileType === 'custom/document', so only inline KB pages were searchable. Parsed PDFs and other file-backed documents already had their content sitting in the documents table — and were already covered by documents_bm25_idx — but the query never matched them. They could only be retrieved via vector search on chunks.

This PR extends BM25 to cover file-backed documents by running two scoped ParadeDB queries in parallel and merging by score in JS:

  • inline path: documents.knowledge_base_id IN (...)
  • file-backed path: INNER JOIN knowledge_base_files ON file_id with kbf.knowledge_base_id IN (...)

A single OR-ed predicate trips ParadeDB's Unsupported query shape because paradedb.score() requires a conjunctive tantivy scan; splitting into two queries keeps each branch eligible for index-scan scoring. Folder rows are excluded (no content). The de-duped, score-sorted result set is then sliced to the original limit.
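
The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Bm25Hit` shape, the `mergeByScore` and `searchBoth` names, and the choice to de-duplicate by document `id` (keeping the higher score) are all assumptions. The two query functions are stand-ins for the scoped ParadeDB queries.

```typescript
// Hypothetical hit shape; only the optional `fileId` comes from the PR text.
interface Bm25Hit {
  id: string;
  title: string;
  score: number;
  fileId?: string;
}

// Merge hits from both branches: de-duplicate by document id (assumed key,
// keeping the higher-scoring copy), sort by descending score, slice to limit.
function mergeByScore(inline: Bm25Hit[], fileBacked: Bm25Hit[], limit: number): Bm25Hit[] {
  const best = new Map<string, Bm25Hit>();
  for (const hit of [...inline, ...fileBacked]) {
    const seen = best.get(hit.id);
    if (!seen || hit.score > seen.score) best.set(hit.id, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, limit);
}

// Run both branches in parallel. `runInline` and `runFileBacked` are
// placeholders for the two scoped queries, each with its own conjunctive predicate.
async function searchBoth(
  runInline: () => Promise<Bm25Hit[]>,
  runFileBacked: () => Promise<Bm25Hit[]>,
  limit: number,
): Promise<Bm25Hit[]> {
  const [inline, fileBacked] = await Promise.all([runInline(), runFileBacked()]);
  return mergeByScore(inline, fileBacked, limit);
}
```

Slicing only after the merge matters: if each branch were truncated on its own and the results concatenated, the combined list could exceed the limit or be ordered incorrectly across branches.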

Hits now carry an optional fileId; downstream:

  • KnowledgeBaseDocumentResult (builtin-tool-knowledge-base) — mirror type updated
  • formatSearchResults — exposes fileId="..." on the <document> XML tag so the agent can call readKnowledge with either docs_* or file_*
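
As an illustration of the optional attribute, here is a small sketch of how a `<document>` tag might carry `fileId` only when it is present. The `DocumentHit` shape, the `formatDocumentTag` and `escapeAttr` helpers, and the `id`/`title` attributes are hypothetical; the real `formatSearchResults` output may differ.

```typescript
// Hypothetical hit shape for illustration; `fileId` is optional as in the PR.
interface DocumentHit {
  id: string;
  title: string;
  fileId?: string;
}

// Escape characters that would break an XML attribute value.
const escapeAttr = (value: string): string =>
  value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');

// Build a self-closing <document> tag; fileId is emitted only when set,
// so inline KB pages keep the same shape they had before.
function formatDocumentTag(hit: DocumentHit): string {
  const attrs = [`id="${escapeAttr(hit.id)}"`, `title="${escapeAttr(hit.title)}"`];
  if (hit.fileId) attrs.push(`fileId="${escapeAttr(hit.fileId)}"`);
  return `<document ${attrs.join(' ')} />`;
}
```

Because the attribute is omitted rather than emitted empty, consumers that never read `fileId` see unchanged output for inline documents.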

No schema migration — the existing documents_bm25_idx already covers content/title/slug across all file types.

🧪 How to Test

  • Tested locally
  • Added/updated tests
  • No tests needed

Test plan:

  • packages/database — added searchKnowledgeBaseDocuments > file-backed documents (PDF / parsed files) sub-suite: covers PDF hit via knowledge_base_files join, mixed inline + file-backed in same call, folder exclusion, cross-user isolation. 58/58 tests pass against ParadeDB (DATABASE_TEST_URL=postgresql://lobe:lobe123@localhost:13333/lobechat_test TEST_SERVER_DB=1 bunx vitest run --config vitest.config.server.mts src/repositories/search/index.test.ts)
  • packages/prompts — added fileId attribute assertion to formatSearchResults tests (14/14 pass)
  • src/server/services/knowledgeBase — existing consumer suite (13/13) still passes
  • ESLint clean on all modified files; no new TS errors

📝 Additional Information

  • BM25 hits are document-grained, vector hits are chunk-grained — this PR doesn't change that asymmetry. Chunk-level BM25 (index on chunks.text) is a separate, larger follow-up if finer relevance is needed.
  • The fileId field is optional and additive; existing consumers that don't read it continue to work unchanged.

`searchKnowledgeBaseDocuments` only matched inline `custom/document`
pages, so parsed PDFs and other file-backed documents never surfaced
via the BM25 path — vector search was the sole way to retrieve them.

Run two scoped ParadeDB queries in parallel (inline via
`documents.knowledge_base_id`, file-backed via a `knowledge_base_files`
join) and merge by score in JS. A single OR-ed predicate trips
ParadeDB's `Unsupported query shape` because `paradedb.score()`
requires a conjunctive tantivy scan.

Folder rows are excluded; hits now carry an optional `fileId` so the
agent can read with either `docs_*` or `file_*` ids. The XML formatter
exposes the new attribute downstream.

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment May 26, 2026 5:08pm

@dosubot dosubot Bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) May 26, 2026

@dosubot dosubot Bot added the feature:knowledge-base label (knowledge base / RAG / file chunk) May 26, 2026

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.81%. Comparing base (72d3404) to head (a646fb3).

Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #15247      +/-   ##
==========================================
- Coverage   70.81%   70.81%   -0.01%     
==========================================
  Files        3154     3154              
  Lines      315154   315155       +1     
  Branches    34357    34358       +1     
==========================================
- Hits       223188   223187       -1     
- Misses      91797    91799       +2     
  Partials      169      169              
Flag Coverage Δ
app 61.66% <ø> (-0.01%) ⬇️
database 92.22% <ø> (ø)
packages/agent-runtime 80.48% <ø> (ø)
packages/builtin-tool-lobe-agent 19.87% <ø> (ø)
packages/context-engine 84.13% <ø> (ø)
packages/conversation-flow 91.28% <ø> (ø)
packages/file-loaders 87.89% <ø> (ø)
packages/memory-user-memory 74.99% <ø> (ø)
packages/model-bank 99.99% <ø> (ø)
packages/model-runtime 83.90% <ø> (ø)
packages/prompts 72.68% <100.00%> (+<0.01%) ⬆️
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/types 35.07% <ø> (ø)
packages/utils 88.47% <ø> (ø)
packages/web-crawler 88.08% <ø> (ø)

Components Coverage Δ
Store 67.68% <ø> (ø)
Services 54.64% <ø> (ø)
Server 72.22% <ø> (-0.01%) ⬇️
Libs 56.97% <ø> (ø)
Utils 86.06% <ø> (ø)
@Innei Innei merged commit 65113ca into canary May 28, 2026
47 of 48 checks passed
@Innei Innei deleted the feat/kb-bm25-files branch May 28, 2026 17:01
Coooolfan pushed a commit to Coooolfan/lobehub that referenced this pull request Jun 1, 2026
@arvinxx arvinxx mentioned this pull request Jun 3, 2026
arvinxx added a commit that referenced this pull request Jun 4, 2026
# 🚀 LobeHub Release (20260604)

**Release Date:** June 4, 2026  
**Since v2.2.1:** 88 merged PRs · 11 contributors

> This week brings Execution Devices out of the lab — run agents and
Claude Code on any configured local or remote machine — alongside Claude
Opus 4.8, token-usage analytics, and Page sharing.

---

## ✨ Highlights

- **Execution Devices** — Pick where an agent runs. Desktop and CLI
devices auto-register with a stable machine ID, route through the
gateway by channel, and surface a device switcher in the chat input. Run
remote Claude Code on a configured device, with a recent-directory
picker you can drag to reorder. (#15300, #15315, #15322, #15343, #15351,
#15371)
- **Claude Opus 4.8** — Day-one support for Anthropic's latest model.
(#15314)
- **Token-usage analytics** — A new token-usage mode on the activity
heatmap, backed by a denormalized topic usage/cost rollup so totals stay
accurate without recomputing from messages. (#15365, #15417, #15425)
- **Page sharing** — Share a Page through a dedicated document share
flow, plus new Workspace and Agent share tables. (#15309, #15439)
- **Self-iteration agents** — Agent Signal's execAgent migration lands a
server-runtime bridge, async memory writer, and a registered
self-iteration tool package, with a CLI trigger command for testing.
(#15360, #15364, #15392)
- **Knowledge search** — BM25 search now extends to file-backed
documents, and the portal ships an editable CodeMirror viewer for local
files with document highlighting. (#15247, #15298)

---

## 🏗️ Core Agent & Architecture

### Agent Signal & Runtime

- **execAgent migration** — Server-runtime bridge, completion
projection, async memory writer, and removal of the legacy
`executeSelfIteration` path. (#15392)
- Registered the self-iteration builtin tool package and restored the
three mode-specific self-iteration agent slugs. (#15202, #15364)
- Added a CLI trigger command with a golden-snapshot fixture for Agent
Signal. (#15360)
- **Skill priority** — Agent Builder now emits a skill-priority
instruction with matching server runtime. (#15409)
- Retry empty LLM completions instead of silently finishing the turn.
(#15355)
- Classify topic/agent/session foreign-key violations as
`ConversationParentMissing` for clearer recovery. (#15408)
- Persist canonical nested usage/performance on assistant messages, and
re-link orphan tool messages at the raw bucket write boundary. (#15359,
#15438)
- Guard `createAgent` against LLM double-encoded array fields. (#15381)

---

## 🖥️ Execution Devices & Gateway

- Auto-register desktop and CLI devices with a stable machine ID, and
add the `@lobechat/device-identity` package. (#15300, #15321)
- New Devices settings page behind the Execution Device Switcher lab,
with a device switcher shown for all agents in the chat input. (#15315,
#15371)
- `connectionId` + channel routing across the gateway client and device
list; preset the local device on the first LLM request for the 本机
("local machine") target. (#15322, #15435)
- Run remote Claude Code on a configured device, with drag-to-reorder
recent-directory management and client renders for device tool results.
(#15343, #15351, #15437)
- Preserve content and state across gateway tool calls, and prevent
duplicate streaming from stale reconnects. (#15114, #15354)

---

## 🖥️ CLI & Desktop

- Preserve content/state for connect local file and shell tools; render
the `runCommand` tool result card. (#15441, #15442)
- New `lh topic view` command; CLI now auto-registers its device on
login, matching desktop. (#15340, #15377)
- Resolve CLI tools from the shell `PATH`, and clarify local command
session handling. (#15368, #15389)
- Relocate visual-ref helpers to `@lobechat/const` to fix a renderer
crash; upload `.blockmap` files to S3 for differential updates. (#15326,
#15369)
- Fix a market OAuth expiry that triggered the wrong re-login modal, and
kill dev child processes on parent shutdown. (#15246, #15290)

---

## 🗂️ Pages, Library & Knowledge

- Document share flow with business slot stubs, plus Workspace and Agent
share tables. (#15309, #15439)
- Export Agent profiles as Markdown, preserving an empty agent prompt on
export. (#15312, #15316)
- Editable CodeMirror viewer for local files with document highlighting;
BM25 search extended to file-backed documents. (#15247, #15298)
- Default new Agent-doc files to `.md` and preserve IME composition;
refresh folder data on slug switch and dedupe breadcrumb fetches.
(#15335, #15427)

---

## 💬 Chat & User Experience

- Group-by-status mode for the Topic sidebar; dropped the legacy
session→agentId compatibility path from Topic queries. (#15366, #15378)
- Restore editor focus after the file picker closes, and close the skill
dropdown before navigating to settings. (#15391, #15394)
- Strip markdown tokens from fallback Topic titles; keep an open
ActionBar popup when hovering another message. (#15303, #15372)
- Stabilize home starter loading and stop transliterating model names in
the home starter; show artifact source while streaming. (#15310, #15324,
#15386)
- Group the sidebar spacer with recents and agents. (#15373)

---

## 📊 Analytics, Tasks & Notifications

- Token-usage mode on the activity heatmap, backed by a denormalized
topic usage/cost rollup. (#15365, #15417, #15425)
- Push: new `PushChannel`, receipt cron, and `pushToken` tRPC API.
(#15233)
- Tasks now support file and image attachments. (#15141)

---

## 🧩 Models & Providers

- Support Claude Opus 4.8 and configurable model routing with starters.
(#15314, #15384)
- MiniMax M3: new model entry and an Anthropic video runtime. (#15380,
#15403)
- Add `intern-s2-preview` with `thinking_mode`, and `step-3.7-flash`
support. (#15308, #15317)
- Block disabling the official provider; fix default provider setup in
business mode. (#15379, #15382)

---

## 🎨 UI & Modals

- Migrate modals to `@lobehub/ui/base-ui` (LOBE-9711 + eval batch),
including the create-custom-model and feedback/changelog modals.
(#15401, #15416)
- Restructure confirmModal title and content across deletion flows;
polish the service-model form and migrate its Switch to base-ui.
(#15426, #15440)
- Wrap the BlueBubbles bridge config into a connection card; update
`@lobehub/ui` to v5.15.5. (#15325, #15342)

---

## 🔒 Reliability

- Replace hardcoded `session_context` values with template variables in
credentials. (#15352)
- Point `CHANGELOG_URL` to `/changelog`. (#15428)

---

## 👥 Contributors

Huge thanks to **11 contributors** who shipped **88 merged PRs** this
cycle.

@hezhijie0327 · @qybaihe · @sxjeru · @arvinxx · @Innei · @tjx666 ·
@lijian · @sudongyuer · @cy948 · @rivertwilight · @AmAzing129

Plus @lobehubbot and renovate[bot] for maintenance.

---

**Full Changelog**: v2.2.1...release/weekly-20260604

Labels

feature:knowledge-base knowledge base / RAG / file chunk size:M This PR changes 30-99 lines, ignoring generated files.
