
🐛 fix: add document parsing to knowledge base chunking pipeline#13221

Merged
arvinxx merged 4 commits into canary from fix/kb-upload-document-parsing on Mar 24, 2026

Conversation

@arvinxx

@arvinxx arvinxx commented Mar 24, 2026

Member

💻 Change Type

  • 🐛 fix

🔗 Related Issue

Knowledge base file uploads were missing document parsing, causing detailed content to be unavailable.

🔀 Description of Change

When files are uploaded to a knowledge base, the `parseFileToChunks` async handler only performed chunking (splitting into chunks for RAG/semantic search) but did not create a `documents` record. This meant the parsed document content was not available for detailed viewing.

This fix adds document parsing as a pre-step in the chunking pipeline:

  • Before chunking begins, checks if a `documents` record already exists for the file
  • If not, calls `DocumentService.parseFile()` to create one
  • Wrapped in try-catch so document parsing failure does not block the chunking flow
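
The pre-step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the interfaces `DocumentStore`, `DocumentParser`, and `Chunker`, the `findByFileId` and `chunkFile` methods, the `parseFileToChunksSketch` function, and the assumed signature of `parseFile` are all hypothetical stand-ins for the real services.

```typescript
// Hypothetical shapes for illustration only; the real models and services may differ.
interface DocumentRecord {
  fileId: string;
  content: string;
}

interface DocumentStore {
  // Assumed lookup: resolves to undefined when no documents record exists.
  findByFileId(fileId: string): Promise<DocumentRecord | undefined>;
}

interface DocumentParser {
  // Stands in for DocumentService.parseFile(); this signature is an assumption.
  parseFile(fileId: string): Promise<DocumentRecord>;
}

interface Chunker {
  // Assumed to resolve to the number of chunks created.
  chunkFile(fileId: string): Promise<number>;
}

interface PipelineDeps {
  documents: DocumentStore;
  parser: DocumentParser;
  chunker: Chunker;
}

interface PipelineResult {
  chunkCount: number;
  documentReady: boolean;
}

async function parseFileToChunksSketch(
  fileId: string,
  deps: PipelineDeps,
): Promise<PipelineResult> {
  let documentReady = false;

  // Pre-step: make sure a documents record exists before chunking starts.
  try {
    const existing = await deps.documents.findByFileId(fileId);
    if (!existing) {
      await deps.parser.parseFile(fileId);
    }
    documentReady = true;
  } catch (error) {
    // A parsing failure is logged and swallowed so chunking can still run.
    const message = error instanceof Error ? error.message : String(error);
    console.warn(`document parsing failed for ${fileId}: ${message}`);
  }

  // Chunking runs regardless of the pre-step's outcome.
  const chunkCount = await deps.chunker.chunkFile(fileId);
  return { chunkCount, documentReady };
}
```

In this sketch a parser that throws leaves `documentReady` as `false` while `chunkCount` is still populated, which mirrors the stated goal that a parsing failure should not block the chunking flow.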

🧪 How to Test

  • Tested locally
  • Added/updated tests
  • No tests needed
  1. Upload a file (PDF, DOCX, MD) to a knowledge base
  2. Verify that both chunking and document parsing complete
  3. Check that the file's detailed content is viewable in the knowledge base

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Mar 24, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| lobehub | Ready | Preview, Comment | Mar 24, 2026 11:45am |


@sourcery-ai sourcery-ai Bot left a comment

Contributor


We've reviewed this pull request using the Sourcery rules engine

@lobehubbot

Member

@rivertwilight @nekomeowww - This is a knowledge base fix that adds document parsing to the chunking pipeline in the server async router. Please take a look.

@codecov

codecov Bot commented Mar 24, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.37%. Comparing base (383cace) to head (0ecfc1f).
⚠️ Report is 1 commit behind head on canary.

Additional details and impacted files
@@             Coverage Diff             @@
##           canary   #13221       +/-   ##
===========================================
+ Coverage   74.20%   87.37%   +13.17%     
===========================================
  Files        1537      578      -959     
  Lines      126450    44024    -82426     
  Branches    13930     6854     -7076     
===========================================
- Hits        93828    38466    -55362     
+ Misses      32511     5447    -27064     
  Partials      111      111               
| Flag | Coverage Δ |
| --- | --- |
| app | ? |
| database | 97.89% <ø> (ø) |
| packages/agent-runtime | 89.60% <ø> (ø) |
| packages/context-engine | 83.57% <ø> (ø) |
| packages/conversation-flow | 92.36% <ø> (ø) |
| packages/file-loaders | 87.02% <ø> (ø) |
| packages/memory-user-memory | 66.68% <ø> (ø) |
| packages/model-bank | 99.84% <ø> (ø) |
| packages/model-runtime | 84.79% <ø> (ø) |
| packages/prompts | 74.60% <ø> (ø) |
| packages/python-interpreter | 92.90% <ø> (ø) |
| packages/ssrf-safe-fetch | 0.00% <ø> (ø) |
| packages/utils | 90.09% <ø> (ø) |
| packages/web-crawler | 88.82% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
| --- | --- |
| Store | ∅ <ø> (∅) |
| Services | ∅ <ø> (∅) |
| Server | ∅ <ø> (∅) |
| Libs | ∅ <ø> (∅) |
| Utils | 93.47% <ø> (+2.06%) ⬆️ |

@arvinxx arvinxx merged commit 72ba8c8 into canary Mar 24, 2026
31 checks passed
@arvinxx arvinxx deleted the fix/kb-upload-document-parsing branch March 24, 2026 11:49
@lobehubbot

Member

❤️ Great PR @arvinxx ❤️

The growth of the project is inseparable from user feedback and contributions. Thank you for yours! If you are interested in the LobeHub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we discuss lobe-chat development and share AI news from around the world.

ONLY-yours added a commit that referenced this pull request Mar 27, 2026
# 🚀 release: 20260326

This release includes **91 commits**. Key updates are below.


- **Agent can now execute background tasks** — Agents can perform
long-running operations without blocking your conversation.
[#13289](#13289)
- **Better error messages** — Redesigned error UI across chat and image
generation with clearer explanations and recovery options.
[#13302](#13302)
- **Smoother topic switching** — No more full page reloads when
switching topics while an agent is responding.
[#13309](#13309)
- **Faster image uploads** — Large images are now automatically
compressed to 1920px before upload, reducing wait times.
[#13224](#13224)
- **Improved knowledge base** — Documents are now properly parsed before
chunking, improving retrieval accuracy.
[#13221](#13221)

### Bot Platform

- **WeChat Bot support** — You can now connect LobeChat to WeChat, in
addition to Discord.
[#13191](#13191)
- **Richer bot responses** — Bots now support custom markdown rendering
and context injection.
[#13294](#13294)
- **New bot commands** — Added `/new` to start fresh conversations and
`/stop` to halt generation.
[#13194](#13194)
- **Discord stability fixes** — Fixed thread creation issues and Redis
connection drops.
[#13228](#13228)
[#13205](#13205)

### Models & Providers

- **GLM-5** is now available in the LobeHub model list.
[#13189](#13189)
- **Coding Plan providers** — Added support for code planning assistant
providers. [#13203](#13203)
- **Tencent Hunyuan 3.0 ImageGen** — New image generation model from
Tencent. [#13166](#13166)
- **Gemini content handling** — Better handling when Gemini blocks
content due to safety filters.
[#13270](#13270)
- **Claude token limits fixed** — Corrected max window tokens for
Anthropic Claude models.
[#13206](#13206)

### Skills & Tools

- **Auto credential injection** — Skills can now automatically request
and use required credentials.
[#13124](#13124)
- **Smarter tool permissions** — Built-in tools skip confirmation for
safe paths like `/tmp`.
[#13232](#13232)
- **Model switcher improvements** — Quick access to provider settings
and visual highlight for default model.
[#13220](#13220)

### Memory

- **Bulk delete memories** — You can now delete all memory entries at
once. [#13161](#13161)
- **Per-agent memory control** — Memory injection now respects
individual agent settings.
[#13265](#13265)

### Desktop App

- **Gateway connection** — Desktop app can now connect to LobeHub
Gateway for enhanced features.
[#13234](#13234)
- **Connection status indicator** — See gateway connection status in the
titlebar. [#13260](#13260)
- **Settings persistence** — Gateway toggle state now persists across
app restarts. [#13300](#13300)

### CLI

- **API key authentication** — CLI now supports API key auth for
programmatic access.
[#13190](#13190)
- **Shell completion** — Tab completion for bash/zsh/fish shells.
[#13164](#13164)
- **Man pages** — Built-in manual pages for CLI commands.
[#13200](#13200)

### Security

- **XSS protection** — Sanitized search result image titles to prevent
script injection.
[#13303](#13303)
- **Workflow hardening** — Fixed potential shell injection in release
automation. [#13319](#13319)
- **Dependency update** — Updated nodemailer to address security
advisory. [#13326](#13326)

### Bug Fixes

- Fixed skill page not redirecting correctly after import.
[#13255](#13255)
[#13261](#13261)
- Fixed token counting in group chats.
[#13247](#13247)
- Fixed editor not resetting when switching to empty pages.
[#13229](#13229)
- Fixed manual tool toggle not working.
[#13218](#13218)
- Fixed Search1API response parsing.
[#13207](#13207)
[#13208](#13208)
- Fixed mobile topic menus rendering issues.
[#12477](#12477)
- Fixed history count calculation for accurate context.
[#13051](#13051)
- Added missing Turkish translations.
[#13196](#13196)

### Credits

Huge thanks to these contributors:

@bakiburakogun @hardy-one @Zhouguanyang @sxjeru @hezhijie0327 @arvinxx
@cy948 @CanisMinor @Innei @lijian @lobehubbot @neko @rdmclin2
@rivertwilight @tjx666
