♻️ refactor: remove langchain dependency, use direct document loaders by arvinxx · Pull Request #13304 · lobehub/lobehub

arvinxx · 2026-03-26T12:15:37Z

Summary

Remove langchain and @langchain/community dependencies, replacing with self-implemented text splitters and direct usage of underlying libraries (pdf-parse, d3-dsv, mammoth, officeparser, epub2 + html-to-text)
Addresses CVE-2026-26019 (SSRF bypass in @langchain/community's RecursiveUrlLoader)
Rename src/libs/langchain → src/libs/document-loaders for clearer semantics

Motivation

The project only used LangChain for document loading and text chunking in the knowledge base pipeline. All LangChain loaders are thin wrappers around other npm packages, and the text splitters are pure string algorithms. Keeping langchain and @langchain/community, with their large transitive dependency tree, for this limited usage added unnecessary bloat and attack surface.

Changes

Before (LangChain)	After (Direct)
`PDFLoader` from `@langchain/community`	Direct `pdf-parse` usage
`CSVLoader` from `@langchain/community`	Direct `d3-dsv` usage
`DocxLoader` from `@langchain/community`	Direct `mammoth` usage
`PPTXLoader` from `@langchain/community`	Direct `officeparser` usage
`EPubLoader` from `@langchain/community`	Direct `epub2` + `html-to-text` usage
`RecursiveCharacterTextSplitter` from `langchain`	Self-implemented (~150 lines)
`MarkdownTextSplitter` / `LatexTextSplitter`	Self-implemented variants
Language-aware code splitter	Self-implemented with same separator lists
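
For readers unfamiliar with the technique behind the "Self-implemented" splitter rows, the following is a minimal sketch of recursive character splitting. It is not the code from this PR: only the `separators`, `SplitterConfig`, `chunkSize`, and `chunkOverlap` names appear in the diff, while `splitRecursive` and `mergePieces` are hypothetical names chosen for illustration. The sketch assumes separators are dropped at split points and re-inserted when small pieces are merged.

```typescript
interface SplitterConfig {
  chunkSize: number; // maximum characters per chunk
  chunkOverlap: number; // maximum characters shared by adjacent chunks
}

// Greedily pack pieces (each assumed to be <= chunkSize) into chunks,
// carrying trailing pieces forward as overlap for the next chunk.
function mergePieces(pieces: string[], separator: string, config: SplitterConfig): string[] {
  const { chunkSize, chunkOverlap } = config;
  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0; // length of current.join(separator)

  for (const piece of pieces) {
    const extra = piece.length + (current.length > 0 ? separator.length : 0);
    if (current.length > 0 && length + extra > chunkSize) {
      chunks.push(current.join(separator));
      // Drop leading pieces until the remainder fits the overlap budget
      // and still leaves room for the incoming piece.
      while (
        current.length > 0 &&
        (length > chunkOverlap || length + separator.length + piece.length > chunkSize)
      ) {
        length -= current[0].length + (current.length > 1 ? separator.length : 0);
        current.shift();
      }
    }
    length += piece.length + (current.length > 0 ? separator.length : 0);
    current.push(piece);
  }
  if (current.length > 0) chunks.push(current.join(separator));
  return chunks;
}

// Split on the first separator that occurs in the text; pieces that are
// still too large are split again with the remaining, finer separators.
function splitRecursive(text: string, separators: string[], config: SplitterConfig): string[] {
  if (text.length <= config.chunkSize) return text.length > 0 ? [text] : [];
  const idx = separators.findIndex((s) => s === '' || text.includes(s));
  if (idx === -1) return [text]; // nothing left to split on: emit an oversized chunk
  const separator = separators[idx];
  const finer = separators.slice(idx + 1);
  const parts = (separator === '' ? Array.from(text) : text.split(separator)).filter(
    (p) => p.length > 0,
  );

  const result: string[] = [];
  let small: string[] = [];
  const flush = () => {
    if (small.length > 0) {
      result.push(...mergePieces(small, separator, config));
      small = [];
    }
  };
  for (const part of parts) {
    if (part.length <= config.chunkSize) {
      small.push(part);
    } else {
      flush();
      result.push(...splitRecursive(part, finer, config));
    }
  }
  flush();
  return result;
}
```

With these assumptions, `splitRecursive('aaa bbb ccc ddd', [' '], { chunkSize: 7, chunkOverlap: 3 })` returns `['aaa bbb', 'bbb ccc', 'ccc ddd']`: each chunk stays within seven characters and repeats the last word of the previous one.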

Test plan

All 8 test files pass (31 tests total)
Text splitter produces correct chunk boundaries and line metadata
Code splitter handles all 16 supported languages
CSV, PDF, DOCX, PPTX, EPUB loaders produce correct output
ContentChunk integration tests pass with updated imports
TypeScript type check passes with no errors

🤖 Generated with Claude Code

Replace langchain and @langchain/community with self-implemented text splitters and direct usage of underlying libraries (pdf-parse, d3-dsv, mammoth, officeparser, epub2). This eliminates unnecessary dependency bloat and addresses CVE-2026-26019 in @langchain/community.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vercel · 2026-03-26T12:15:43Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
lobehub	Ready	Preview, Comment	Mar 26, 2026 0:56am

sourcery-ai

Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

lobehubbot · 2026-03-26T12:19:05Z

@rivertwilight @nekomeowww - This PR removes the LangChain dependency and replaces it with direct document loaders for the knowledge base pipeline. It also updates the server-side ContentChunk module. Please coordinate on review.

codecov · 2026-03-26T12:22:31Z

Codecov Report

❌ Patch coverage is 89.19926% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.79%. Comparing base (093fa7b) to head (679f506).
⚠️ Report is 4 commits behind head on canary.

Additional details and impacted files

@@            Coverage Diff             @@
##           canary   #13304      +/-   ##
==========================================
+ Coverage   66.71%   66.79%   +0.07%     
==========================================
  Files        1884     1886       +2     
  Lines      150871   151317     +446     
  Branches    15184    17963    +2779     
==========================================
+ Hits       100660   101070     +410     
- Misses      50100    50136      +36     
  Partials      111      111

Flag	Coverage Δ
app	`58.23% <89.19%> (+0.14%)`	⬆️
database	`96.64% <ø> (ø)`
packages/agent-runtime	`89.61% <ø> (ø)`
packages/context-engine	`83.22% <ø> (ø)`
packages/conversation-flow	`92.36% <ø> (ø)`
packages/file-loaders	`87.02% <ø> (ø)`
packages/memory-user-memory	`66.68% <ø> (ø)`
packages/model-bank	`99.85% <ø> (ø)`
packages/model-runtime	`84.53% <ø> (ø)`
packages/prompts	`67.76% <ø> (ø)`
packages/python-interpreter	`92.90% <ø> (ø)`
packages/ssrf-safe-fetch	`0.00% <ø> (ø)`
packages/utils	`90.41% <ø> (ø)`
packages/web-crawler	`88.82% <ø> (ø)`

Components	Coverage Δ
Store	`66.07% <ø> (ø)`
Services	`49.56% <ø> (ø)`
Server	`67.40% <87.50%> (+<0.01%)`	⬆️
Libs	`51.06% <89.22%> (+5.60%)`	⬆️
Utils	`91.01% <ø> (ø)`

+  separators: string[],
+  config: SplitterConfig,
+): string[] {
+  const { chunkSize, chunkOverlap } = config;


+/**
+ * Calculate line location metadata for a chunk within the original text.
+ */
+function getLineLocation(fullText: string, chunk: string): { from: number; to: number } {


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f46ed13b07

chatgpt-codex-connector · 2026-03-26T12:24:38Z

+      const chunkLines = chunk.split('\n').length;
+      loc = { from, to: from + chunkLines - 1 };
+      // Advance search position past this match (but allow overlap)
+      searchFrom = index + 1;


Advance metadata search offset by chunk span

createDocuments moves searchFrom forward by one character after each match, so on repetitive/overlapping content indexOf can match inside the previous chunk instead of the next chunk boundary. In practice this makes metadata.loc.lines drift (e.g., repeated lines can advance 1 line per chunk instead of by the non-overlapped span), which breaks line-based references/citations produced from these chunks. The offset should advance by the consumed chunk span (accounting for overlap), not index + 1.
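
The suggestion above can be sketched as follows. This is an illustrative sketch, not the fix that was merged: `locateChunks` and `countNewlines` are hypothetical helpers, while `searchFrom`, `index`, and `from` mirror the names in the reviewed diff. It assumes every chunk appears verbatim in the full text and that adjacent chunks never share more than `chunkOverlap` characters, so the next chunk cannot start before `index + chunk.length - chunkOverlap`.

```typescript
interface LineLocation {
  from: number; // 1-based first line of the chunk
  to: number; // 1-based last line of the chunk
}

// Count '\n' characters in a string.
function countNewlines(s: string): number {
  let n = 0;
  for (const ch of s) {
    if (ch === '\n') n++;
  }
  return n;
}

// Locate each chunk in fullText, advancing the search offset by the
// non-overlapped span of the previous chunk instead of by one character.
function locateChunks(
  fullText: string,
  chunks: string[],
  chunkOverlap: number,
): (LineLocation | undefined)[] {
  let searchFrom = 0;
  return chunks.map((chunk) => {
    const index = fullText.indexOf(chunk, searchFrom);
    if (index === -1) return undefined; // chunk not found verbatim (e.g. it was trimmed)
    const from = countNewlines(fullText.slice(0, index)) + 1;
    const to = from + countNewlines(chunk);
    // Earliest position at which the next chunk can start under the overlap assumption.
    searchFrom = index + Math.max(1, chunk.length - chunkOverlap);
    return { from, to };
  });
}
```

For six identical lines `'a\na\na\na\na\na'` split into three `'a\na'` chunks with no overlap, this yields lines 1-2, 3-4, and 5-6. Advancing by `index + 1` would instead report the second chunk at lines 2-3, which is the drift the review describes.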

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lobehubbot · 2026-03-26T13:14:06Z

❤️ Great PR @arvinxx ❤️

The growth of the project is inseparable from user feedback and contributions; thank you for yours! If you are interested in the lobehub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we discuss lobe-chat development and share AI news from around the world.

sourcery-ai Bot reviewed Mar 26, 2026

github-code-quality Bot found potential problems Mar 26, 2026

chatgpt-codex-connector Bot reviewed Mar 26, 2026

🐛 fix: add missing @types/html-to-text and @types/pdf-parse

679f506

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview March 26, 2026 12:56 View deployment

arvinxx merged commit 3f14800 into canary Mar 26, 2026
34 checks passed

arvinxx deleted the refactor/remove-langchain branch March 26, 2026 13:13

arvinxx merged 2 commits into canary from refactor/remove-langchain