♻️ refactor: remove langchain dependency, use direct document loaders #13304
Replace langchain and @langchain/community with self-implemented text splitters and direct usage of underlying libraries (pdf-parse, d3-dsv, mammoth, officeparser, epub2). This eliminates unnecessary dependency bloat and addresses CVE-2026-26019 in @langchain/community.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rivertwilight @nekomeowww - This PR removes the LangChain dependency and replaces it with direct document loaders for the knowledge base pipeline. It also updates the server-side ContentChunk module. Please coordinate on review.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #13304      +/-   ##
==========================================
+ Coverage   66.71%   66.79%   +0.07%
==========================================
  Files        1884     1886       +2
  Lines      150871   151317     +446
  Branches    15184    17963    +2779
==========================================
+ Hits       100660   101070     +410
- Misses      50100    50136      +36
  Partials      111      111
      separators: string[],
      config: SplitterConfig,
    ): string[] {
      const { chunkSize, chunkOverlap } = config;
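The excerpt above shows only the tail of a splitter signature. As a rough illustration of what a recursive, separator-based splitter of this shape might do, here is a minimal sketch. The names `splitRecursive` and `mergePieces`, the shape of `SplitterConfig`, and the fallback to fixed-width slices are all assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical config shape; the field names mirror the excerpt,
// but the real SplitterConfig in the PR may differ.
interface SplitterConfig {
  chunkSize: number;
  chunkOverlap: number;
}

// Greedily merge pieces (each assumed to be <= chunkSize) into chunks of at
// most chunkSize characters, keeping up to chunkOverlap trailing characters
// of whole pieces as the start of the next chunk.
function mergePieces(pieces: string[], separator: string, config: SplitterConfig): string[] {
  const { chunkSize, chunkOverlap } = config;
  const chunks: string[] = [];
  const window: string[] = [];
  const joinedLength = (parts: string[]): number =>
    parts.reduce((sum, p) => sum + p.length, 0) + separator.length * Math.max(parts.length - 1, 0);

  for (const piece of pieces) {
    if (window.length > 0 && joinedLength([...window, piece]) > chunkSize) {
      chunks.push(window.join(separator));
      // Drop pieces from the front until the kept tail fits the overlap budget
      // and leaves room for the incoming piece.
      while (
        window.length > 0 &&
        (joinedLength(window) > chunkOverlap || joinedLength([...window, piece]) > chunkSize)
      ) {
        window.shift();
      }
    }
    window.push(piece);
  }
  if (window.length > 0) chunks.push(window.join(separator));
  return chunks;
}

// Split on the first separator that occurs in the text; pieces that are still
// too large are split again with the remaining separators.
function splitRecursive(text: string, separators: string[], config: SplitterConfig): string[] {
  if (text.length <= config.chunkSize) return text.length > 0 ? [text] : [];

  const index = separators.findIndex((sep) => sep !== '' && text.includes(sep));
  if (index === -1) {
    // Assumption: with no usable separator, fall back to fixed-width slices.
    const step = Math.max(config.chunkSize - config.chunkOverlap, 1);
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += step) {
      slices.push(text.slice(i, i + config.chunkSize));
      if (i + config.chunkSize >= text.length) break;
    }
    return slices;
  }

  const separator = separators[index];
  const rest = separators.slice(index + 1);
  const chunks: string[] = [];
  let small: string[] = [];
  const flush = (): void => {
    chunks.push(...mergePieces(small, separator, config));
    small = [];
  };

  for (const piece of text.split(separator)) {
    if (piece.length === 0) continue;
    if (piece.length <= config.chunkSize) {
      small.push(piece);
    } else {
      flush();
      chunks.push(...splitRecursive(piece, rest, config));
    }
  }
  flush();
  return chunks;
}

const demo = splitRecursive('aaaa bbbb cccc\n\ndddd eeee', ['\n\n', ' '], {
  chunkSize: 9,
  chunkOverlap: 0,
});
console.log(demo); // [ 'aaaa bbbb', 'cccc', 'dddd eeee' ]
```

In this sketch the separator between two emitted chunks is dropped, so any code that later maps chunks back to positions in the source text has to search for them rather than sum their lengths.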
    /**
     * Calculate line location metadata for a chunk within the original text.
     */
    function getLineLocation(fullText: string, chunk: string): { from: number; to: number } {
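Only the doc comment and signature are visible in this excerpt. One plausible body for such a helper, assuming 1-based inclusive line numbers and a first-match lookup, is sketched below under the hypothetical name `lineLocationSketch`. The not-found fallback is also an assumption.

```typescript
// Hypothetical helper modelled on the signature in the excerpt; the PR's
// actual implementation is not shown and may differ.
function lineLocationSketch(fullText: string, chunk: string): { from: number; to: number } {
  const index = fullText.indexOf(chunk);
  // Assumption: report the first line when the chunk cannot be located.
  if (index === -1) return { from: 1, to: 1 };
  // Lines before the match, plus one, give the 1-based starting line.
  const from = fullText.slice(0, index).split('\n').length;
  // The chunk spans as many lines as it has newline-separated segments.
  const to = from + chunk.split('\n').length - 1;
  return { from, to };
}

console.log(lineLocationSketch('a\nb\nc\nd', 'b\nc')); // { from: 2, to: 3 }
```

Because `indexOf` returns the first occurrence, a helper like this reports the wrong lines for text that repeats, which is why a running search offset matters once several chunks are located in sequence.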
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f46ed13b07
        const chunkLines = chunk.split('\n').length;
        loc = { from, to: from + chunkLines - 1 };
        // Advance search position past this match (but allow overlap)
        searchFrom = index + 1;
Advance metadata search offset by chunk span
createDocuments moves searchFrom forward by one character after each match, so on repetitive/overlapping content indexOf can match inside the previous chunk instead of the next chunk boundary. In practice this makes metadata.loc.lines drift (e.g., repeated lines can advance 1 line per chunk instead of by the non-overlapped span), which breaks line-based references/citations produced from these chunks. The offset should advance by the consumed chunk span (accounting for overlap), not index + 1.
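A minimal sketch of the suggested change, assuming consecutive chunks share at most `chunkOverlap` characters. `createDocumentsSketch` and `ChunkDocument` are hypothetical stand-ins, not the PR's real `createDocuments`. The only difference from the reviewed code is how far `searchFrom` advances after each match.

```typescript
// Hypothetical document shape for illustration only.
interface ChunkDocument {
  pageContent: string;
  metadata: { loc: { lines: { from: number; to: number } } };
}

function createDocumentsSketch(
  fullText: string,
  chunks: string[],
  chunkOverlap: number,
): ChunkDocument[] {
  const docs: ChunkDocument[] = [];
  let searchFrom = 0;

  for (const chunk of chunks) {
    const found = fullText.indexOf(chunk, searchFrom);
    // Assumption: if the chunk is not found, report the current search position.
    const index = found === -1 ? searchFrom : found;
    const from = fullText.slice(0, index).split('\n').length;
    const to = from + chunk.split('\n').length - 1;
    docs.push({ pageContent: chunk, metadata: { loc: { lines: { from, to } } } });

    // Advance by the non-overlapped span of this chunk instead of index + 1,
    // so the next search cannot match inside text already consumed.
    if (found !== -1) {
      searchFrom = found + Math.max(chunk.length - chunkOverlap, 1);
    }
  }
  return docs;
}

// Six identical lines split into three two-line chunks with no overlap.
const repeated = 'x\nx\nx\nx\nx\nx';
const located = createDocumentsSketch(repeated, ['x\nx', 'x\nx', 'x\nx'], 0);
console.log(located.map((d) => d.metadata.loc.lines));
// [ { from: 1, to: 2 }, { from: 3, to: 4 }, { from: 5, to: 6 } ]
```

With `searchFrom = index + 1`, the second chunk in this example would match at line 2 rather than line 3, which is the one-line-per-chunk drift described above. A repeated pattern inside the overlap window can still produce an early match, so this narrows the problem rather than ruling it out.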
❤️ Great PR @arvinxx ❤️ The growth of the project is inseparable from user feedback and contributions; thanks for your contribution! If you are interested in the lobehub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we discuss lobe-chat development and share AI news from around the world.
Summary
- Remove the langchain and @langchain/community dependencies, replacing them with self-implemented text splitters and direct usage of the underlying libraries (pdf-parse, d3-dsv, mammoth, officeparser, epub2 + html-to-text)
- Address CVE-2026-26019 (in @langchain/community's RecursiveUrlLoader)
- Rename src/libs/langchain → src/libs/document-loaders for clearer semantics
Motivation
The project only used LangChain for document loading and text chunking in the knowledge base pipeline. All LangChain loaders are thin wrappers around other npm packages, and the text splitters are pure string algorithms. Keeping langchain + @langchain/community (~massive transitive dependency tree) for this limited usage introduced unnecessary bloat and security surface.
Changes
- PDFLoader from @langchain/community → pdf-parse usage
- CSVLoader from @langchain/community → d3-dsv usage
- DocxLoader from @langchain/community → mammoth usage
- PPTXLoader from @langchain/community → officeparser usage
- EPubLoader from @langchain/community → epub2 + html-to-text usage
- RecursiveCharacterTextSplitter from langchain, MarkdownTextSplitter / LatexTextSplitter → self-implemented text splitters
Test plan
🤖 Generated with Claude Code