feat(core): auto-extract description from first contentful line in extractPageData#3006
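## Change summary

Two new functions were added to `extractPageData.ts` to extract the description automatically:

### 1. `extractTextFromNode` (lines 92-100)

Recursively extracts the plain-text content of a node, handling both text nodes and nodes that contain children.

### 2. `extractDescription` (lines 106-118)

Extracts the first paragraph before the first h2 in the AST as the description:

- Iterates over the children of the root node
- Stops searching when it reaches an h2 (`heading` with `depth === 2`)
- Returns the text content of the first `paragraph` node it finds

### 3. Integration into `getPageIndexInfoByRoute`

- Lines 171-174: when `frontmatter.description` is absent, call `extractDescription` to extract the description
- Line 239: set the description in the returned frontmatter, preferring the frontmatter value and falling back to the extracted one

### New tests

- Auto-extraction of the description (when there is no `frontmatter.description`)
- Priority of `frontmatter.description` (no extraction when `frontmatter.description` is present)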
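## Change summary

### 1. `packages/shared/src/types/index.ts`

Added a new `description?: string` field to the `PageIndexInfo` interface.

### 2. `packages/core/src/node/route/extractPageData.ts`

- Added the `extractTextFromNode` function: recursively extracts the plain-text content of a node
- Added the `extractDescription` function: extracts the first paragraph before the first h2 in the AST
- The result now carries a separate `description` field, while `frontmatter.description` is left as-is

### 3. Test updates

- Added 3 test cases that verify the description extraction logic
- Verified that when `frontmatter.description` exists, the `description` field takes its value and `frontmatter.description` stays unchanged
- Verified that when `frontmatter.description` does not exist, the `description` field takes the value extracted from the content and `frontmatter.description` stays `undefined`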
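The two helpers described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the `MdNode` interface is a simplified, assumed stand-in for the real mdast types, and skipping empty paragraphs is an assumption drawn from the "first contentful line" wording in the title.

```typescript
// Simplified mdast-like node shape, assumed for illustration only.
interface MdNode {
  type: string;
  depth?: number;
  value?: string;
  children?: MdNode[];
}

// Recursively concatenates the text of a node and all of its descendants.
function extractTextFromNode(node: MdNode): string {
  if (typeof node.value === 'string') return node.value;
  if (node.children) return node.children.map(extractTextFromNode).join('');
  return '';
}

// Returns the text of the first non-empty paragraph that appears before any h2.
function extractDescription(root: MdNode): string | undefined {
  for (const child of root.children ?? []) {
    // An h2 ends the search: paragraphs after it are section body, not a summary.
    if (child.type === 'heading' && child.depth === 2) return undefined;
    if (child.type === 'paragraph') {
      const text = extractTextFromNode(child).trim();
      if (text) return text;
    }
  }
  return undefined;
}
```

Only top-level children of the root are inspected, so a paragraph nested inside a blockquote or list would not be picked up by this sketch.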
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This pull request adds functionality to automatically extract page descriptions from markdown content when they are not explicitly provided in frontmatter. The description is extracted from the first paragraph that appears before any h2 heading.
Changes:
- Added optional `description` field to `PageIndexInfo` type
- Implemented description extraction logic that uses the first paragraph before h2 as fallback
- Added comprehensive test coverage including new test fixture and test cases
- Updated existing test snapshots to include description field
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/shared/src/types/index.ts | Added optional description field to PageIndexInfo interface |
| packages/core/src/node/route/fixtures/content-processing/with-description.mdx | New test fixture for description extraction feature |
| packages/core/src/node/route/extractPageData.ts | Implemented extractTextFromNode and extractDescription helper functions, integrated description extraction into page data processing |
| packages/core/src/node/route/extractPageData.test.ts | Added three new test cases and updated existing test snapshots to include description field |
- Update plugin-llms to use page.description for llms.txt generation
- Update Layout component to use page.description for meta tags

This ensures the auto-extracted description is used when frontmatter.description is not provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
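llmsTxt currently cannot read the `page.description` produced in createPageData. Please try to fix this. One possible solution is to store the data produced after createPageData in a global variable and read it from there in llmsTxt.ts.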
1. Improve the tests by splitting them into two `describe` blocks, with `getPageIndexInfoByRoute` in its own `describe`
2. Fix the following error:
FAIL packages/core/src/node/route/extractPageData.test.ts > extractPageData > basic
TypeError: routeService.getRoutePageByRoutePath is not a function
251 | const pageIndexInfo = await getPageIndexInfoByRoute(routeMeta, options);
252 | // Store pageIndexInfo in RoutePage for llmsTxt to access
> 253 | const routePage = routeService.getRoutePageByRoutePath(
| ^
254 | routeMeta.routePath,
255 | );
at packages/core/src/node/route/extractPageData.ts:253:38
at extractPageData (packages/core/src/node/route/extractPageData.ts:249:20)
at packages/core/src/node/route/extractPageData.test.ts:28:22
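The trace suggests the `routeService` object used in the test does not provide `getRoutePageByRoutePath`. One way a test could supply it is sketched below. Everything here is hypothetical: `RoutePageSketch`, `RouteServiceSketch`, `createRouteServiceStub`, and `storePageIndexInfo` are illustrative names, and the real `RouteService` and `RoutePage` types in the repository may differ.

```typescript
// Hypothetical shapes; not the real rspress types.
interface RoutePageSketch {
  routePath: string;
  pageIndexInfo?: { description?: string };
}

interface RouteServiceSketch {
  getRoutePageByRoutePath(routePath: string): RoutePageSketch | undefined;
}

// Builds an in-memory stub that a unit test could pass instead of a bare mock object.
function createRouteServiceStub(routePaths: string[]): RouteServiceSketch {
  const pages = new Map<string, RoutePageSketch>();
  for (const routePath of routePaths) {
    pages.set(routePath, { routePath });
  }
  return {
    getRoutePageByRoutePath: (routePath: string) => pages.get(routePath),
  };
}

// Mirrors the failing call site: look up the page and attach the info only if it exists.
function storePageIndexInfo(
  service: RouteServiceSketch,
  routePath: string,
  info: { description?: string },
): void {
  const routePage = service.getRoutePageByRoutePath(routePath);
  if (routePage) {
    routePage.pageIndexInfo = info;
  }
}
```

Guarding on the lookup result keeps the call safe for routes the service does not know about, which is a design assumption rather than something the PR states.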
Refactor extractDescription to collect all paragraph text between h1 and h2 instead of just the first paragraph. Skip code blocks, HTML, imports, tables, and images, following the Docusaurus createExcerpt strategy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
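A rough sketch of the refactored behavior, under stated assumptions: `AstNode` is a simplified stand-in for mdast nodes, `collectDescription` and `inlineText` are hypothetical names, and the sketch treats everything before the first h2 as the excerpt region (the h1 itself is a heading, so it is never collected). It does not reproduce Docusaurus's `createExcerpt`.

```typescript
// Simplified mdast-like node shape, assumed for illustration only.
interface AstNode {
  type: string;
  depth?: number;
  value?: string;
  children?: AstNode[];
}

// Inline node types whose content should not leak into the description.
const SKIPPED_INLINE_TYPES = new Set(['image', 'imageReference', 'html']);

// Collects inline text, dropping images and raw HTML.
function inlineText(node: AstNode): string {
  if (SKIPPED_INLINE_TYPES.has(node.type)) return '';
  if (typeof node.value === 'string') return node.value;
  return (node.children ?? []).map(inlineText).join('');
}

// Joins the text of every top-level paragraph that appears before the first h2.
function collectDescription(root: AstNode): string {
  const parts: string[] = [];
  for (const child of root.children ?? []) {
    if (child.type === 'heading' && child.depth === 2) break;
    // Only paragraphs contribute; code, html, mdxjsEsm (imports), and table
    // nodes are other block types and fall through here.
    if (child.type !== 'paragraph') continue;
    const text = inlineText(child).replace(/\s+/g, ' ').trim();
    if (text) parts.push(text);
  }
  return parts.join(' ');
}
```

Filtering by "paragraph only" at the block level is what excludes code blocks, tables, and imports without listing them one by one.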
- Reuse processor instances at module level instead of creating new ones per file
- Use structuredClone to reuse parsed AST tree, avoiding duplicate parsing
- Use unist-util-visit for more efficient tree traversal in remarkStripLinkUrls

These optimizations reduce the overhead when processing multiple markdown files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The tree is no longer needed after extracting TOC and description, so we can pass it directly to stringifyProcessor.run() without cloning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove separate parseProcessor, use the same processor for both parse() and run()+stringify() operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Summary
Add automatic description extraction to `extractPageData` when `frontmatter.description` is not provided. The description is extracted from the first paragraph before the first h2 heading in the markdown content.

Related Issue
Checklist
AI Summary
Changes Made
- Added `description` field to `PageIndexInfo` type (`packages/shared/src/types/index.ts`)
  - `description?: string` field added to the interface
- Implemented description extraction in `extractPageData.ts` (`packages/core/src/node/route/extractPageData.ts`)
  - `extractTextFromNode()` function: recursively extracts plain text content from AST nodes
  - `extractDescription()` function: extracts the first paragraph's text content before any h2 heading
  - `description` field is set at the top level of `PageIndexInfo`, while `frontmatter.description` preserves the original frontmatter value
- Added test cases (`packages/core/src/node/route/extractPageData.test.ts`)
  - `frontmatter.description` is not set
  - `frontmatter.description` takes priority when provided
  - `frontmatter.description` remains unchanged (preserves original value)

How It Works
1. If `frontmatter.description` exists, use it as the `description` field value
2. If `frontmatter.description` is not set, extract the first paragraph before the first h2 heading as the description
3. `frontmatter.description` always preserves the original frontmatter value (not modified by extraction)

This PR was written using Vibe Kanban
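The fallback order under "How It Works" can be illustrated with a small sketch. The names `FrontmatterSketch`, `PageInfoSketch`, and `resolveDescription` are hypothetical, and using `??` (so that only `undefined` or `null` triggers the fallback) is an assumption; the PR may treat an empty string differently.

```typescript
// Hypothetical, trimmed-down shapes; the real PageIndexInfo has more fields.
interface FrontmatterSketch {
  description?: string;
  [key: string]: unknown;
}

interface PageInfoSketch {
  frontmatter: FrontmatterSketch;
  description?: string;
}

// Prefers the author-provided description and falls back to the extracted one.
// The frontmatter object is passed through untouched.
function resolveDescription(
  frontmatter: FrontmatterSketch,
  extracted: string | undefined,
): PageInfoSketch {
  return {
    frontmatter,
    description: frontmatter.description ?? extracted,
  };
}
```

Keeping the resolved value in a separate top-level field means consumers such as meta tags or llms.txt generation can read one field, while the original frontmatter remains available for anything that needs to know what the author actually wrote.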