
feat(core): auto-extract description from first contentful line in extractPageData #3006

Merged
SoonIter merged 12 commits into main from syt-vibe-kanban/99b2-extractpagedata
Jan 15, 2026

Conversation

@SoonIter
Member

@SoonIter SoonIter commented Jan 14, 2026

Summary

Add automatic description extraction to extractPageData when frontmatter.description is not provided. The description is extracted from the first paragraph before the first h2 heading in the markdown content.

https://docusaurus.io/docs/markdown-features/head-metadata#markdown-page-description


Related Issue

Checklist

  • Tests updated (or not required).
  • Documentation updated (or not required).

AI Summary

Changes Made

  1. Added description field to PageIndexInfo type (packages/shared/src/types/index.ts)

    • New optional description?: string field added to the interface
  2. Implemented description extraction in extractPageData.ts (packages/core/src/node/route/extractPageData.ts)

    • Added extractTextFromNode() function: recursively extracts plain text content from AST nodes
    • Added extractDescription() function: extracts the first paragraph's text content before any h2 heading
    • The description field is set at the top level of PageIndexInfo, while frontmatter.description preserves the original frontmatter value
  3. Added test cases (packages/core/src/node/route/extractPageData.test.ts)

    • Tests for automatic description extraction when frontmatter.description is not set
    • Tests to verify frontmatter.description takes priority when provided
    • Tests to ensure frontmatter.description remains unchanged (preserves original value)

How It Works

  • When processing a markdown file, if frontmatter.description exists, use it as the description field value
  • If frontmatter.description is not set, extract the first paragraph before the first h2 heading as the description
  • The frontmatter.description always preserves the original frontmatter value (not modified by extraction)
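The priority rule above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Frontmatter` and `PageInfoSketch` shapes and the `resolveDescription` helper are hypothetical, and the use of `??` (rather than `||`) for the fallback is an assumption.

```typescript
// Hypothetical, simplified shapes; the real PageIndexInfo has more fields.
interface Frontmatter {
  description?: string;
  [key: string]: unknown;
}

interface PageInfoSketch {
  frontmatter: Frontmatter;
  description?: string;
}

// `extracted` stands in for whatever the content-based extraction returned.
function resolveDescription(
  frontmatter: Frontmatter,
  extracted: string | undefined,
): PageInfoSketch {
  return {
    // The frontmatter object is passed through untouched.
    frontmatter,
    // Frontmatter wins; otherwise fall back to the extracted text (assumed `??`).
    description: frontmatter.description ?? extracted,
  };
}

const withFm = resolveDescription(
  { description: 'From frontmatter' },
  'From content',
);
const withoutFm = resolveDescription({}, 'From content');
```

Here `withFm.description` comes from the frontmatter, while `withoutFm.description` falls back to the extracted text and `withoutFm.frontmatter.description` stays `undefined`.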

This PR was written using Vibe Kanban


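## Change summary

Two new functions were added to `extractPageData.ts` to implement automatic description extraction:

### 1. `extractTextFromNode` (lines 92-100)
Recursively extracts the plain text content of a node, handling both text nodes and nodes with children.

### 2. `extractDescription` (lines 106-118)
Extracts the first paragraph of text before the first h2 from the AST as the description:
- Iterate over the root node's children
- Stop searching when an h2 (`heading` with `depth === 2`) is encountered
- On finding the first `paragraph` node, return its text content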
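
The two helpers described above might look roughly like this. It is a sketch under stated assumptions, not the PR's actual code: `MdNode` is a locally declared, mdast-like shape, and the function bodies are reconstructed from the description only.

```typescript
// Minimal mdast-like node shape, declared locally so the sketch is self-contained.
interface MdNode {
  type: string;
  value?: string;
  depth?: number;
  children?: MdNode[];
}

// Recursively concatenate the text of a node and its descendants.
function extractTextFromNode(node: MdNode): string {
  if (node.type === 'text') return node.value ?? '';
  if (node.children) return node.children.map(extractTextFromNode).join('');
  return '';
}

// Return the text of the first paragraph that appears before any h2.
function extractDescription(root: MdNode): string | undefined {
  for (const child of root.children ?? []) {
    // An h2 ends the search region.
    if (child.type === 'heading' && child.depth === 2) break;
    if (child.type === 'paragraph') return extractTextFromNode(child);
  }
  return undefined;
}

const tree: MdNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 1, children: [{ type: 'text', value: 'Title' }] },
    {
      type: 'paragraph',
      children: [
        { type: 'text', value: 'Hello ' },
        { type: 'strong', children: [{ type: 'text', value: 'world' }] },
        { type: 'text', value: '.' },
      ],
    },
    { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Usage' }] },
    { type: 'paragraph', children: [{ type: 'text', value: 'Later text' }] },
  ],
};

const description = extractDescription(tree);
const noDescription = extractDescription({
  type: 'root',
  children: [
    { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Usage' }] },
    { type: 'paragraph', children: [{ type: 'text', value: 'Later text' }] },
  ],
});
```

For the first tree the result is `"Hello world."`; for the second, where the h2 comes before any paragraph, the result is `undefined`.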

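### 3. Integration into `getPageIndexInfoByRoute`
- Lines 171-174: when `frontmatter.description` does not exist, call `extractDescription` to extract the description
- Line 239: set description in the returned frontmatter, preferring the frontmatter value and otherwise using the extracted value

### New tests
- Test automatic description extraction (when there is no frontmatter.description)
- Test frontmatter.description priority (no extraction when frontmatter.description is present)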
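## Change summary

### 1. `packages/shared/src/types/index.ts`
Added a new `description?: string` field to the `PageIndexInfo` interface.

### 2. `packages/core/src/node/route/extractPageData.ts`
- Added the `extractTextFromNode` function: recursively extracts the plain text content of a node
- Added the `extractDescription` function: extracts the first paragraph of text before the first h2 from the AST
- Sets a separate `description` field in the returned result, while `frontmatter.description` is left as-is

### 3. Test updates
- Added 3 test cases to verify the description extraction logic
- Verify that when `frontmatter.description` exists, the `description` field uses its value and `frontmatter.description` stays unchanged
- Verify that when `frontmatter.description` does not exist, the `description` field uses the value extracted from the content and `frontmatter.description` stays `undefined`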
Copilot AI review requested due to automatic review settings January 14, 2026 12:10
@SoonIter SoonIter changed the title from "Add a remark plugin to extractPageData to get the description (vibe-kanban)" to "feat(core): auto-extract description from first paragraph before h2 in extractPageData" Jan 14, 2026
@netlify

netlify bot commented Jan 14, 2026

Deploy Preview for rspress-v2 ready!

Name Link
🔨 Latest commit 160dd7f
🔍 Latest deploy log https://app.netlify.com/projects/rspress-v2/deploys/696887bb39875600082893ed
😎 Deploy Preview https://deploy-preview-3006--rspress-v2.netlify.app

@github-actions
Contributor

github-actions bot commented Jan 14, 2026

Rsdoctor Bundle Diff Analysis

Found 3 projects in monorepo, 3 projects with changes.

📊 Quick Summary
| Project | Total Size | Change |
| --- | --- | --- |
| node | 10.4 MB | +33.5 KB (0.3%) |
| node_md | 1.3 MB | 📈 +36.2 KB (+2.8%) |
| web | 15.9 MB | +79.8 KB (0.5%) |
📋 Detailed Reports (Click to expand)

📁 node

Path: website/doc_build/diff-rsdoctor/node/rsdoctor-data.json

📌 Baseline Commit: 8b7c0550c4 | PR: #3005

| Metric | Current | Baseline | Change |
| --- | --- | --- | --- |
| 📊 Total Size | 10.4 MB | 10.4 MB | +33.5 KB (0.3%) |
| 📄 JavaScript | 0 B | 0 B | 0 |
| 🎨 CSS | 0 B | 0 B | 0 |
| 🌐 HTML | 10.4 MB | 10.4 MB | +33.5 KB (0.3%) |
| 📁 Other Assets | 0 B | 0 B | 0 |

📦 Download Diff Report: node Bundle Diff

📁 node_md

Path: website/doc_build/diff-rsdoctor/node_md/rsdoctor-data.json

📌 Baseline Commit: 8b7c0550c4 | PR: #3005

| Metric | Current | Baseline | Change |
| --- | --- | --- | --- |
| 📊 Total Size | 1.3 MB | 1.3 MB | +36.2 KB (+2.8%) |
| 📄 JavaScript | 0 B | 0 B | 0 |
| 🎨 CSS | 0 B | 0 B | 0 |
| 🌐 HTML | 0 B | 0 B | 0 |
| 📁 Other Assets | 1.3 MB | 1.3 MB | +36.2 KB (+2.8%) |

📦 Download Diff Report: node_md Bundle Diff

📁 web

Path: website/doc_build/diff-rsdoctor/web/rsdoctor-data.json

📌 Baseline Commit: 8b7c0550c4 | PR: #3005

| Metric | Current | Baseline | Change |
| --- | --- | --- | --- |
| 📊 Total Size | 15.9 MB | 15.8 MB | +79.8 KB (0.5%) |
| 📄 JavaScript | 15.1 MB | 15.0 MB | +39.8 KB (0.3%) |
| 🎨 CSS | 126.3 KB | 126.3 KB | 0 |
| 🌐 HTML | 0 B | 0 B | 0 |
| 📁 Other Assets | 713.0 KB | 672.9 KB | +40.0 KB (+5.9%) |

📦 Download Diff Report: web Bundle Diff

Generated by Rsdoctor GitHub Action

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds functionality to automatically extract page descriptions from markdown content when they are not explicitly provided in frontmatter. The description is extracted from the first paragraph that appears before any h2 heading.

Changes:

  • Added optional description field to PageIndexInfo type
  • Implemented description extraction logic that uses the first paragraph before h2 as fallback
  • Added comprehensive test coverage including new test fixture and test cases
  • Updated existing test snapshots to include description field

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| packages/shared/src/types/index.ts | Added optional description field to PageIndexInfo interface |
| packages/core/src/node/route/fixtures/content-processing/with-description.mdx | New test fixture for description extraction feature |
| packages/core/src/node/route/extractPageData.ts | Implemented extractTextFromNode and extractDescription helper functions, integrated description extraction into page data processing |
| packages/core/src/node/route/extractPageData.test.ts | Added three new test cases and updated existing test snapshots to include description field |


SoonIter and others added 6 commits January 14, 2026 20:22
- Update plugin-llms to use page.description for llms.txt generation
- Update Layout component to use page.description for meta tags

This ensures the auto-extracted description is used when
frontmatter.description is not provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
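llmsTxt currently cannot access page.description from createPageData.

Trying to solve this; one possible solution is to store the output of createPageData on a global variable and read it from llmsTxt.ts
1. Improve the tests by splitting them into two describe blocks, with getPageIndexInfoByRoute in its own describe
2. Fix the error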
 FAIL  packages/core/src/node/route/extractPageData.test.ts > extractPageData > basic
TypeError: routeService.getRoutePageByRoutePath is not a function
  251 |       const pageIndexInfo = await getPageIndexInfoByRoute(routeMeta, options);
  252 |       // Store pageIndexInfo in RoutePage for llmsTxt to access
> 253 |       const routePage = routeService.getRoutePageByRoutePath(
      |                                      ^
  254 |         routeMeta.routePath,
  255 |       );
        at packages/core/src/node/route/extractPageData.ts:253:38
        at extractPageData (packages/core/src/node/route/extractPageData.ts:249:20)
        at packages/core/src/node/route/extractPageData.test.ts:28:22
Refactor extractDescription to collect all paragraph text between h1 and h2
instead of just the first paragraph. Skip code blocks, HTML, imports, tables,
and images following Docusaurus createExcerpt strategy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
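
The refactor described in this commit can be sketched as below. This is only an illustration under assumptions: `MdNode` is a locally declared mdast-like shape, the skipped node type names (`code`, `html`, `mdxjsEsm`, `table`, `image`) are the usual mdast/MDX ones rather than names confirmed by the PR, and joining paragraphs with a single space is a guess at the behavior.

```typescript
// Minimal mdast-like node shape, declared locally so the sketch is self-contained.
interface MdNode {
  type: string;
  value?: string;
  depth?: number;
  children?: MdNode[];
}

// Node types that contribute nothing to the description (assumed names).
const SKIPPED_TYPES = new Set(['code', 'html', 'mdxjsEsm', 'table', 'image']);

// Collect readable text from a node, ignoring skipped node types.
function collectText(node: MdNode): string {
  if (SKIPPED_TYPES.has(node.type)) return '';
  if (node.type === 'text' || node.type === 'inlineCode') return node.value ?? '';
  return (node.children ?? []).map(collectText).join('');
}

// Gather the text of every paragraph that appears before the first h2.
function extractDescriptionBeforeH2(root: MdNode): string {
  const parts: string[] = [];
  for (const child of root.children ?? []) {
    if (child.type === 'heading' && child.depth === 2) break;
    // Only top-level paragraphs are collected; code, HTML, imports and tables are passed over.
    if (child.type !== 'paragraph') continue;
    const text = collectText(child).trim();
    if (text) parts.push(text);
  }
  return parts.join(' ');
}

const doc: MdNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 1, children: [{ type: 'text', value: 'Title' }] },
    { type: 'mdxjsEsm', value: "import Foo from './foo';" },
    { type: 'paragraph', children: [{ type: 'text', value: 'First.' }] },
    { type: 'code', value: 'npm install' },
    {
      type: 'paragraph',
      children: [{ type: 'image' }, { type: 'text', value: ' Second.' }],
    },
    { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Usage' }] },
    { type: 'paragraph', children: [{ type: 'text', value: 'Ignored.' }] },
  ],
};

const collected = extractDescriptionBeforeH2(doc);
```

With this input, `collected` is `"First. Second."`: the import, the code block, and the image are dropped, and the paragraph after the h2 is never reached.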
@SoonIter SoonIter changed the title from "feat(core): auto-extract description from first paragraph before h2 in extractPageData" to "feat(core): auto-extract description from first contentful line in extractPageData" Jan 14, 2026
SoonIter and others added 3 commits January 15, 2026 12:14
- Reuse processor instances at module level instead of creating new ones per file
- Use structuredClone to reuse parsed AST tree, avoiding duplicate parsing
- Use unist-util-visit for more efficient tree traversal in remarkStripLinkUrls

These optimizations reduce the overhead when processing multiple markdown files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
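
The first two optimizations in this commit can be illustrated with a small sketch. Everything here is hypothetical: `createProcessor` is a toy stand-in for building a real markdown processor, and `Tree` is a simplified tree type. Only `structuredClone` is a real global. Note that a later commit in this PR drops the clone once the original tree is no longer needed.

```typescript
// Simplified tree type for the sketch.
interface Tree {
  type: 'root';
  children: { type: string; value: string }[];
}

let processorsCreated = 0;

// Hypothetical stand-in for constructing a markdown processor (assumed to be expensive).
function createProcessor() {
  processorsCreated += 1;
  return {
    parse(source: string): Tree {
      return {
        type: 'root',
        children: source
          .split('\n\n')
          .map((value) => ({ type: 'paragraph', value })),
      };
    },
    // Mutates the tree in place, as many transformers do.
    run(tree: Tree): Tree {
      for (const child of tree.children) child.value = child.value.toUpperCase();
      return tree;
    },
  };
}

// Created once at module load and shared by every file, instead of once per file.
const sharedProcessor = createProcessor();

function processFile(source: string) {
  // Parse once, then clone the tree so the transform cannot corrupt the original.
  const original = sharedProcessor.parse(source);
  const transformed = sharedProcessor.run(structuredClone(original));
  return { original, transformed };
}

const results = ['a\n\nb', 'c'].map(processFile);
```

After processing both inputs, `processorsCreated` is still `1`, and each `original` tree keeps its lowercase values while the `transformed` copy is uppercased.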
The tree is no longer needed after extracting TOC and description,
so we can pass it directly to stringifyProcessor.run() without cloning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove separate parseProcessor, use the same processor for both
parse() and run()+stringify() operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@SoonIter SoonIter requested a review from Timeless0911 January 15, 2026 06:59
@SoonIter
Member Author

SoonIter commented Jan 15, 2026

This PR also includes some performance improvements and adds some cold start time.


@SoonIter SoonIter merged commit 4d0815f into main Jan 15, 2026
8 checks passed
@SoonIter SoonIter deleted the syt-vibe-kanban/99b2-extractpagedata branch January 15, 2026 08:16

3 participants