docs: add LLM-friendly content export (llms.txt / llms-full.txt) #20993

Closed
bloxster wants to merge 2 commits into release/3.4 from docs/llms-txt

@bloxster bloxster commented May 5, 2026

Collaborator

Summary

  • Adds docusaurus-plugin-llms-txt v0.1.3 to docs/site
  • At build time the plugin generates two static files served at the site root:
    • https://docs.erigon.tech/llms.txt — page index with short descriptions (LLM routing)
    • https://docs.erigon.tech/llms-full.txt — full text of all docs (long-context LLMs)
  • Matches the LLM content export already live at cocoon.erigon.tech/llms-full.txt

Test plan

  • cd docs/site && npm run build — clean build
  • Verify out/llms.txt and out/llms-full.txt are generated
  • Check https://docs.erigon.tech/llms.txt is accessible after deploy
  • Check https://docs.erigon.tech/llms-full.txt is accessible after deploy

🤖 Generated with Claude Code

Adds docusaurus-plugin-llms-txt@0.1.3 which generates two files at
build time:
- /llms.txt     — page index with short descriptions (for LLM routing)
- /llms-full.txt — full text of all docs (for long-context LLMs)

Matches the LLM content export already live at cocoon.erigon.tech.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bloxster bloxster enabled auto-merge (squash) May 5, 2026 09:23

@yperbasis yperbasis left a comment

Member

Concerns worth addressing before merge

  1. Supply-chain attribution doesn't match what's on npm

The PR description links to https://github.com/PaloAltoNetworks/docusaurus-plugin-llms-txt. That URL does not exist — gh api repos/PaloAltoNetworks/docusaurus-plugin-llms-txt returns
404. PaloAltoNetworks publishes docusaurus-openapi-docs, not this one.

What you'd actually be installing:

  • npm metadata: no repository, no homepage, no bugs field
  • Sole maintainer: jverre jverre@gmail.com (personal Gmail, single account)
  • Last published: 2024-12-30 (≈16 months stale as of today)
  • Versions ever published: 0.1.0 → 0.1.3 (4 patch-level releases, no minor/major)

Compare with two healthier alternatives that exist on npm under the same name pattern:

 ┌───────────────────────────────────────────┬─────────┬──────────────────────────────────────────┬───────────────────────────────┐
 │                  Package                  │ Version │                   Repo                   │          Maintenance          │
 ├───────────────────────────────────────────┼─────────┼──────────────────────────────────────────┼───────────────────────────────┤
 │ docusaurus-plugin-llms-txt (this PR)      │ 0.1.3   │ none declared                            │ last release 2024-12-30       │
 ├───────────────────────────────────────────┼─────────┼──────────────────────────────────────────┼───────────────────────────────┤
 │ @signalwire/docusaurus-plugin-llms-txt    │ 1.2.2   │ github.com/signalwire/docusaurus-plugins │ active, scoped, repo declared │
 ├───────────────────────────────────────────┼─────────┼──────────────────────────────────────────┼───────────────────────────────┤
 │ din0s/docusaurus-plugin-llms-txt (GitHub) │ —       │ github.com/din0s/... (32★)               │ last commit 2026-04-23        │
 └───────────────────────────────────────────┴─────────┴──────────────────────────────────────────┴───────────────────────────────┘

Recommendation: Either switch to @signalwire/docusaurus-plugin-llms-txt (mature, scoped, declared repo, 1.x semver), or — if there's a specific reason to keep the unscoped one — fix
the PR description to point at the actual upstream and call out that it's a single-maintainer 0.x package with no repo metadata. As written, the description tells reviewers it's a
Palo Alto Networks plugin, which it isn't.
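
The metadata check described above can be sketched in a few lines. This is a minimal illustration only: `missing_provenance` is a hypothetical helper, and both dictionaries are hand-written samples shaped like `package.json` metadata, not data fetched from npm.

```python
from typing import Dict, List

# Provenance fields the review looks for in a package's metadata.
PROVENANCE_FIELDS = ("repository", "homepage", "bugs")


def missing_provenance(metadata: Dict[str, object]) -> List[str]:
    """Return the provenance fields that are absent or empty in `metadata`."""
    return [field for field in PROVENANCE_FIELDS if not metadata.get(field)]


# Hypothetical sample shaped like the unscoped package in this PR.
unscoped = {"name": "docusaurus-plugin-llms-txt", "version": "0.1.3"}

# Hypothetical sample for a package that declares its repository.
# Only the fields needed for the illustration are filled in.
scoped = {
    "name": "@signalwire/docusaurus-plugin-llms-txt",
    "version": "1.2.2",
    "repository": {"url": "github.com/signalwire/docusaurus-plugins"},
}

print(missing_provenance(unscoped))
print(missing_provenance(scoped))
```

A check like this only flags missing declarations. It says nothing about whether a declared repository actually matches the published tarball.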

  2. The docs site build isn't in CI

I grepped .github/workflows/ for any docusaurus / docs-site / cd docs/site invocation. None found. That means:

  • No PR check verifies that npm run build still succeeds with this plugin added.
  • The four-item Test plan is entirely manual and won't be run in CI.
  • A future Docusaurus or React bump could silently break the docs build, and you'd discover it on deploy.

This isn't a blocker for this PR (the change is too small to break much), but adding a docs-site build job is cheap insurance and would have been the natural place to gate this
change.

  3. Plugin is configured with defaults — verify it indexes both docs plugins

docs/site/docusaurus.config.ts registers two docs plugins:

  • the default docs plugin via preset-classic (root /)
  • a second @docusaurus/plugin-content-docs instance with id: 'help-center', routeBasePath: 'help-center'

docusaurus-plugin-llms-txt is added as a bare string with no config. Worth confirming locally that both plugin instances end up in llms-full.txt — the test plan currently only checks
the file exists, not that help-center pages are included. If they aren't, you'll need an explicit { include: [...] } or similar option.
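
One way to make that local check repeatable is a small script along these lines. It is a sketch under an assumption this thread does not confirm: that the generated `llms-full.txt` contains absolute page URLs. `pages_under` is a hypothetical helper and the sample text is made up.

```python
def pages_under(text: str, site: str, route_base: str) -> int:
    """Count occurrences of URLs that start with <site>/<route_base>."""
    prefix = f"{site.rstrip('/')}/{route_base.strip('/')}"
    return text.count(prefix)


# Made-up excerpt standing in for the contents of llms-full.txt.
sample = (
    "https://docs.erigon.tech/get-started\n"
    "https://docs.erigon.tech/help-center/faq\n"
)

help_center_hits = pages_under(sample, "https://docs.erigon.tech", "help-center")
print(help_center_hits)
```

A result of zero for `help-center` on the real file would indicate that the second docs plugin instance was not indexed. The match is a plain prefix match, so a route that merely starts with the same characters would also be counted.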

  4. Drive-by observations
  • The PR description is marked "🤖 Generated with Claude Code." The PaloAltoNetworks attribution looks like a model hallucination of the upstream — a quick npm-page check would have
    caught it. Worth adding a manual review step for PR descriptions that name external orgs.
  • package-lock.json adds only the plugin and no transitive deps beyond fs-extra / gray-matter, both of which were already resolved in the tree (no new install footprint). Good.
  • engines.node: ">=16.14" from the plugin is satisfied by your >=20.0 floor; no action needed.

Verdict

Request changes — not for the code itself (the diff is fine), but to:

  1. Correct the upstream attribution in the PR description, or swap to @signalwire/docusaurus-plugin-llms-txt which has declared provenance and active maintenance.
  2. Manually verify the help-center docs make it into llms-full.txt before merging (the current test-plan checkboxes don't cover this).
  3. Optional follow-up: add a tiny docs-site build job to CI so future plugin/dep bumps are gated.

If (1) is "we deliberately picked the jverre package, here's why," that should be stated in the description so reviewers don't have to dig into npm metadata to discover the gap
between the PR text and what's actually being installed.

bloxster commented May 5, 2026

Collaborator Author

Closing in favour of #21000.

After review, we decided against docusaurus-plugin-llms-txt:

  1. Wrong conversion direction — the plugin compiles MDX → HTML then converts it back to markdown. Our source is already clean markdown; the HTML round-trip is lossy and unnecessary.
  2. Supply chain concerns — no declared source repo, personal Gmail maintainer, 16 months without updates.
  3. Build coupling — requires a full Docusaurus build to generate the files, adding CI weight for a task that doesn't need it.

PR #21000 replaces this with a ~130-line pure Python stdlib script (docs/site/scripts/generate-llms.py) that reads the .mdx source files directly, strips MDX syntax, and produces cleaner output with zero npm dependencies.

@bloxster bloxster closed this May 5, 2026
auto-merge was automatically disabled May 5, 2026 13:29

Pull request was closed

bloxster added a commit that referenced this pull request May 9, 2026
## Summary

- Adds `docs/site/scripts/generate-llms.py` — a pure Python (stdlib
only, zero npm deps) script that generates LLM-friendly content exports
from the Docusaurus source files directly
- Generates `docs/site/static/llms.txt` (page index, 71 pages) and
`docs/site/static/llms-full.txt` (full clean markdown, ~351 KB), served
at `docs.erigon.tech/llms.txt` and `docs.erigon.tech/llms-full.txt`
- Updates the repo-root `llms.txt`, which was pointing to the deleted
`docs/gitbook/` folder — now mirrors the Docusaurus-generated index with
live `docs.erigon.tech` URLs
- Adds a CI guard in `.github/workflows/docs-deploy.yml` that runs
`generate-llms.py --check` and the unit tests before the npm build,
blocking drift between any of the four committed files (root +
`static/`)
- Adds a unit test suite (`docs/site/scripts/test_generate_llms.py`, 25
tests) covering placeholder preservation, fence transparency, JSX
stripping, multi-line expr blocks, frontmatter parsing, and landing-page
synthesis

## Why a custom script instead of `docusaurus-plugin-llms-txt` (replaces #20993)

PR #20993 used the `docusaurus-plugin-llms-txt@0.1.3` npm package. After
review, we decided against it:

- **Wrong approach**: the plugin works on *compiled HTML output* and
converts it back to markdown — a lossy round-trip. Our source is already
markdown.
- **Supply chain risk**: the package has no declared source repo, is
maintained by a personal Gmail address, and has not been updated in 16
months.
- **Unnecessary dependency**: a Python stdlib script does the same job
with no external dependencies, no build-time coupling, and cleaner
output.

The custom script reads `.md`/`.mdx` files directly, strips MDX-specific
syntax (imports, JSX components, HTML tags, expressions), extracts
frontmatter titles and descriptions, and maps file paths to their
deployed `docs.erigon.tech` URLs. Both Docusaurus plugin instances (main
docs and help-center) are supported. Card-grid landing pages (e.g.
`docs/index.mdx`) are detected via the `lp-card` JSX pattern and
synthesized into structured "## Sections" + bullet lists rather than
collapsing into a soup of title/desc fragments.
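
The stripping step described above can be illustrated with a minimal sketch. This is not the code in `generate-llms.py`: `strip_mdx` and the regexes are assumptions written for illustration, and they cover only import/export lines and tags on a single line.

```python
import re

FENCE = re.compile(r"^\s*(```|~~~)")
IMPORT_EXPORT = re.compile(r"^(import|export)\s")
# Matches HTML/JSX tags but leaves ALL_CAPS placeholders such as <IP> alone.
TAG = re.compile(r"</?(?![A-Z_]+>)[A-Za-z][^>]*>")


def strip_mdx(source: str) -> str:
    """Drop MDX imports/exports and tags outside fenced code blocks."""
    out = []
    in_fence = False
    for line in source.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence  # assumes fences are well-formed pairs
            out.append(line)
        elif in_fence:
            out.append(line)  # fenced content passes through unchanged
        elif IMPORT_EXPORT.match(line):
            continue  # MDX import/export statement
        else:
            out.append(TAG.sub("", line))
    return "\n".join(out)


sample = "\n".join([
    'import Tabs from "@theme/Tabs";',
    "<Tabs>",
    "Connect to <IP> now.",
    "</Tabs>",
    "```bash",
    "export VAR=1",
    "```",
])
cleaned = strip_mdx(sample)
print(cleaned)
```

Tracking fence state line by line is what keeps shell `export VAR=…` lines intact while the same keyword at the start of an MDX line is removed.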

## How to update

Re-run the script whenever doc content changes:

```bash
python3 docs/site/scripts/generate-llms.py
```

To verify on-disk files match what the script would generate (used by
CI):

```bash
python3 docs/site/scripts/generate-llms.py --check
```

The CI guard in `docs-deploy.yml` runs `--check` and the unittest suite
on every push touching `docs/site/**`, so a forgotten regeneration after
a docs edit will fail the build before deploy.
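
As a rough picture of what a `--check` mode does, the sketch below compares freshly generated content with what is on disk and reports any file that differs. `find_drift` is a hypothetical helper, not the script's actual function, and the file contents are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_drift(expected: Dict[Path, str]) -> List[Path]:
    """Return paths whose on-disk content differs from the expected content."""
    drifted = []
    for path, content in expected.items():
        if not path.exists() or path.read_text(encoding="utf-8") != content:
            drifted.append(path)
    return drifted


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "llms.txt"
    target.write_text("old description\n", encoding="utf-8")

    # Pretend the generator now produces different text for this file.
    fresh = {target: "new description\n"}
    stale = find_drift(fresh)   # committed file is out of date

    target.write_text(fresh[target], encoding="utf-8")
    clean = find_drift(fresh)   # regenerated file matches

print(len(stale), len(clean))
```

A CI step built on this idea would exit non-zero whenever the returned list is non-empty, which is how a forgotten regeneration surfaces before deploy.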

## Updates after review (commit `05a81fcd`)

Addressing yperbasis CHANGES_REQUESTED + Copilot follow-ups:

**Blockers**
- ✅ Preserve `{ERIGON_VERSION}` and other ALL_CAPS identifier
placeholders in prose and table cells. The brace-strip regex now skips
pure-uppercase identifier braces, mirroring the existing `<IP>`/`<PID>`
angle-tag guard. Verified against the install-instructions table cell
(`erigon_{ERIGON_VERSION}_amd64.deb`) and the version selector prose
(`(e.g., v{ERIGON_VERSION})`) the reviewer flagged.
- ✅ Test-plan H1 assertion replaced — the prior `^# ` count incorrectly
counted shell comments inside `bash` fences (e.g. `# Reduce disk latency
impact`). Now uses `^URL: ` (one synthetic URL line per page = 71).
- ✅ Drift guard via CI rather than `prebuild` (catches drift in all 4
files, no Python coupling in the npm build path).
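
The placeholder guard from the first blocker can be sketched as a single regex with a negative lookahead. The pattern and `strip_brace_expressions` are illustrative assumptions, not the exact code from the commit.

```python
import re

# Matches a single-line {...} expression unless the braces wrap only an
# ALL_CAPS identifier such as {ERIGON_VERSION}.
BRACE_EXPR = re.compile(r"\{(?![A-Z][A-Z0-9_]*\})[^{}\n]*\}")


def strip_brace_expressions(line: str) -> str:
    """Remove MDX-style brace expressions but keep ALL_CAPS placeholders."""
    return BRACE_EXPR.sub("", line)


kept_cell = strip_brace_expressions("erigon_{ERIGON_VERSION}_amd64.deb")
kept_prose = strip_brace_expressions("(e.g., v{ERIGON_VERSION})")
stripped = strip_brace_expressions("Total: {props.count} items")
print(kept_cell, kept_prose, stripped, sep="\n")
```

The first two inputs are the table cell and prose cases named in the review. The third is a made-up JSX expression showing what still gets removed.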

**Non-blocking review items**
- ✅ Singleton "## Erigon Docs" header dropped — the Introduction bullet
sits directly under the preamble now.
- ✅ Landing-page MDX synthesis (no more title/desc soup for
`docs/index.mdx`, `staking/index.mdx`, `help-center/index.mdx`, etc.).
- ✅ `parse_frontmatter` hardened: skip indented YAML continuations,
`_safe_int` wrapper for `sidebar_position`.
- ✅ Nested `_category_.json` honored via `ancestor_positions()` for sort
tie-breaking.
- ✅ `--check` flag for CI.
- ✅ `first_description` tightened — only skip lines that *look like* JSX
leaks (`^<tag`, `^{`, arrow-fn) instead of skipping any line that
mentions those tokens mid-sentence.
- ✅ `# Requires: Python 3.8+` documented at the top.

## Test plan

### Deployment
- [ ] `llms.txt` renders correctly at `docs.erigon.tech/llms.txt` after
deploy
- [ ] `llms-full.txt` renders at `docs.erigon.tech/llms-full.txt`
- [ ] Root `llms.txt` no longer references deleted `docs/gitbook/` paths
- [ ] Re-running the script produces identical output (`--check` returns
OK)

### Output quality — run after regenerating

**Page index (`llms.txt`)**
- [ ] Every section header (`## Get Started`, `## Fundamentals`, etc.)
appears exactly once
- [ ] No singleton section header (the Introduction bullet should sit
directly under the preamble, no `## Erigon Docs` line above it)
- [ ] Index pages (e.g. `get-started/index.mdx`) appear before their
siblings within each section
- [ ] No entry has a blank or missing title
- [ ] No entry description contains raw JSX (`<Component`, `{props.`,
`import `)

**Full export (`llms-full.txt`)**
- [ ] No page has back-to-back duplicate H1 headings (synthetic title +
body's own H1)
- [ ] Fenced code blocks are intact — content between fences is
unchanged, including shell `export VAR=…` lines
- [ ] Inline code placeholders survive — `{ERIGON_VERSION}`,
`<YOUR_ADDRESS>` style tokens are preserved both inside backtick spans
and in bare prose / table cells
- [ ] No truncated shell commands — `curl`, `docker run`, `erigon`
invocations with `{…}` args are complete
- [ ] Nested list indentation is preserved — sublists appear indented,
not flush-left
- [ ] No raw HTML/JSX tags leak into prose (`<Link`, `<Tabs`, `<div`,
`<section`)
- [ ] No raw MDX imports/exports leak (`import Link from`, `export
const`)
- [ ] Landing pages (`docs/index.mdx`, `help-center/index.mdx`, etc.)
emit a `## Sections` heading + bullet list, not unstructured title/desc
fragments

**Sanity checks (quick greps)**
```bash
# Page count — synthetic URL line per page (should equal 71)
grep -c '^URL: ' docs/site/static/llms-full.txt

# Real JSX component leaks — uppercase-then-lowercase tag pattern (should be 0)
grep -cE '<[A-Z][a-z][a-zA-Z]+' docs/site/static/llms-full.txt

# MDX imports/exports leaked outside fences (should be 0)
grep -cE '^(import|export const|export function|export default)' docs/site/static/llms-full.txt

# Identifier placeholders preserved — should be > 0 if source uses any
grep -c '{ERIGON_VERSION}' docs/site/static/llms-full.txt

# Shell `export VAR=` lines preserved inside ```bash fences — should be > 0
grep -c '^export ' docs/site/static/llms-full.txt
```

Current values (regenerated, commit `05a81fcd`): URL 71, JSX leaks 0,
MDX imports/exports 0, `{ERIGON_VERSION}` 15, `^export ` 9.
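
The reason the page count uses `^URL: ` rather than `^# ` can be shown on a tiny made-up page. The sample layout (one URL line followed by an H1) is an assumption based on the description above, not an excerpt from the real export.

```python
import re

sample = "\n".join([
    "URL: https://docs.erigon.tech/example-page",
    "# Example Page",
    "",
    "```bash",
    "# Reduce disk latency impact",
    "echo tuned",
    "```",
])

# Lines that look like H1 headings, including shell comments inside fences.
h1_like = len(re.findall(r"^# ", sample, flags=re.MULTILINE))
# One synthetic URL line per page.
url_lines = len(re.findall(r"^URL: ", sample, flags=re.MULTILINE))
print(h1_like, url_lines)
```

With a single page in the sample, the heading-style count comes out higher than the page count because the shell comment is matched too, while the URL count stays at one.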

### Tests
```bash
python3 -m unittest discover docs/site/scripts -v
# Ran 25 tests in 0.001s — OK
```

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

---------

Co-authored-by: Bloxster <bloxster@proton.me>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
bloxster added a commit that referenced this pull request May 11, 2026
…21074)

## Summary

The `Deploy Docs` CI was failing on `release/3.4` ([run
#25594249089](https://github.com/erigontech/erigon/actions/runs/25594249089))
because the llms.txt artifacts were out of date.

**Root cause:** The w19 maintenance PR (#21018) updated the
`sync-modes.md` description to add `"blocks"` mode, but the llms.txt
files were not regenerated. When the llms generator PR (#21000) merged,
CI ran `generate-llms.py --check` and found the committed files did not
match the generator output.

**Fix:** Regenerated all four artifacts (`llms.txt`, `llms-full.txt`,
`docs/site/static/llms.txt`, `docs/site/static/llms-full.txt`) by
running `python3 docs/site/scripts/generate-llms.py`.

Only change is one line in the Sync Modes description:
```
- Full, minimal, and archive sync...
+ Full, minimal, blocks, and archive sync...
```

---------

Co-authored-by: bloxster <bloxster@users.noreply.github.com>