
Outline: Shrink llms-txt output to <200_000 input tokens#57

Merged
amotl merged 1 commit into main from shrink-llms-txt on Jul 28, 2025
Conversation

@amotl (Member) commented Jul 27, 2025

Problem

When using the llms-full.txt file with Anthropic Claude Sonnet or Opus, requests fail because the file exceeds the maximum number of input tokens.

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error':
{'type': 'invalid_request_error', 'message': 'prompt is too long: 207831 tokens > 200000 maximum'}}

Solution

Omit a few resources to shrink the bundle file.
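As a sanity check, a shrunken bundle could be validated against the limit before shipping. A minimal sketch using the rough four-characters-per-token heuristic; the `estimate_tokens` helper and `check_bundle` function are illustrative, not part of this repository, and the real limit is enforced by the Anthropic API:

```python
from pathlib import Path

MAX_INPUT_TOKENS = 200_000


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4


def check_bundle(path: Path) -> bool:
    """Return True when the bundle likely fits the model's input window."""
    tokens = estimate_tokens(path.read_text(encoding="utf-8"))
    return tokens < MAX_INPUT_TOKENS
```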

@coderabbitai coderabbitai bot commented Jul 27, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 53 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between b5edd78 and 16bc361.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (1 hunks)
  • src/cratedb_about/outline/cratedb-outline.yaml (13 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)

Walkthrough

This change introduces a markdown_enabled flag to documentation outline items, updates the YAML outline to disable Markdown rendering for selected large or optional entries, and modifies the outline-to-Markdown export logic to exclude these disabled items. The process for generating llms.txt and llms-full.txt files is refactored, with output now limited to fewer than 200,000 input tokens.

Changes

Cohort / File(s) Change Summary
src/cratedb_about/outline/model.py Added markdown_enabled: bool = True attribute to OutlineItem. Modified Markdown export to skip items with markdown_enabled = False.
src/cratedb_about/outline/cratedb-outline.yaml Added markdown_enabled: false to many entries including large SQL reference sections, drivers, clients, examples, optional cloud features, and blog entries to disable Markdown rendering without changing content or structure.
src/cratedb_about/bundle/llmstxt.py Refactored writing of llms.txt and llms-full.txt to use intermediate Path variables. Changed full doc generation to optional=False. No control flow or error handling changes.
tests/test_outline.py Updated test search terms from "gtfs" to "toolkit" and adjusted assertions to reflect changes in outline entries marked with markdown_enabled: false.
CHANGES.md Updated changelog to note reduction of outline output for llms-txt to fewer than 200,000 input tokens.
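The model change described above can be sketched as follows. This is a minimal illustration of the described behavior, not the project's actual code; the `OutlineItem` and `Outline` class shapes and the `to_markdown` signature are assumptions based on the walkthrough:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutlineItem:
    title: str
    link: str = ""
    markdown_enabled: bool = True  # items default to being rendered


@dataclass
class Outline:
    items: List[OutlineItem] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render only items that still have Markdown rendering enabled."""
        lines = [
            f"- [{item.title}]({item.link})"
            for item in self.items
            if item.markdown_enabled
        ]
        return "\n".join(lines)
```

Because the flag defaults to `True`, existing YAML entries keep rendering unchanged; only entries explicitly marked `markdown_enabled: false` are skipped.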

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant OutlineLoader
    participant OutlineModel
    participant llmstxt.py

    User->>llmstxt.py: Generate llms.txt / llms-full.txt
    llmstxt.py->>OutlineLoader: Load YAML outline
    OutlineLoader->>OutlineModel: Parse items (with markdown_enabled flag)
    llmstxt.py->>OutlineModel: Export to Markdown (skip markdown_enabled: false)
    OutlineModel-->>llmstxt.py: Filtered Markdown output
    llmstxt.py->>User: Write llms.txt / llms-full.txt files

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • bmunkholm
  • kneth
  • surister

Poem

🐇 In docs' vast fields, some paths now rest,
"Markdown disabled" — a gentle jest.
Outlines trimmed, tokens light and few,
Files neat and tidy, just for you!
Hop, hop, hooray, the rabbits say,
Cleaner docs to brighten your day! 🌿✨


Comment on lines +293 to +297
They also influence the behaviour when the records are queried.
parents: [ sql-syntax ]
tags: [ sql ]
# FIXME: This needs about 40_000 input tokens. Maybe a stripped-down variant could help?
enabled: false
Member Author

The section # SQL: Data types needed to be disabled to save a whole bunch of input tokens.

FIXME: This needs about 40_000 input tokens. Maybe a stripped-down variant could help?

Please bring it back.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 481111b and c46b3b2.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (1 hunks)
  • src/cratedb_about/outline/cratedb-outline.yaml (13 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2 distinct, applied across CHANGES.md, src/cratedb_about/bundle/llmstxt.py, and src/cratedb_about/outline/cratedb-outline.yaml)

Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an `llms.txt` file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.

Learnt from: amotl
PR: #32
File: src/cratedb_about/outline/cratedb-outline.yaml:321-329
Timestamp: 2025-05-15T21:25:54.870Z
Learning: In the CrateDB outline YAML, content organization prioritizes thematic grouping (keeping related topics together) over content type grouping (separating tutorials from reference docs), as demonstrated by placing the multi-tenancy tutorial alongside user management and privileges documentation in the API section.

🧬 Code Graph Analysis (1)
src/cratedb_about/bundle/llmstxt.py (2)
src/cratedb_about/cli.py (1)
  • outline (55-75)
src/cratedb_about/outline/model.py (2)
  • to_markdown (72-84)
  • to_llms_txt (92-115)
🔇 Additional comments (6)
src/cratedb_about/outline/model.py (1)

28-28: Good addition of the enabled flag with appropriate default.

The enabled: bool = True attribute provides a clean way to control item visibility while maintaining backward compatibility.

CHANGES.md (1)

4-4: Clear and accurate changelog entry.

The entry succinctly documents the token limit reduction achievement, aligning with the PR objectives and using straightforward language appropriate for the llms.txt specification.

src/cratedb_about/bundle/llmstxt.py (1)

47-51: Clean refactoring with improved separation of concerns.

The changes correctly implement the intended behavior:

  • llms.txt now contains the filtered outline via to_markdown() (respecting enabled flags)
  • llms-full.txt contains the complete expanded documentation with optional=False
  • The intermediate Path variables improve code readability

This aligns well with the PR objective to reduce token count in the outline while maintaining full documentation access.
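A hedged sketch of the refactoring described here: build both output paths up front as `Path` variables, write the filtered outline to llms.txt, and the full expansion to llms-full.txt. The `outline` object and its `to_markdown`/`to_llms_txt` methods are stand-ins inferred from the review summary, not the project's real API:

```python
from pathlib import Path


def write_bundles(outline, outdir: Path) -> None:
    # Intermediate Path variables make the two outputs explicit.
    llms_txt: Path = outdir / "llms.txt"
    llms_full_txt: Path = outdir / "llms-full.txt"
    # llms.txt: the filtered outline, respecting per-item enabled flags.
    llms_txt.write_text(outline.to_markdown(), encoding="utf-8")
    # llms-full.txt: the complete expanded documentation (optional=False).
    llms_full_txt.write_text(outline.to_llms_txt(optional=False), encoding="utf-8")
```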

src/cratedb_about/outline/cratedb-outline.yaml (3)

296-297: Strategic disabling of high-token content.

The comment indicates this data types section requires ~40,000 tokens, making it a logical choice for exclusion to meet the <200,000 token target. The decision is well-documented.


530-552: Appropriate exclusion of optional and example content.

Disabling example projects, cloud-specific documentation, and some blog content is a reasonable approach to reduce token count while preserving core functionality documentation. These sections are typically less critical for basic CrateDB usage.

Also applies to: 566-603, 724-750


393-407: Enable Core Driver Documentation

Disabling all driver entries reduces token count but removes essential connection guidance. At a minimum, please re-enable the CrateDB Python Client reference:

• File: src/cratedb_about/outline/cratedb-outline.yaml
Section: Drivers (around the “CrateDB Python Client” entry)

Suggested diff:

   - title: "CrateDB Python Client"
     tags: [ driver ]
-    enabled: false
+    enabled: true

This ensures users still have a basic, well-supported example for connecting to CrateDB.

⛔ Skipped due to learnings
Learnt from: amotl, PR crate/about#32, src/cratedb_about/outline/cratedb-outline.yaml:321-329 (thematic grouping of outline content takes precedence).

@amotl force-pushed the shrink-llms-txt branch from c46b3b2 to e8d773f on July 27, 2025 23:12
@amotl requested review from WalBeh and hammerhead on July 27, 2025 23:38
@amotl marked this pull request as ready for review on July 27, 2025 23:39
@amotl force-pushed the shrink-llms-txt branch from e8d773f to b5edd78 on July 28, 2025 00:35
@amotl force-pushed the shrink-llms-txt branch from b5edd78 to 173220a on July 28, 2025 01:43
@amotl merged commit 1fa9370 into main on Jul 28, 2025
6 checks passed
@amotl deleted the shrink-llms-txt branch on July 28, 2025 08:26