
Outline: Shrink llms-txt output to <200_000 input tokens#57

Merged
amotl merged 1 commit into main from shrink-llms-txt on Jul 28, 2025
Conversation

@amotl (Member) commented Jul 27, 2025

Problem

When using the llms-full.txt file with Anthropic Claude Sonnet or Opus, requests fail because the file exceeds the maximum number of input tokens.

anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error':
{'type': 'invalid_request_error', 'message': 'prompt is too long: 207831 tokens > 200000 maximum'}}

Solution

Omit a few resources to shrink the bundle file.
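As a sanity check, a shrunken bundle could be validated against the limit before shipping. A minimal sketch using the rough four-characters-per-token heuristic; the `estimate_tokens` helper and `check_bundle` function are illustrative, not part of this repository, and the real limit is enforced by the Anthropic API:

```python
from pathlib import Path

MAX_INPUT_TOKENS = 200_000


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4


def check_bundle(path: Path) -> bool:
    """Return True when the bundle likely fits the model's input window."""
    tokens = estimate_tokens(path.read_text(encoding="utf-8"))
    return tokens < MAX_INPUT_TOKENS
```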

@coderabbitai coderabbitai bot commented Jul 27, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 53 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between b5edd78 and 16bc361.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (1 hunks)
  • src/cratedb_about/outline/cratedb-outline.yaml (13 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)

Walkthrough

This change introduces a markdown_enabled flag to documentation outline items, updates the YAML outline to disable Markdown rendering for selected large or optional entries, and modifies the outline-to-Markdown export logic to exclude these disabled items. The process for generating llms.txt and llms-full.txt files is refactored, with output now limited to fewer than 200,000 input tokens.

Changes

Cohort / File(s) Change Summary
src/cratedb_about/outline/model.py Added markdown_enabled: bool = True attribute to OutlineItem. Modified Markdown export to skip items with markdown_enabled = False.
src/cratedb_about/outline/cratedb-outline.yaml Added markdown_enabled: false to many entries including large SQL reference sections, drivers, clients, examples, optional cloud features, and blog entries to disable Markdown rendering without changing content or structure.
src/cratedb_about/bundle/llmstxt.py Refactored writing of llms.txt and llms-full.txt to use intermediate Path variables. Changed full doc generation to optional=False. No control flow or error handling changes.
tests/test_outline.py Updated test search terms from "gtfs" to "toolkit" and adjusted assertions to reflect changes in outline entries marked with markdown_enabled: false.
CHANGES.md Updated changelog to note reduction of outline output for llms-txt to fewer than 200,000 input tokens.
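The model change described above can be sketched as follows. This is a minimal illustration of the described behavior, not the project's actual code; the `OutlineItem` and `Outline` class shapes and the `to_markdown` signature are assumptions based on the walkthrough:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutlineItem:
    title: str
    link: str = ""
    markdown_enabled: bool = True  # items default to being rendered


@dataclass
class Outline:
    items: List[OutlineItem] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render only items that still have Markdown rendering enabled."""
        lines = [
            f"- [{item.title}]({item.link})"
            for item in self.items
            if item.markdown_enabled
        ]
        return "\n".join(lines)
```

Because the flag defaults to `True`, existing YAML entries keep rendering unchanged; only entries explicitly marked `markdown_enabled: false` are skipped.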

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant OutlineLoader
    participant OutlineModel
    participant llmstxt.py

    User->>llmstxt.py: Generate llms.txt / llms-full.txt
    llmstxt.py->>OutlineLoader: Load YAML outline
    OutlineLoader->>OutlineModel: Parse items (with markdown_enabled flag)
    llmstxt.py->>OutlineModel: Export to Markdown (skip markdown_enabled: false)
    OutlineModel-->>llmstxt.py: Filtered Markdown output
    llmstxt.py->>User: Write llms.txt / llms-full.txt files

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • bmunkholm
  • kneth
  • surister

Poem

🐇 In docs' vast fields, some paths now rest,
"Markdown disabled" — a gentle jest.
Outlines trimmed, tokens light and few,
Files neat and tidy, just for you!
Hop, hop, hooray, the rabbits say,
Cleaner docs to brighten your day! 🌿✨


Comment on lines +293 to +297
They also influence the behaviour when the records are queried.
parents: [ sql-syntax ]
tags: [ sql ]
# FIXME: This needs about 40_000 input tokens. Maybe a stripped-down variant could help?
enabled: false
Member Author

The section # SQL: Data types needed to be disabled to save a whole bunch of input tokens.

FIXME: This needs about 40_000 input tokens. Maybe a stripped-down variant could help?

Please bring it back.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 481111b and c46b3b2.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (1 hunks)
  • src/cratedb_about/outline/cratedb-outline.yaml (13 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2 distinct, applied across CHANGES.md, src/cratedb_about/bundle/llmstxt.py, and src/cratedb_about/outline/cratedb-outline.yaml)

Learnt from: amotl
PR: crate/about#0
File: :0-0
Timestamp: 2025-04-16T14:20:35.508Z
Learning: When creating content for an `llms.txt` file (following the llmstxt.org specification), consistent and straightforward language takes precedence over stylistic variation since the primary audience is language models rather than human readers.

Learnt from: amotl
PR: #32
File: src/cratedb_about/outline/cratedb-outline.yaml:321-329
Timestamp: 2025-05-15T21:25:54.870Z
Learning: In the CrateDB outline YAML, content organization prioritizes thematic grouping (keeping related topics together) over content type grouping (separating tutorials from reference docs), as demonstrated by placing the multi-tenancy tutorial alongside user management and privileges documentation in the API section.

🧬 Code Graph Analysis (1)
src/cratedb_about/bundle/llmstxt.py (2)
src/cratedb_about/cli.py (1)
  • outline (55-75)
src/cratedb_about/outline/model.py (2)
  • to_markdown (72-84)
  • to_llms_txt (92-115)
🔇 Additional comments (6)
src/cratedb_about/outline/model.py (1)

28-28: Good addition of the enabled flag with appropriate default.

The enabled: bool = True attribute provides a clean way to control item visibility while maintaining backward compatibility.

CHANGES.md (1)

4-4: Clear and accurate changelog entry.

The entry succinctly documents the token limit reduction achievement, aligning with the PR objectives and using straightforward language appropriate for the llms.txt specification.

src/cratedb_about/bundle/llmstxt.py (1)

47-51: Clean refactoring with improved separation of concerns.

The changes correctly implement the intended behavior:

  • llms.txt now contains the filtered outline via to_markdown() (respecting enabled flags)
  • llms-full.txt contains the complete expanded documentation with optional=False
  • The intermediate Path variables improve code readability

This aligns well with the PR objective to reduce token count in the outline while maintaining full documentation access.
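A hedged sketch of the refactoring described here: build both output paths up front as `Path` variables, write the filtered outline to llms.txt, and the full expansion to llms-full.txt. The `outline` object and its `to_markdown`/`to_llms_txt` methods are stand-ins inferred from the review summary, not the project's real API:

```python
from pathlib import Path


def write_bundles(outline, outdir: Path) -> None:
    # Intermediate Path variables make the two outputs explicit.
    llms_txt: Path = outdir / "llms.txt"
    llms_full_txt: Path = outdir / "llms-full.txt"
    # llms.txt: the filtered outline, respecting per-item enabled flags.
    llms_txt.write_text(outline.to_markdown(), encoding="utf-8")
    # llms-full.txt: the complete expanded documentation (optional=False).
    llms_full_txt.write_text(outline.to_llms_txt(optional=False), encoding="utf-8")
```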

src/cratedb_about/outline/cratedb-outline.yaml (3)

296-297: Strategic disabling of high-token content.

The comment indicates this data types section requires ~40,000 tokens, making it a logical choice for exclusion to meet the <200,000 token target. The decision is well-documented.


530-552: Appropriate exclusion of optional and example content.

Disabling example projects, cloud-specific documentation, and some blog content is a reasonable approach to reduce token count while preserving core functionality documentation. These sections are typically less critical for basic CrateDB usage.

Also applies to: 566-603, 724-750


393-407: Enable Core Driver Documentation

Disabling all driver entries reduces token count but removes essential connection guidance. At a minimum, please re-enable the CrateDB Python Client reference:

• File: src/cratedb_about/outline/cratedb-outline.yaml
Section: Drivers (around the “CrateDB Python Client” entry)

Suggested diff:

   - title: "CrateDB Python Client"
     tags: [ driver ]
-    enabled: false
+    enabled: true

This ensures users still have a basic, well-supported example for connecting to CrateDB.

⛔ Skipped due to learnings
Learnt from: amotl, PR crate/about#32, src/cratedb_about/outline/cratedb-outline.yaml:321-329 (thematic grouping of outline content takes precedence).

@amotl force-pushed the shrink-llms-txt branch from c46b3b2 to e8d773f on July 27, 2025 23:12
@amotl requested review from WalBeh and hammerhead on July 27, 2025 23:38
@amotl marked this pull request as ready for review on July 27, 2025 23:39
@amotl force-pushed the shrink-llms-txt branch from e8d773f to b5edd78 on July 28, 2025 00:35
@amotl force-pushed the shrink-llms-txt branch from b5edd78 to 173220a on July 28, 2025 01:43
@amotl merged commit 1fa9370 into main on Jul 28, 2025
6 checks passed
@amotl deleted the shrink-llms-txt branch on July 28, 2025 08:26