Bundle: Fix semantics of llms.txt vs. llms-full.txt #42

Merged
amotl merged 3 commits into main from llms-txt-fix-semantics
May 19, 2025


@amotl amotl commented May 18, 2025

Problem

The current llms.txt was wrong. Many other publications demonstrate that llms.txt should be a Markdown file that links to its referenced content rather than inlining it.
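To make the distinction concrete, here is a minimal sketch, not the project's actual implementation. The `Item` type, both render functions, and the placeholder URL are hypothetical. The assumption is that llms.txt is a Markdown index that only links to each resource, while llms-full.txt carries the referenced content inline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical outline entry: a titled link plus its fetched content."""

    title: str
    url: str
    description: str
    content: str


def render_llms_txt(project: str, items: List[Item]) -> str:
    """Index variant: Markdown links only, referenced content is NOT inlined."""
    lines = [f"# {project}", "", "## Docs", ""]
    lines += [f"- [{item.title}]({item.url}): {item.description}" for item in items]
    return "\n".join(lines) + "\n"


def render_llms_full_txt(project: str, items: List[Item]) -> str:
    """Full variant: the content behind each link is expanded in place."""
    parts = [f"# {project}", ""]
    for item in items:
        parts += [f"## {item.title}", "", f"Source: {item.url}", "", item.content, ""]
    return "\n".join(parts)


items = [
    Item(
        title="Example page",
        url="https://example.org/docs/page.md",  # placeholder URL
        description="What the page covers.",
        content="Full text of the page goes here.",
    )
]
index_text = render_llms_txt("Example project", items)
full_text = render_llms_full_txt("Example project", items)
```

In this sketch the index stays small enough to read in one pass, and a consumer that needs the details can either follow the links or load the full variant.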

References


coderabbitai bot commented May 18, 2025

Walkthrough

This update refactors the bundle generation process by introducing a subclass for resource management, adds HTML outline generation, and updates the CLI and tests to accommodate these changes. It also clarifies the semantics of bundle files in the changelog and enhances the outline model with HTML conversion functionality.

Changes

| File(s) | Change Summary |
| --- | --- |
| CHANGES.md | Updated the "Unreleased" changelog section to clarify the distinction between llms.txt and llms-full.txt, note the addition of outline.html, and reference issue ABOUT-39. |
| src/cratedb_about/bundle/llmstxt.py | Refactored LllmsTxtBuilder to use instance fields for resources and outline; added outline, readme_md, and outline_yaml fields; introduced CrateDbLllmsTxtBuilder subclass for resource defaults and outline loading; added HTML output. |
| src/cratedb_about/cli.py | Changed the CLI bundle command to use CrateDbLllmsTxtBuilder instead of LllmsTxtBuilder. |
| src/cratedb_about/outline/model.py | Added to_html method to OutlineDocument for converting the outline to HTML via Markdown. |
| tests/test_cli.py | Updated test to check for outline.html instead of outline.md in the bundle output. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant CrateDbLllmsTxtBuilder
    participant OutlineDocument

    CLI->>CrateDbLllmsTxtBuilder: Instantiate with resource paths and outline_url
    CrateDbLllmsTxtBuilder->>OutlineDocument: Load outline from outline_url
    CLI->>CrateDbLllmsTxtBuilder: run()
    CrateDbLllmsTxtBuilder->>OutlineDocument: to_html()
    CrateDbLllmsTxtBuilder->>FileSystem: Write outline.html

Possibly related PRs

Suggested reviewers

  • bmunkholm
  • kneth

Poem

A bundle now builds with a hop and a cheer,
Outline in HTML, crisp and clear!
Resources managed, the code refines,
Subclassed builders drawing new lines.
With every hop, our features grow—
🐇 Bundling knowledge, on we go!

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e20be49 and 99f4e21.

📒 Files selected for processing (5)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (4 hunks)
  • src/cratedb_about/cli.py (2 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)
  • tests/test_cli.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • CHANGES.md
  • src/cratedb_about/outline/model.py
  • src/cratedb_about/cli.py
  • src/cratedb_about/bundle/llmstxt.py

@amotl amotl requested review from bmunkholm and kneth May 18, 2025 17:20

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/cratedb_about/outline/model.py (1)

83-84: Add a docstring to maintain API documentation consistency.

The new to_html method would benefit from a docstring to maintain consistency with other methods in the class and provide clear documentation for users of this API.

def to_html(self) -> str:
+    """Convert outline into HTML format using Markdown as an intermediate step."""
    return markdown(self.to_markdown())
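
The two-step conversion can be sketched in isolation. In the PR, to_html passes the Markdown text to a `markdown(...)` call, presumably a Markdown-to-HTML function from a third-party package. To stay dependency-free, this sketch injects the renderer as a parameter and uses a placeholder that only escapes the text. `MiniOutline` and `preformatted` are hypothetical names, not part of the project.

```python
import html
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MiniOutline:
    """Hypothetical stand-in for OutlineDocument."""

    title: str
    sections: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the outline as Markdown, the intermediate format."""
        lines = [f"# {self.title}", ""]
        lines += [f"- {section}" for section in self.sections]
        return "\n".join(lines)

    def to_html(self, render: Callable[[str], str]) -> str:
        """Convert the outline to HTML, using Markdown as an intermediate step."""
        return render(self.to_markdown())


def preformatted(markdown_text: str) -> str:
    """Placeholder renderer: escapes the Markdown and wraps it in <pre>."""
    return f"<pre>{html.escape(markdown_text)}</pre>"


outline = MiniOutline(title="Docs & guides", sections=["Install", "Query"])
html_text = outline.to_html(preformatted)
```

Swapping `preformatted` for a real Markdown renderer would leave `to_html` unchanged, which is the point of routing the conversion through `to_markdown()`.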
src/cratedb_about/bundle/llmstxt.py (3)

25-25: Use a more specific type annotation for the outline field.

The outline field is annotated with t.Any, but based on its usage in the code and the initialization in __post_init__, it's clearly an instance of OutlineDocument returned by CrateDbKnowledgeOutline.load().

-    outline: t.Any = dataclasses.field(init=False)
+    outline: "OutlineDocument" = dataclasses.field(init=False)

Consider adding an import for OutlineDocument or using a forward reference string as shown.
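
A minimal sketch of the forward-reference variant follows, assuming OutlineDocument lives in cratedb_about.outline.model as the file list suggests. The `Builder` class is a hypothetical stand-in for the real builder. The import sits behind `typing.TYPE_CHECKING`, so it is never executed at runtime and the string annotation is left unresolved.

```python
import dataclasses
import typing as t

if t.TYPE_CHECKING:
    # Only evaluated by type checkers, never at runtime.
    from cratedb_about.outline.model import OutlineDocument


@dataclasses.dataclass
class Builder:
    """Hypothetical builder with a field that is set after construction."""

    outline_url: str
    # String annotation: the name is not resolved when the class is created.
    outline: "OutlineDocument" = dataclasses.field(init=False)


builder = Builder(outline_url="outline.yaml")
field_types = {f.name: f.type for f in dataclasses.fields(Builder)}
```

The dataclass stores the annotation as the plain string "OutlineDocument", which keeps the module importable even where the outline package is unavailable or would create an import cycle.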


65-74: Consider adding error handling for HTML generation.

While the copy_readme method has error handling for HTML generation, the copy_sources method doesn't have similar error handling for self.outline.to_html().

    def copy_sources(self):
        """
        Provide the source document in the original YAML format, but also converted to HTML.
        The intermediary Markdown format is already covered by the `llms.txt` file itself.
        """
        shutil.copy(
            str(self.outline_yaml),
            self.outdir / "outline.yaml",
        )
-        Path(self.outdir / "outline.html").write_text(self.outline.to_html())
+        try:
+            Path(self.outdir / "outline.html").write_text(self.outline.to_html())
+        except Exception as e:
+            logger.warning(f"Failed to generate HTML outline: {e}")
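
The suggested behavior can be exercised with a small standalone sketch. `BrokenOutline` and this free-standing `copy_sources` function are hypothetical and only mirror the shape of the method above. The boolean return value is an addition for illustration, not something the PR does.

```python
import logging
import shutil
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


class BrokenOutline:
    """Hypothetical outline whose HTML conversion always fails."""

    def to_html(self) -> str:
        raise ValueError("conversion failed")


def copy_sources(outline, outline_yaml: Path, outdir: Path) -> bool:
    """Copy the YAML source, then try to write the HTML rendering next to it.

    Returns True if outline.html was written, False if conversion failed.
    """
    shutil.copy(str(outline_yaml), outdir / "outline.yaml")
    try:
        (outdir / "outline.html").write_text(outline.to_html())
    except Exception as e:  # broad on purpose, mirroring the suggestion above
        logger.warning(f"Failed to generate HTML outline: {e}")
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "bundle"
    outdir.mkdir()
    source = Path(tmp) / "outline.yaml"
    source.write_text("title: example\n")
    html_written = copy_sources(BrokenOutline(), source, outdir)
    yaml_copied = (outdir / "outline.yaml").exists()
    html_exists = (outdir / "outline.html").exists()
```

With this shape, a failing conversion still leaves outline.yaml in the bundle and only logs a warning. Whether a silently missing outline.html is acceptable is a design decision for the maintainers.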

23-23: Widen the type annotation for the outline_url field.

The outline_url field is already annotated as str; consider widening it to an optional type.

-    outline_url: str
+    outline_url: t.Optional[str]

This would match the type used in CrateDbKnowledgeOutline.load() (from the provided relevant code snippets), which accepts an optional URL.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68fe7b7 and 6451737.

📒 Files selected for processing (5)
  • CHANGES.md (1 hunks)
  • src/cratedb_about/bundle/llmstxt.py (4 hunks)
  • src/cratedb_about/cli.py (2 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)
  • tests/test_cli.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/cratedb_about/cli.py (1)
src/cratedb_about/bundle/llmstxt.py (2)
  • CrateDbLllmsTxtBuilder (78-87)
  • run (29-47)
src/cratedb_about/bundle/llmstxt.py (3)
src/cratedb_about/cli.py (1)
  • outline (54-74)
src/cratedb_about/outline/model.py (3)
  • to_markdown (71-81)
  • to_llms_txt (86-109)
  • to_html (83-84)
src/cratedb_about/outline/core.py (2)
  • CrateDbKnowledgeOutline (9-76)
  • load (60-76)
🔇 Additional comments (6)
tests/test_cli.py (1)

68-68: LGTM! Test updated to verify the new HTML output format.

The test has been correctly updated to check for the existence of the new HTML output file, which aligns with the changes made in the bundle generation process.

src/cratedb_about/cli.py (1)

8-8: LGTM! Updated to use the CrateDB-specific builder class.

The change correctly updates the import and instantiation to use CrateDbLllmsTxtBuilder, which centralizes resource paths and outline loading for CrateDB resources.

Also applies to: 98-98

CHANGES.md (1)

5-8: LGTM! Clear documentation of fixed semantics and new features.

The changelog clearly explains the important semantic fix for bundle files and documents the new HTML outline feature. The reference link to the relevant issue is a good practice for traceability.

src/cratedb_about/bundle/llmstxt.py (3)

37-45: Comprehensive documentation of the llms.txt and llms-full.txt semantics.

The updated comments clearly explain the purpose and differences between llms.txt and llms-full.txt files, which aligns with the PR objectives of fixing the semantics between these files.


49-63: Good addition of docstring and improved resource handling.

The method now uses the class field self.readme_md instead of a hardcoded path, making the code more flexible and maintainable. The added docstring clearly explains the purpose of the method.


77-88: Well-structured subclass for CrateDB-specific implementation.

The new CrateDbLllmsTxtBuilder subclass appropriately sets default values for resource paths and initializes the outline field. This improves modularity by separating the generic builder logic from CrateDB-specific resource handling.

However, consider adding a docstring to the __post_init__ method to explain its purpose:

    def __post_init__(self):
+        """Initialize the outline by loading it from the provided URL."""
        self.outline = CrateDbKnowledgeOutline.load(self.outline_url)
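
The split between a generic base class and a project-specific subclass can be sketched as follows. All names here (`BaseBuilder`, `ProjectBuilder`, `load_outline`) are hypothetical stand-ins, and the loader returns a plain dict instead of a real outline document.

```python
import dataclasses
from pathlib import Path


def load_outline(url: str) -> dict:
    """Hypothetical loader standing in for CrateDbKnowledgeOutline.load()."""
    return {"source": url}


@dataclasses.dataclass
class BaseBuilder:
    """Generic builder: declares the fields but does not populate them."""

    outline_url: str
    outdir: Path
    outline: dict = dataclasses.field(init=False)
    readme_md: Path = dataclasses.field(init=False)


@dataclasses.dataclass
class ProjectBuilder(BaseBuilder):
    """Project-specific subclass: supplies resource defaults and loads the outline."""

    readme_md: Path = dataclasses.field(init=False, default=Path("README.md"))

    def __post_init__(self):
        """Initialize the outline by loading it from the provided URL."""
        self.outline = load_outline(self.outline_url)


builder = ProjectBuilder(outline_url="outline.yaml", outdir=Path("public"))
```

Because the dataclass-generated `__init__` calls `__post_init__` last, the subclass can rely on `outline_url` being set before it loads the outline.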

@amotl amotl marked this pull request as ready for review May 18, 2025 17:28

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/cratedb_about/bundle/llmstxt.py (2)

25-27: Consider providing initializers for non-init fields.

These fields are marked with init=False but aren't initialized in the base class. While they're set in the subclass, it would be helpful to add documentation or default values to clarify how these fields should be populated when using the base class directly.

-    outline: OutlineDocument = dataclasses.field(init=False)
-    readme_md: Traversable = dataclasses.field(init=False)
-    outline_yaml: Traversable = dataclasses.field(init=False)
+    outline: OutlineDocument = dataclasses.field(init=False, default=None)
+    readme_md: Traversable = dataclasses.field(init=False, default=None)
+    outline_yaml: Traversable = dataclasses.field(init=False, default=None)
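
The practical difference between the two variants is shown in this small sketch with hypothetical class names. A non-init field without a default does not exist on the instance until something assigns it, while `default=None` makes it readable immediately. If `default=None` is adopted, the annotation would arguably need to become optional as well, as assumed here.

```python
import dataclasses
import typing as t


@dataclasses.dataclass
class WithoutDefault:
    """Non-init field without a default: unset until something assigns it."""

    outline: t.Any = dataclasses.field(init=False)


@dataclasses.dataclass
class WithDefault:
    """Non-init field with a default: readable as None right after construction."""

    outline: t.Optional[t.Any] = dataclasses.field(init=False, default=None)


try:
    WithoutDefault().outline
    unset_raises = False
except AttributeError:
    unset_raises = True

default_value = WithDefault().outline
```

The trade-off is between failing loudly with an AttributeError when the base class is used directly, and failing later when code operates on a None outline.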

80-91: Well-structured subclass implementation.

Creating the CrateDbLllmsTxtBuilder subclass is a good way to provide a specific implementation while keeping the base class flexible:

  1. Default values for resource paths are provided
  2. __post_init__ is used to properly initialize the outline field
  3. This design supports dependency injection and testing

However, consider adding a comment to the base class explaining that it's intended to be subclassed, with the non-init fields expected to be set by subclasses.

 @dataclasses.dataclass
 class LllmsTxtBuilder:
     """
     Build llms.txt files for CrateDB.
+    
+    This is a base class intended to be subclassed. The non-init fields
+    (outline, readme_md, outline_yaml) should be initialized by subclasses.
     """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6451737 and e20be49.

📒 Files selected for processing (2)
  • src/cratedb_about/bundle/llmstxt.py (4 hunks)
  • src/cratedb_about/outline/model.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cratedb_about/outline/model.py
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/cratedb_about/bundle/llmstxt.py

[warning] 76-77: src/cratedb_about/bundle/llmstxt.py#L76-L77
Added lines #L76 - L77 were not covered by tests

🔇 Additional comments (7)
src/cratedb_about/bundle/llmstxt.py (7)

5-5: Appropriate import addition for type annotations.

Adding the Traversable import is necessary for the new type annotations in the dataclass fields.


11-11: Good import update for the OutlineDocument class.

The import of OutlineDocument is properly added to support the new dataclass field.


37-45: Well-documented semantic clarification of llms.txt files.

The addition of clear comments explaining the purpose and differences between llms.txt and llms-full.txt files is excellent. The implementation now properly uses the instance's outline field, improving code organization.


50-52: Good addition of descriptive docstring.

Adding a clear docstring to the copy_readme method improves code readability and maintainability.


55-55: Appropriate use of instance field.

Using the instance's readme_md field instead of a hardcoded path improves flexibility and testability.


66-69: Clear docstring explains file purpose and relationships.

The docstring effectively explains the purpose of the source files and their relationship to the llms.txt file.


71-71: Good use of instance field for improved flexibility.

Using the instance's outline_yaml field instead of a hardcoded path enhances configurability.

@amotl amotl force-pushed the llms-txt-fix-semantics branch from e20be49 to 3179fbb Compare May 18, 2025 17:31
@amotl amotl requested a review from surister May 19, 2025 09:32
@amotl amotl force-pushed the llms-txt-fix-semantics branch from 3179fbb to 99f4e21 Compare May 19, 2025 21:10
@amotl amotl merged commit dbb8cd3 into main May 19, 2025
6 checks passed
@amotl amotl deleted the llms-txt-fix-semantics branch May 19, 2025 21:16