llms-txt: Introduce caching for expanding outline into Markdown file#33
llms-txt: Introduce caching for expanding outline into Markdown file#33
Conversation
## Walkthrough
This update introduces a centralized HTTP cache client utility, refactors related components to use it, and adds configuration via a new settings class. Caching is now used in both outline expansion and query loading. Dependency and test updates support the new caching and configuration infrastructure.
## Changes
| File(s) | Change Summary |
|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| CHANGES.md | Updated changelog to note caching for outline-to-Markdown expansion in `llms-txt`. |
| pyproject.toml | Added `platformdirs<5` as a new dependency. |
| src/cratedb_about/settings.py | Introduced `Settings` class for global config, including HTTP timeout, cache TTL, and cache path logic using `platformdirs`.|
| src/cratedb_about/util.py | Added `get_cache_client` function to create a configured, cached HTTP client with error handling and logging. |
| src/cratedb_about/outline/model.py | Updated `to_llms_txt` in `OutlineDocument` to use the new cache client for HTTP requests via `llms_txt`. |
| src/cratedb_about/query/model.py | Refactored to use `get_cache_client` and global cache TTL from settings, replacing manual client setup. |
| src/cratedb_about/cli.py | Enhanced `outline_url_option` with clearer help, metavar, and a callback converting non-URL strings to expanded Paths. |
| tests/test_util.py | Added tests for `get_cache_client`, covering normal operation and error handling with logging. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant OutlineDocument
participant get_cache_client
participant llms_txt
participant HTTP
User->>OutlineDocument: to_llms_txt()
OutlineDocument->>get_cache_client: obtain cache-enabled client
get_cache_client-->>OutlineDocument: returns CacheClient
OutlineDocument->>llms_txt: create_ctx (with patched HTTP client)
llms_txt->>HTTP: Fetch resources (cached)
HTTP-->>llms_txt: Response (possibly from cache)
llms_txt-->>OutlineDocument: Context as string
OutlineDocument-->>User: Markdown outputPossibly related PRs
Suggested reviewers
Poem
|
5a341f8 to
919362e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (8)
docs/backlog.md (1)
12-12: Link the timeout issue fortest_cli_bundle.
Consider creating or linking to a GitHub issue for the heavytest_cli_bundletest to ensure it’s tracked and addressed systematically.src/cratedb_about/outline/model.py (2)
94-108: Good implementation of caching for llms-txt packageThe integration of the hishel caching library for llms-txt HTTP requests is a solid approach. This will significantly improve performance when expanding outlines into Markdown files by avoiding repeated network requests.
A few suggestions to further enhance this implementation:
- # Patch `llms_txt` package to use caching per Hishel. - # https://hishel.com/ - controller = hishel.Controller(allow_stale=True) - storage = hishel.SQLiteStorage(ttl=3600) + # Patch `llms_txt` package to use caching per Hishel. + # https://hishel.com/ + # Get TTL from environment variable or use default (3600s = 1 hour) + ttl = int(os.getenv("ABOUT_CACHE_TTL", 3600)) + controller = hishel.Controller(allow_stale=True) + storage = hishel.SQLiteStorage(ttl=ttl)Also, consider stopping the mock patch after use to prevent any unintended side effects:
with hishel.CacheClient(controller=controller, storage=storage, timeout=10.0) as client: # Patch the client object. - mock.patch("llms_txt.core.httpx", client).start() + patcher = mock.patch("llms_txt.core.httpx", client) + patcher.start() # Import module lazily to relax dependency surface. from llms_txt import create_ctx # Expand links and output in Markdown format. markdown = self.to_markdown() ctx = create_ctx(markdown, optional=optional, n_workers=None) + # Stop the patcher to avoid affecting other code + patcher.stop() return str(ctx)
3-5: Don't forget to import os for the suggested improvementIf you implement the suggestion to make the TTL configurable via environment variable, you'll need to add an import for the os module.
import typing as t from io import StringIO from unittest import mock +import os import hishelsrc/cratedb_about/cli.py (2)
16-24: Consider tightening validation around remote URLs and expose aPathtype for local files
outline_url_optioncurrently accepts a free-form string that may point either to a local file or to a remote URL.
Two issues come up in practice:
- Passing a malformed URL is silently accepted and only fails later in
CrateDbKnowledgeOutline.load, far away from the CLI surface.- A local relative path is accepted, but it is not clear from the help text that this is supported.
You could improve UX and early-fail behaviour by:
-outline_url_option = click.option( - "--url", - "-u", - envvar="ABOUT_OUTLINE_URL", - type=str, - required=False, - default=None, - help="URL to the outline file. By default, the built-in outline is used.", -) +outline_url_option = click.option( + "--url", + "-u", + envvar="ABOUT_OUTLINE_URL", + type=str, + required=False, + default=None, + metavar="URL|FILE", + help=( + "Outline source. Provide either an HTTP(S) URL or a local file path. " + "If omitted, the built-in outline is used." + ), + callback=lambda _, __, v: v if not v or v.startswith(("http://", "https://")) else Path(v).expanduser(), +)This surfaces “file vs. URL” semantics early and prevents obviously invalid schemes such as
htp://….
72-90:--formatinconsistency may confuse usersThe
outlinecommand still advertises/accepts"llms-txt"whilebundlenow hard-codes the alias"llm". From a UX perspective having to remember two spellings for the same output is surprising.Two options:
- Accept both aliases on both commands, using
click.Choice(["llm", "llms-txt"]), and normalise internally.- Rename everywhere to a single canonical value (
"llm"seems shorter) and emit a deprecation warning when the old alias is detected.Either way, removing the manual
if format_ != "llm": …guard is possible becauseclick.Choicealready guarantees validity.src/cratedb_about/query/model.py (1)
91-110:fetch()lacks support forfile://scheme and treats non-HTTP URLs as errorsThe current path detection works for local plain paths, but:
file:///absolute/path(a common way to reference local files in environment variables) is not supported.- Any other remote scheme (
ftp://,s3://, etc.) raisesNotImplementedError, making the error handling path noisy.A small refactor makes the function more future-proof:
- path = Path(url) - path = path.expanduser().resolve() - if path.exists() and path.is_file(): - return path.read_text() - if url.startswith("http"): + if url.startswith("file://"): + return Path(url[7:]).expanduser().read_text() + # Local file (absolute or relative path) + path = Path(url).expanduser() + if path.is_file(): + return path.read_text() + # HTTP(S) resources + if url.startswith(("http://", "https://")): response = self.http_client.get(url) response.raise_for_status() return response.text - raise NotImplementedError(f"Unable to load context file. Source: {url}") + raise NotImplementedError(f"Unsupported scheme in context URL: {url}")This preserves existing behaviour while gracefully extending support.
tests/test_query.py (2)
116-135: HTTP payload test is skipped – hishel provides aMockTransporthelperInstead of skipping
test_model_payload_from_http, you can simulate the network response deterministically:from hishel.mock import MockTransport def test_model_payload_from_http(monkeypatch): url = "https://example.com/context.txt" content = "HTTP context" monkeypatch.setenv("ABOUT_CONTEXT_URL", url) transport = MockTransport() transport.add(url, text=content) loader = CrateDbKnowledgeContextLoader() loader.http_client._transport = transport # inject mocked transport result = loader.get_prompt() assert content in resultThis keeps coverage high and exercises the caching layer.
137-151:caplogneeds explicit log-level to capture the message
logger.exception()logs atERRORlevel.
If the root logger’s level is higher, the assertion oncaplog.messagesmay fail spuriously.Add:
with caplog.at_level(logging.ERROR): result = loader.get_prompt()to guarantee the message is captured regardless of test runner settings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
.gitignore(1 hunks)CHANGES.md(1 hunks)README.md(3 hunks)codecov.yml(1 hunks)docs/backlog.md(1 hunks)pyproject.toml(2 hunks)src/cratedb_about/bundle/llmstxt.py(2 hunks)src/cratedb_about/cli.py(3 hunks)src/cratedb_about/outline/model.py(2 hunks)src/cratedb_about/query/core.py(5 hunks)src/cratedb_about/query/model.py(2 hunks)tests/assets/outline.yaml(1 hunks)tests/conftest.py(1 hunks)tests/test_cli.py(5 hunks)tests/test_outline.py(2 hunks)tests/test_query.py(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/cratedb_about/query/core.py (1)
src/cratedb_about/query/model.py (3)
CrateDbKnowledgeContextLoader(125-135)KnowledgeContextLoader(46-122)get_prompt(111-122)
🔇 Additional comments (27)
pyproject.toml (2)
106-106: Approve test dependency addition.
Addingrequests-mock<2under[tool].optional-dependencies.testaligns with mocking HTTP in tests. No concerns.
76-76:Details
❓ Verification inconclusive
Verify compatibility and licensing of
hishel.
Ensure thathishel<0.2provides the required HTTP caching and SQLite storage functionality and that its license is compatible with Apache-2.0. Confirm that the latest published version under 0.2 supports Python >=3.9.Run:
🏁 Script executed:
#!/bin/bash # Verify `hishel` versions and metadata on PyPI curl -s https://pypi.org/pypi/hishel/json | jq '.info.version, .info.license'Length of output: 92
🏁 Script executed:
#!/bin/bash # Check required Python versions and classifiers for hishel curl -s https://pypi.org/pypi/hishel/json | jq '.info.requires_python, .info.license, .info.classifiers'Length of output: 709
Verify
hishelcompatibility and functionalityThe PyPI metadata for hishel 0.1.2 shows:
- Python requirement “>=3.9” ✔
- License classifier “BSD License” (OSI-approved, compatible with Apache-2.0) ✔
licensefield is unset (you may want to add it explicitly)I could not confirm HTTP caching or SQLite storage support from the metadata alone. Please review the package’s documentation or source to ensure it provides the required caching and storage features.
• File: pyproject.toml
Lines: 76"hishel<0.2",.gitignore (1)
7-7: Approve ignoring SQLite cache files.
Adding*.sqliteto.gitignoreproperly excludes cache artifacts from commits, preventing accidental check-ins of local database files.tests/assets/outline.yaml (1)
26-28: Approve updated outline test fixture.
The replacement of the "Testing README" entry with "Example Domain" matches the intended CLI output change. Ensure that corresponding tests intests/test_outline.pyhave been updated to assert on "Example Domain".Please verify that
tests/test_outline.pyassertions reference "Example Domain".codecov.yml (1)
1-14: Approve Codecov configuration.
The newcodecov.ymlsets a project coverage target of 85% and makes patch coverage informational, which aligns with the CI strategy. Indentation and YAML syntax are correct.docs/backlog.md (1)
9-11: Approve backlog item forllms-txtsize comparison.
Tracking the comparison ofllms.txtfile sizes in issue #20 is clear. Ensure the issue remains up-to-date with measurement results.tests/test_outline.py (2)
109-109: Assertion aligned with updated test asset content.The change appropriately updates the test expectation to look for "Example Domain" instead of the previous "Testing README", reflecting changes in the test asset file.
125-125: Assertion aligned with updated test asset content.The change appropriately updates the test expectation to look for "Example Domain" instead of the previous "Testing README", reflecting changes in the test asset file.
src/cratedb_about/bundle/llmstxt.py (3)
19-19: Good addition of configurable outline source.Adding the
outline_urlparameter to the LllmsTxtBuilder dataclass enables flexible specification of alternative outline sources, supporting the new CLI--urloption andABOUT_OUTLINE_URLenvironment variable.
23-23: Improved logging with more descriptive message.The updated log message is more informative, clearly indicating both the format and output directory.
38-38: Good implementation of configurable outline loading.The change properly uses the provided
outline_urlparameter when loading the outline, enabling users to specify alternative outline sources.Note: There's a TODO comment above this line about optimizing to avoid redundant data processing, which is a good candidate for future improvement.
tests/conftest.py (1)
9-16: Improved test environment cleanup.Refactoring the environment variable cleanup to use a list-based approach improves maintainability and ensures tests run in a clean environment. The inclusion of the new environment variables
ABOUT_CONTEXT_URLandABOUT_OUTLINE_URLproperly supports the new features introduced in this PR.CHANGES.md (1)
17-23: Comprehensive changelog entries.The added changelog entries accurately document all the key enhancements in this PR:
- Support for
--url/ABOUT_OUTLINE_URLoption in the Bundle command- Improved handling of
--formatoption- Support for loading context files from local filesystem
- Caching for HTTP context payloads
- Caching for expanding outlines into Markdown
These detailed entries will help users understand the new capabilities.
src/cratedb_about/query/core.py (4)
43-45: Well-designed integration of the KnowledgeContextLoaderGood decision to use a dataclass field with a default factory to instantiate the context loader. This provides flexibility while maintaining backward compatibility.
86-90: Clean refactoring to use the new context loaderThe transition from using static Settings class to the instance-based context loader is clean and maintains the same functionality.
117-129: Good simplification of the prompt message handlingThe refactored logic for adding the prompt message to input_messages is cleaner and easier to follow than the previous implementation.
157-158: Consistent use of context loader throughout the codeThe instructions field is consistently accessed from the context loader.
README.md (4)
156-157: Updated bundle command example with correct format optionThe documentation now correctly reflects the required
--format=llmparameter and explicitly includes the output directory.
158-161: Clear explanation of outline file optionsGood addition explaining the default behavior and override options through the
--urlflag or environment variable.
170-173: Helpful tip about list-questions commandThis is a useful addition that helps users discover the built-in question suggestions.
182-188: Well-documented configuration options for cachingExcellent addition of notes explaining how to configure the context URL and cache TTL using environment variables. This clearly documents the new caching functionality introduced in this PR.
tests/test_cli.py (5)
48-56: Updated test to match new CLI requirementsThe test has been properly updated to use the required
--format llmargument and set the appropriate environment variables.
61-61: Updated log assertion to match new message formatGood adjustment to the expected log message.
75-76: Updated test for missing outdir to include formatTest correctly includes the now required format parameter.
85-99: New test for missing format parameterExcellent addition of a test case to verify that the CLI properly requires the
--formatoption. This improves test coverage for the new CLI behavior.
114-114: Updated error message assertionThe assertion has been correctly updated to match the more specific error message format from Click.
tests/test_query.py (1)
28-31: Patch env vars before instantiating the loader to ensure consistencyHere the loader fixture is created before
ABOUT_CONTEXT_URLis patched.
Although the propertyurlis evaluated lazily, the HTTP client and cache were already initialised with the previous value (raising odd edge-cases if both values point to different hosts).Safer pattern:
def test_model_loader_url_env_success(mocker): mocker.patch.dict("os.environ", {"ABOUT_CONTEXT_URL": "http://example.com"}) loader = CrateDbKnowledgeContextLoader() assert loader.url == "http://example.com"Repeating this approach in similar tests avoids hidden state discrepancies.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/cratedb_about/cli.py (2)
24-25: Suggestion: Improve help text formattingThe help text is well-written but could benefit from more consistent formatting.
- help="Outline source. Provide either an HTTP(S) URL or a local file path. " - "If omitted, the built-in outline is used.", + help="Outline source. Provide either an HTTP(S) URL or a local file path. " + "If omitted, the built-in outline is used.",
26-28: Consider adding path existence validationWhile the current implementation correctly handles path expansion, it doesn't validate if the file actually exists when a local path is provided. This might lead to confusing errors later in the execution flow.
- callback=lambda _, __, v: v - if not v or v.startswith(("http://", "https://")) - else Path(v).expanduser(), + callback=lambda _, __, v: v + if not v or v.startswith(("http://", "https://")) + else Path(v).expanduser().resolve(),Additionally, you might want to add a more comprehensive validation function:
def validate_outline_source(ctx, param, value): if not value or value.startswith(("http://", "https://")): return value path = Path(value).expanduser().resolve() if not path.exists(): raise click.BadParameter(f"File does not exist: {path}") return path
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/cratedb_about/cli.py(1 hunks)
🔇 Additional comments (1)
src/cratedb_about/cli.py (1)
23-28: Good improvement to CLI option usability and flexibility!The enhancements to the
outline_url_optionimprove both documentation and behavior:
- Adding
metavar="URL|FILE"makes the expected input format clear in help output- The updated help text clearly explains the accepted input types and default behavior
- The callback function intelligently handles both URL and file path inputs, with proper home directory expansion for file paths
About
Apply a bit of monkey patching to support the llms-txt package with content caching.