
Add CLI program cratedb-about ask for conversations about CrateDB #4

Merged

amotl merged 2 commits into main from use on Apr 16, 2025

Conversation

amotl (Member) commented Apr 16, 2025

coderabbitai bot commented Apr 16, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 13 minutes and 44 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between c4c7111 and f9794d8.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • docs/backlog.md (1 hunks)
  • docs/sandbox.md (1 hunks)
  • pyproject.toml (3 hunks)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)

Walkthrough

This update introduces a new command-line interface (CLI) tool named cratedb-about for interacting with CrateDB-related knowledge using large language models (LLMs) via subcommands such as ask and list-questions. The project configuration is updated to include new dependencies and scripts, and documentation is revised to reflect the new installation and usage instructions. Automated testing and dependency management are set up using GitHub Actions and Dependabot. Additional documentation is provided for development workflows and future feature planning. The .gitignore file is expanded to exclude more build artifacts.
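The subcommand shape described above can be sketched as follows. The real module uses click, so this stdlib-argparse version is only a stand-in; the example questions and the dispatch stub are illustrative, not the package's actual contents.

```python
import argparse

# Stand-in for cratedb_about.model.Example.questions (illustrative only).
EXAMPLE_QUESTIONS = [
    "What are the benefits of CrateDB?",
    "How do I apply sharding properly?",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cratedb-about")
    sub = parser.add_subparsers(dest="command", required=True)
    ask = sub.add_parser("ask", help="Ask a question about CrateDB")
    ask.add_argument("question")
    ask.add_argument("--backend", choices=["openai", "claude"], default="openai")
    sub.add_parser("list-questions", help="List a few example questions")
    return parser


def run(argv) -> str:
    args = build_parser().parse_args(argv)
    if args.command == "list-questions":
        return "\n".join(EXAMPLE_QUESTIONS)
    # The real implementation dispatches to an LLM backend here.
    return f"Question: {args.question}\nAnswer: ..."
```

The actual CLI wires this up through a `cratedb-about` console-script entry point in pyproject.toml, so the subcommands are invoked as `cratedb-about ask "..."` and `cratedb-about list-questions`.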

Changes

File(s) Change Summary
CHANGES.md, README.md Updated changelog to mention the new CLI and its subcommands; revised README to show new installation and usage instructions for cratedb-about and removed old manual build instructions.
pyproject.toml Added dependencies: claudette, openai, requests<3; added script entry for cratedb-about CLI; updated mypy configuration with mypy_path and package list.
.github/dependabot.yml, .github/workflows/tests.yml Added Dependabot configuration for automated dependency and GitHub Actions updates; introduced a GitHub Actions workflow for testing on Ubuntu with Python 3.9 and 3.13, including linting, building, and running CLI commands.
.gitignore Added patterns to ignore bdist.* and __pycache__ directories.
docs/backlog.md, docs/sandbox.md Added a backlog document outlining future iterations and a sandbox guide for setting up and using the project in a development environment.
src/cratedb_about/cli.py New CLI module using click, providing ask and list_questions commands for querying and listing CrateDB-related questions via LLM backends.
src/cratedb_about/core.py Introduced CrateDBConversation dataclass to manage conversational interactions with CrateDB knowledge via "openai" or "claude" LLM backends, including methods for asking questions and handling backend selection.
src/cratedb_about/model.py Added Example class with a list of CrateDB-related questions and Settings class for fetching and caching prompt instructions from a remote source for use in LLM interactions.
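The fetch-and-cache behaviour described for src/cratedb_about/model.py might look roughly like this; `make_prompt_loader` and the injected `fetch` callable are assumptions for illustration, not the module's actual API.

```python
import functools


def make_prompt_loader(fetch):
    """Wrap an HTTP fetch callable so the remote prompt text is
    downloaded at most once per process and then served from cache."""

    @functools.lru_cache(maxsize=1)
    def get_prompt(url: str) -> str:
        return fetch(url)

    return get_prompt
```

A Settings class could hold the resulting loader together with the llms-ctx.txt URL; repeated `ask` calls then reuse the cached instructions instead of re-downloading them.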

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as CLI (cratedb-about)
    participant Core as Core (CrateDBConversation)
    participant Model as Model (Settings/Example)
    participant LLM as LLM Backend

    User->>CLI: Run "cratedb-about ask <question>"
    CLI->>Core: ask(question)
    Core->>Model: get_prompt()
    Model->>Model: Fetch and cache prompt from URL
    Core->>LLM: Send prompt and question
    LLM-->>Core: Return answer
    Core-->>CLI: Return answer
    CLI-->>User: Display answer
sequenceDiagram
    participant User
    participant CLI as CLI (cratedb-about)
    participant Model as Model (Example)

    User->>CLI: Run "cratedb-about list-questions"
    CLI->>Model: Get example questions
    Model-->>CLI: Return questions list
    CLI-->>User: Display questions

Poem

🐇
In the meadow of code, a new CLI appears,
With questions and answers for CrateDB peers.
From Claude or OpenAI, wisdom flows out,
Just type and inquire—no need to shout!
Docs and tests now hop along too,
As this rabbit cheers, "Great work, woohoo!"



amotl (Member Author) left a comment

Just sharing a quick example of the possible outcome of this experiment.

examples/ask.py Outdated
Comment on lines +110 to +119
questions = [
"What are the benefits of CrateDB?",
"Tell me about why CrateDB is different.",
"Tell me about CrateDB Cloud.",
"How to use sequences with CrateDB?",
"CrateDB does not seem to provide an AUTOINCREMENT feature?",
"How do I apply sharding properly?",
"How much data can CrateDB store?",
"Please tell me how CrateDB stores data.",
]
amotl (Member Author) commented Apr 16, 2025

Let me share responses to a few questions for demonstration purposes. wdyt?

amotl (Member Author) commented:

$ uv run examples/ask.py "How much data can CrateDB store?"

Question

How much data can CrateDB store?

Answer

CrateDB has no built‑in “upper” data limit – it’s designed to grow horizontally. You simply add more nodes (and disks), and CrateDB will shard and distribute your tables across them.

Things that determine how much you can store:

• Disk capacity on your data‑nodes – each node holds as many shards (and their replicas) as its disks allow.
• Shard count per node – we recommend keeping primary + replica shards under 1 000 per node to preserve stability.
• Cluster size – more nodes means more aggregate storage (and more CPUs for parallel queries).

In practice you can start with a single machine (tens or hundreds of gigabytes) and grow to clusters holding terabytes or even petabytes of data. On CrateDB Cloud you can spin up clusters from an 8 GB free tier all the way up to multi‑TB enterprise clusters.

In short: your capacity is only limited by the hardware (or cloud plan) you choose and how you size your shards and nodes – there’s no hard‑coded CrateDB ceiling.

amotl (Member Author) commented:

$ uv run examples/ask.py "Please tell me how CrateDB stores data."

Question

Please tell me how CrateDB stores data.

Answer

Under the hood CrateDB is really a distributed, column‑oriented Lucene database. Here’s how it works end‑to‑end:

  1. Shared‑nothing cluster
    • Every CrateDB node is equal—there’s no single master for data.
    • Tables are split into shards; each shard is a “mini‐database” (a Lucene index) kept on one node.
    • You configure a shard count per table (and per partition, if you use partitioned tables). CrateDB then automatically places primaries and replicas of those shards across your nodes.

  2. Column‑oriented storage on Lucene
    • Inside each shard, CrateDB stores each column as its own compressed Lucene segment (“columnar store”).
    • Numeric, date, geo, full‑text and even nested‐object fields each get their own inverted‐ or BKD‐style Lucene index.
    • Every field is automatically indexed on write — there’s no manual mapping step.

  3. Ingestion and durability
    • Writes are first appended to a transaction log (translog) for durability.
    • CrateDB periodically “refreshes” shards—flushing memtables into new Lucene segments and making data visible to queries.
    • Under heavy write you’ll see many small segments; Lucene automatically merges them down in the background to optimize reads.

  4. Querying
    • When you run SQL, the query planner breaks it into per‑shard jobs.
    • Each shard scans just the columns it needs, leveraging Lucene’s indexes and columnar compression to minimize I/O.
    • Results are aggregated and joined across shards and nodes in a fully distributed fashion.

  5. Partitioning and sharding
    • Partitioned tables physically create one index (set of shards) per partition key (e.g. by date).
    • Sharding (including hash sharding on your chosen primary key) lets you spread data uniformly across the cluster for parallelism.

  6. On‑disk layout
    • Data lives under each node’s path.data (directory), in subdirectories per Lucene index (table or partition), then per shard.
    • Each shard directory holds its Lucene segment files plus a translog folder for recent writes.

  7. Replication and failover
    • Every shard has a configurable number of replicas.
    • If a node fails, its primaries or replicas are automatically re‑elected or re‑allocated to other nodes, ensuring high availability.

  8. Managed options
    • On CrateDB Cloud the same storage model applies—your data sits in Lucene shards on managed VMs, with automated backups/snapshots to object storage.

In short: your SQL tables are really Lucene‐indexed column stores, split into shards and balanced across a cluster. Every column (and nested JSON field) is indexed for blazing‑fast queries, and Lucene’s segment/merge architecture gives you near‑real‑time visibility on your data.

amotl (Member Author) commented:

$ uv run examples/ask.py "CrateDB does not seem to provide an AUTOINCREMENT feature?"

Question

CrateDB does not seem to provide an AUTOINCREMENT feature?

Answer

You’re right – CrateDB does not have a built‑in SERIAL or AUTO_INCREMENT type. Because it’s a distributed system it avoids the single‑node bottleneck of a coordinated counter. You have a few options instead:

• Use a timestamp as your key

CREATE TABLE events (
  id   BIGINT DEFAULT now() PRIMARY KEY,
  ...);

(Gaps are fine, but you lose true row‑count sequencing.)

• Use UUIDs

CREATE TABLE things (
  id   TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
  ...);

(Or bring in a UDF to generate UUIDv7 for roughly time‑ordered IDs.)

• Use IDs from your source system
If you’re importing from another database you can just keep its primary keys.

• Roll your own sequence table (optimistic concurrency):

  1. Create a tiny one‑shard table to hold your counters:
    CREATE TABLE sequences (
      name       TEXT PRIMARY KEY,
      last_value BIGINT
    ) CLUSTERED INTO 1 SHARDS;
    INSERT INTO sequences (name, last_value) VALUES ('myseq', 0);
  2. When you want a new ID, atomically bump it and grab the value:
    WITH s AS (
      SELECT last_value + 1 AS next,
             _seq_no, _primary_term
      FROM sequences
      WHERE name = 'myseq'
    ), u AS (
      UPDATE sequences
      SET last_value = s.next
      WHERE name = 'myseq'
        AND _seq_no        = s._seq_no
        AND _primary_term  = s._primary_term
      RETURNING s.next AS id
    )
    INSERT INTO mytable (id, …)
    SELECT id, … FROM u;
    If there’s a contention (nobody returned in u), you retry with exponential back‑off.

That last pattern builds a true sequential counter in a distributed‑safe way. Pick the approach that best fits your ingestion rate and consistency needs.
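The retry step of that pattern can be sketched like so; `try_next_id` is a hypothetical callable standing in for the UPDATE … RETURNING round trip, returning None when another writer won the optimistic-concurrency race.

```python
import random
import time


def next_sequence_id(try_next_id, max_attempts: int = 5, base_delay: float = 0.05) -> int:
    """Retry an optimistic-concurrency ID allocation with exponential backoff.

    `try_next_id` should return the new ID, or None when the conditional
    UPDATE matched no row because _seq_no/_primary_term moved underneath it.
    """
    for attempt in range(max_attempts):
        new_id = try_next_id()
        if new_id is not None:
            return new_id
        # Back off: base * 2^attempt, with jitter to de-synchronize writers.
        time.sleep(base_delay * (2 ** attempt) * random.random())
    raise RuntimeError("could not allocate a sequence ID after retries")
```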

amotl (Member Author) commented Apr 16, 2025

$ uv run examples/ask.py "Does CrateDB support SQLAlchemy?"

Yes – CrateDB has first‑class SQLAlchemy support via the “sqlalchemy‑cratedb” dialect.

• Project & docs: https://github.com/crate/sqlalchemy-cratedb / https://cratedb.com/docs/sqlalchemy-cratedb/
• Install:
  pip install sqlalchemy-cratedb
• Connection string (HTTP):
  crate://:@:4200/?schema=doc
• Example:
  from sqlalchemy import create_engine
  engine = create_engine("crate://crate@localhost:4200/?schema=doc")
  with engine.connect() as conn:
     result = conn.execute("SELECT * FROM my_table LIMIT 10")
     for row in result: print(row)

Evaluation: There are certainly errors inside.

@amotl amotl marked this pull request as ready for review April 16, 2025 19:09
coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
examples/ask.py (3)

36-38: Add type hints to improve code quality and maintainability.

Adding type hints would improve code clarity and enable static type checking, which could help catch issues early.

-    backend: t.Literal["claude", "openai"] = "openai"
-    use_knowledge: bool = True
+    backend: t.Literal["claude", "openai"] = "openai"
+    use_knowledge: bool = True
+
+    def ask(self, question: str) -> None:

This relates to the commented-out mypy in pyproject.toml - adding proper type hints would allow you to re-enable type checking.


122-140: Make backend configurable and add error handling in main.

The main function has hardcoded values and minimal error handling.

Consider enhancing the main function to:

  1. Accept backend configuration from environment variables or command line
  2. Add better error handling
  3. Make the interface more user-friendly
 def main():
     """
     Ready.
     """
+    # Get backend from environment variable or use default
+    backend = os.environ.get("CRATEDB_LLM_BACKEND", "openai").lower()
+    if backend not in ["openai", "claude"]:
+        print(f"Warning: Unknown backend '{backend}'. Using 'openai' instead.", file=sys.stderr)  # noqa: T201
+        backend = "openai"
+        
     wizard = CrateDBConversation(
-        backend="openai",
+        backend=backend,
         use_knowledge=True,
     )
     if len(sys.argv) >= 2:
         question = sys.argv[1]
         if question == "list-questions":
             print("A few example questions:\n")  # noqa: T201
-            print("\n\n".join(Example.questions))  # noqa: T201
+            for i, q in enumerate(Example.questions, 1):
+                print(f"{i}. {q}")  # noqa: T201
             return
+        elif question in ["-h", "--help"]:
+            print("Usage: python ask.py [QUESTION|list-questions|--help]")  # noqa: T201
+            print("\nExamples:")  # noqa: T201
+            print("  python ask.py 'What are the benefits of CrateDB?'")  # noqa: T201
+            print("  python ask.py list-questions")  # noqa: T201
+            print("\nEnvironment variables:")  # noqa: T201
+            print("  OPENAI_API_KEY - Required when using the OpenAI backend")  # noqa: T201
+            print("  CRATEDB_LLM_BACKEND - Set to 'openai' or 'claude' (default: openai)")  # noqa: T201
+            return
     else:
         question = Example.questions[4]
     print(f"Question: {question}\nAnswer:")  # noqa: T201
-    wizard.ask(question)
+    try:
+        wizard.ask(question)
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)  # noqa: T201
+        sys.exit(1)

1-11: Enhance script dependency declaration.

The script uses a comment-based format for dependencies, but it could be improved.

Consider converting to a more standardized format like pyproject.toml for the script, or add a requirements.txt file. This would make it easier for users to install dependencies:

# requirements.txt
claudette>=0.1.0
openai>=1.0.0
requests>=2.0.0

Or update the comment to better explain how to install these dependencies manually:

 # /// script
 # requires-python = ">=3.8"
 # dependencies = [
 #     "claudette",
 #     "openai",
 #     "requests",
 # ]
 # ///
+# To install dependencies: pip install claudette openai requests
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6876fed and 0599200.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • examples/ask.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🪛 Gitleaks (8.21.2)
README.md

36-36: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🔇 Additional comments (1)
CHANGES.md (1)

7-7: LGTM! Changelog entry is clear and concise.

The changelog entry correctly documents the addition of the new example program and follows the project's existing format.

amotl force-pushed the use branch 3 times, most recently from 1703851 to d7fb2be on April 16, 2025 at 22:30
amotl changed the title from "Add example program examples/ask.py for conversations about CrateDB" to "Add CLI program cratedb-about ask for conversations about CrateDB" on Apr 16, 2025
coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (7)
docs/backlog.md (1)

6-10: Improve Markdown formatting for URL references.

The Markdown linter identified bare URLs in the document. Consider using proper Markdown link syntax for better readability and maintainability.

-## Iteration +2
-- Unlock Discourse: https://community.cratedb.com/raw/1015
-- Unlock HTML resources: https://www.urltoany.com/url-to-markdown.
-  => Find the best standalone program.
-- Unlock GitHub projects: https://github.com/mattduck/gh2md
+## Iteration +2
+- Unlock Discourse: [Discourse Raw Content](https://community.cratedb.com/raw/1015)
+- Unlock HTML resources: [URL to Markdown Converter](https://www.urltoany.com/url-to-markdown).
+  => Find the best standalone program.
+- Unlock GitHub projects: [gh2md GitHub Repository](https://github.com/mattduck/gh2md)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

7-7: Bare URL used
null

(MD034, no-bare-urls)


8-8: Bare URL used
null

(MD034, no-bare-urls)


10-10: Bare URL used
null

(MD034, no-bare-urls)

src/cratedb_about/cli.py (1)

31-33: Use Click's echo functions instead of direct stdout writes.

Click provides click.echo() which handles Unicode and different output streams better than direct writes to stdout.

-    sys.stdout.write(f"Question: {question}\nAnswer:\n")
-    sys.stdout.write(wizard.ask(question))
-    sys.stdout.write("\n")
+    click.echo(f"Question: {question}")
+    click.echo("Answer:")
+    click.echo(wizard.ask(question))
.github/workflows/tests.yml (1)

1-65: Well-structured GitHub Actions workflow for automated testing

The workflow is well-configured with appropriate triggers for push events, pull requests, and manual dispatch. The job matrix strategy with multiple Python versions is a good practice for ensuring compatibility.

A few suggestions for enhancement:

  1. Consider adding a job timeout to prevent runaway workflows
  2. Add a step to verify the installation by checking the CLI version
  3. Consider creating a separate job for linting to provide faster feedback
# Add timeout to prevent runaway workflows
 jobs:

   test:
+    timeout-minutes: 15
     name: "
     Python ${{ matrix.python-version }}
     "

# Add version check after installation
     - name: Set up project
       run: |
         uv pip install --editable='.[develop,test]'
+        cratedb-about --version
src/cratedb_about/core.py (4)

1-2: Update or remove stale comment reference

The comment appears to be a reference to source inspiration, but it would be more valuable to provide context about what specific patterns or ideas were derived from the referenced URL.

Either expand on what was derived from the referenced website or consider removing if not directly relevant to understanding the code.


18-23: Method implementation looks good but could be more robust

The ask method effectively routes questions to the appropriate backend implementation.

Consider adding error handling to catch and provide helpful messages for potential API failures:

def ask(self, question: str) -> str:
    try:
        if self.backend == "openai":
            return self.ask_gpt(question)
        if self.backend == "claude":
            return self.ask_claude(question)
        raise NotImplementedError("Please select an available LLM backend")
+    except Exception as e:
+        return f"Error querying LLM backend: {str(e)}"

52-53: Optimize imports for better code organization

The import for Message is placed after instantiating the OpenAI client, which disrupts the logical flow of imports.

Move the import to be grouped with the other OpenAI imports:

from openai import OpenAI
from openai.types.responses import ResponseInputTextParam
from openai.types.shared_params import Reasoning
+from openai.types.responses.response_input_param import Message

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
-from openai.types.responses.response_input_param import Message

56-57: Document model selection rationale

The code comments out "gpt-4o" and uses "o4-mini" without explaining the rationale or performance/cost tradeoffs.

Add a comment explaining the model selection reasoning:

- # model="gpt-4o",  # noqa: ERA001
- model="o4-mini",
+ # Using o4-mini as it provides a good balance of performance and cost
+ # Consider upgrading to gpt-4o for more complex queries if needed
+ model="o4-mini",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0599200 and d7fb2be.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • docs/backlog.md (1 hunks)
  • docs/sandbox.md (1 hunks)
  • pyproject.toml (3 hunks)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • .gitignore
  • .github/dependabot.yml
  • docs/sandbox.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGES.md
  • README.md
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/cratedb_about/cli.py (2)
src/cratedb_about/core.py (2)
  • CrateDBConversation (10-79)
  • ask (18-23)
src/cratedb_about/model.py (1)
  • Example (6-21)
src/cratedb_about/core.py (2)
src/cratedb_about/model.py (2)
  • Settings (24-44)
  • get_prompt (35-44)
src/cratedb_about/cli.py (1)
  • ask (20-33)
🪛 markdownlint-cli2 (0.17.2)
docs/backlog.md

7-7: Bare URL used
null

(MD034, no-bare-urls)


8-8: Bare URL used
null

(MD034, no-bare-urls)


10-10: Bare URL used
null

(MD034, no-bare-urls)

🔇 Additional comments (5)
pyproject.toml (3)

71-78: LGTM! Dependencies are appropriate for the new feature.

The added dependencies (claudette, openai, requests) align well with the project's new conversational capabilities about CrateDB.


89-89: LGTM! CLI script entry point properly defined.

The CLI script entry point correctly references the main function in the CLI module.


125-129: LGTM! Proper mypy configuration for the new package.

The mypy configuration correctly specifies the source directory and package for type checking.

src/cratedb_about/model.py (1)

11-21: LGTM! Example questions list is comprehensive.

The example questions cover a good range of CrateDB topics and will be helpful for users exploring the CLI.

src/cratedb_about/core.py (1)

9-17: Add type annotations for return value of initialization methods

The dataclass could benefit from improved type annotations, especially for any post-initialization logic.

The dataclass structure is clean and the type annotations for the backend are appropriate, using a Literal type to restrict to valid options.

Configure the language model to support conversations about CrateDB.
"""

llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"

🛠️ Refactor suggestion

Avoid hardcoding GitHub commit hash in URL.

Using a specific commit hash in the URL creates a tight coupling to that exact version. If the repository changes or the commit becomes unavailable, this will break.

Consider one of these alternatives:

  1. Use a versioned, stable URL without commit hash
  2. Include the context file directly in the package
  3. Make the URL configurable through environment variables or settings
-    llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
+    llms_txt_url = os.getenv(
+        "CRATEDB_CONTEXT_URL",
+        "https://raw.githubusercontent.com/crate/about/main/build/llm/llms-ctx.txt"
+    )

Don't forget to add import os at the top of the file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
# At the top of the file, add:
import os
# …
# Around line 29, replace the hard‑coded URL with an env‑driven one:
- llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
+ llms_txt_url = os.getenv(
+ "CRATEDB_CONTEXT_URL",
+ "https://raw.githubusercontent.com/crate/about/main/build/llm/llms-ctx.txt"
+ )

amotl (Member Author) commented:

Let's keep it hard-coded like this until v0.0.1.

coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/cratedb_about/core.py (1)

26-34: 🛠️ Refactor suggestion

Add error handling and respect use_knowledge flag in ask_claude method

The Claude implementation has several issues that need to be addressed:

  1. It doesn't respect the use_knowledge flag that controls whether to use CrateDB context
  2. It lacks error handling for API failures
  3. The hardcoded model index issue is acknowledged with a FIXME comment
def ask_claude(self, question: str) -> str:
    from claudette import Chat, contents, models

    # FIXME: The API does not provide name-based lookup.
    model = models[1]  # Sonnet 3.5
    chat = Chat(model, sp=Settings.instructions)
-    chat(Settings.get_prompt())
+    if self.use_knowledge:
+        try:
+            prompt = Settings.get_prompt()
+            if prompt:
+                chat(prompt)
+        except Exception as e:
+            print(f"Warning: Failed to load knowledge context: {e}", file=sys.stderr)  # noqa: T201
+    
+    try:
        result = chat(question)
        return contents(result)
+    except Exception as e:
+        raise RuntimeError(f"Claude API error: {str(e)}") from e
🧹 Nitpick comments (2)
src/cratedb_about/core.py (2)

94-94: Fix the type issue instead of using type ignore

The type ignore suggests there's a type compatibility issue that should be fixed properly.

-            input=input_messages,  # type: ignore[arg-type]
+            input=input_messages,

Consider checking the OpenAI API documentation to ensure the input_messages variable matches the expected type for the input parameter. If necessary, adjust the type annotation on line 60 to match what the API expects.


1-8: Add explicit attribution in file header comment

The header comment mentions the code is derived from a source, but it would be clearer to include more details about the license or permission.

 # Derived from: https://llmstxt.org/domains.html
+# Used with permission under MIT License (or specify the actual license)

This adds clarity regarding the legal usage of the derived code.

🛑 Comments failed to post (1)
src/cratedb_about/core.py (1)

10-18: 🛠️ Refactor suggestion

Add validation for API keys based on the selected backend

The class should validate the required environment variables during initialization to fail fast and provide clear error messages to users.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.
+
+    Requires:
+      - OpenAI backend: Set OPENAI_API_KEY environment variable
+      - Claude backend: Set ANTHROPIC_API_KEY environment variable
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
+
+    def __post_init__(self):
+        """Validate configuration on initialization."""
+        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
+            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
+        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
+            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")

This validation aligns with the docstring in the ask function in cli.py which mentions these requirements.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
      - OpenAI backend: Set OPENAI_API_KEY environment variable
      - Claude backend: Set ANTHROPIC_API_KEY environment variable
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True

    def __post_init__(self):
        """Validate configuration on initialization."""
        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")


@coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/cratedb_about/core.py (1)

26-40: 🛠️ Refactor suggestion

Add docstring and improve Claude implementation

The ask_claude method lacks a docstring and uses a hardcoded model index, which is brittle.

    def ask_claude(self, question: str) -> str:
+        """
+        Ask a question using Anthropic's Claude model.
+        
+        Requires the ANTHROPIC_API_KEY environment variable to be set.
+        Uses Claude Sonnet 3.5 model.
+        
+        Args:
+            question: The question to ask about CrateDB
+            
+        Returns:
+            str: The response from Claude
+            
+        Raises:
+            RuntimeError: If there's an error communicating with the Claude API
+        """
        from claudette import Chat, contents, models

-        model = models[1]  # Sonnet 3.5
+        # Use an explicit model reference with a fallback for library changes
+        model_name = "claude-3-5-sonnet-20240620"  # Sonnet 3.5
+        try:
+            # claudette exposes `models` as a list of model-name strings
+            model = next((m for m in models if model_name in str(m)), models[1])
+        except (IndexError, TypeError) as e:
+            # Fail with a clear error if the models structure changes
+            raise RuntimeError("Claude API: Unable to find a suitable model") from e
        chat = Chat(model, sp=Settings.instructions)
        if self.use_knowledge:
            try:
                chat(Settings.get_prompt())
            except Exception as e:
                print(f"Warning: Failed to load knowledge context: {e}", file=sys.stderr)  # noqa: T201
        try:
            result = chat(question)
            return contents(result)
        except Exception as e:
            raise RuntimeError(f"Claude API error: {e}") from e
🧹 Nitpick comments (7)
src/cratedb_about/core.py (7)

19-24: Add docstring to the ask method

The ask method lacks a docstring that explains its behavior and parameters.

    def ask(self, question: str) -> str:
+        """
+        Ask a question about CrateDB using the configured LLM backend.
+        
+        Args:
+            question: The question to ask about CrateDB
+            
+        Returns:
+            str: The response from the LLM
+            
+        Raises:
+            NotImplementedError: If an unsupported backend is specified
+            ValueError: If required environment variables are missing
+            RuntimeError: If there's an error communicating with the LLM API
+        """
        if self.backend == "openai":
            return self.ask_gpt(question)
        if self.backend == "claude":
            return self.ask_claude(question)
        raise NotImplementedError("Please select an available LLM backend")

66-81: Update type annotation to use built-in list

The code uses t.List, which is deprecated in favor of the built-in list since Python 3.9 (PEP 585).

-        input_messages: t.List[Message] = []
+        input_messages: list[Message] = []

91-101: Make OpenAI model configurable

The OpenAI model is hardcoded, which limits flexibility. Consider making it configurable.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
    - OPENAI_API_KEY environment variable when using "openai" backend
    - ANTHROPIC_API_KEY environment variable when using "claude" backend
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
+    openai_model: str = "o4-mini"

And then update the model usage:

        response = client.responses.create(
            # model="gpt-4o",  # noqa: ERA001
-            model="o4-mini",
+            model=self.openai_model,
            reasoning=Reasoning(
                effort="medium",
                # Your organization must be verified to generate reasoning summaries
                # summary="detailed",  # noqa: ERA001
            ),
            instructions=Settings.instructions,
            input=input_messages,  # type: ignore[arg-type]
        )

100-100: Fix type issue instead of using type ignore comment

The code uses a # type: ignore[arg-type] comment. Consider fixing the type issue properly.

-            input=input_messages,  # type: ignore[arg-type]
+            input=input_messages,

The error might be due to outdated type stubs for the OpenAI library. Consider looking into using the latest version of the library or adding proper type definitions.


92-98: Consider replacing commented code with configurable options

The commented code suggests alternate configurations that might be useful but are not enabled. Consider making these configurable options.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
    - OPENAI_API_KEY environment variable when using "openai" backend
    - ANTHROPIC_API_KEY environment variable when using "claude" backend
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
    openai_model: str = "o4-mini"
+    openai_reasoning_effort: t.Literal["low", "medium", "high"] = "medium"
+    openai_reasoning_summary: t.Optional[t.Literal["auto", "concise", "detailed"]] = None

Then update the reasoning configuration:

        response = client.responses.create(
-            # model="gpt-4o",  # noqa: ERA001
            model=self.openai_model,
            reasoning=Reasoning(
-                effort="medium",
-                # Your organization must be verified to generate reasoning summaries
-                # summary="detailed",  # noqa: ERA001
+                effort=self.openai_reasoning_effort,
+                **({"summary": self.openai_reasoning_summary} if self.openai_reasoning_summary else {})
            ),
            instructions=Settings.instructions,
            input=input_messages,
        )

61-64: Use environment variable validation from __post_init__

The API key validation in ask_gpt method would be redundant if the suggested __post_init__ validation is implemented.

Once the __post_init__ validation is implemented, you could simplify this code:

-        api_key = os.environ.get("OPENAI_API_KEY")
-        if not api_key:
-            raise ValueError("OPENAI_API_KEY environment variable is required")
-        client = OpenAI(api_key=api_key)
+        client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

1-8: Move backend-specific imports to the top level

Backend-specific imports are currently inside methods. Consider moving them to the top level with conditional imports to improve performance when the class is instantiated multiple times.

# Derived from: https://llmstxt.org/domains.html
import dataclasses
import os
import sys
import typing as t

+# Import backends conditionally to avoid errors if dependencies are missing
+CLAUDE_AVAILABLE = False
+OPENAI_AVAILABLE = False
+
+try:
+    from claudette import Chat, contents, models
+    CLAUDE_AVAILABLE = True
+except ImportError:
+    pass
+
+try:
+    from openai import OpenAI
+    from openai.types.responses import ResponseInputTextParam
+    from openai.types.responses.response_input_param import Message
+    from openai.types.shared_params import Reasoning
+    OPENAI_AVAILABLE = True
+except ImportError:
+    pass
+
from cratedb_about.model import Settings

Then update the __post_init__ method to check for backend availability:

    def __post_init__(self):
        """Validate configuration."""
+        if self.backend == "openai" and not OPENAI_AVAILABLE:
+            raise ImportError("The 'openai' package is required when using the OpenAI backend")
+        elif self.backend == "claude" and not CLAUDE_AVAILABLE:
+            raise ImportError("The 'claudette' package is required when using the Claude backend")
+            
        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")

And then simplify the method implementations to remove the imports:

    def ask_claude(self, question: str) -> str:
-        from claudette import Chat, contents, models
    def ask_gpt(self, question: str) -> str:
-        from openai import OpenAI
-        from openai.types.responses import ResponseInputTextParam
-        from openai.types.responses.response_input_param import Message
-        from openai.types.shared_params import Reasoning
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7df5972 and 0e02435.

📒 Files selected for processing (3)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/cratedb_about/cli.py
  • src/cratedb_about/model.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/cratedb_about/core.py (2)
src/cratedb_about/model.py (2)
  • Settings (25-57)
  • get_prompt (44-57)
src/cratedb_about/cli.py (1)
  • ask (20-41)

@amotl force-pushed the use branch 2 times, most recently from c4c7111 to d2de5c7 on April 16, 2025 23:17
Base automatically changed from build to main April 16, 2025 23:31
@amotl changed the title from "Added CLI program cratedb-about ask for conversations about CrateDB" to "Add CLI program cratedb-about ask for conversations about CrateDB" on Apr 16, 2025
amotl added 2 commits April 17, 2025 01:32
... with subcommands `ask` and `list-questions` for ad hoc conversations
about CrateDB.
@amotl merged commit 4051ff7 into main on Apr 16, 2025
3 checks passed
@amotl deleted the use branch on April 16, 2025 23:32