
Add CLI program cratedb-about ask for conversations about CrateDB #4

Merged

amotl merged 2 commits into main from use on Apr 16, 2025

Conversation

amotl (Member) commented Apr 16, 2025

coderabbitai bot commented Apr 16, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 13 minutes and 44 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between c4c7111 and f9794d8.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • docs/backlog.md (1 hunks)
  • docs/sandbox.md (1 hunks)
  • pyproject.toml (3 hunks)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)

Walkthrough

This update introduces a new command-line interface (CLI) tool named cratedb-about for interacting with CrateDB-related knowledge using large language models (LLMs) via subcommands such as ask and list-questions. The project configuration is updated to include new dependencies and scripts, and documentation is revised to reflect the new installation and usage instructions. Automated testing and dependency management are set up using GitHub Actions and Dependabot. Additional documentation is provided for development workflows and future feature planning. The .gitignore file is expanded to exclude more build artifacts.
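The subcommand shape described above can be sketched as follows. The real module uses click, so this stdlib-argparse version is only a stand-in; the example questions and the dispatch stub are illustrative, not the package's actual contents.

```python
import argparse

# Stand-in for cratedb_about.model.Example.questions (illustrative only).
EXAMPLE_QUESTIONS = [
    "What are the benefits of CrateDB?",
    "How do I apply sharding properly?",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cratedb-about")
    sub = parser.add_subparsers(dest="command", required=True)
    ask = sub.add_parser("ask", help="Ask a question about CrateDB")
    ask.add_argument("question")
    ask.add_argument("--backend", choices=["openai", "claude"], default="openai")
    sub.add_parser("list-questions", help="List a few example questions")
    return parser


def run(argv) -> str:
    args = build_parser().parse_args(argv)
    if args.command == "list-questions":
        return "\n".join(EXAMPLE_QUESTIONS)
    # The real implementation dispatches to an LLM backend here.
    return f"Question: {args.question}\nAnswer: ..."
```

The actual CLI wires this up through a `cratedb-about` console-script entry point in pyproject.toml, so the subcommands are invoked as `cratedb-about ask "..."` and `cratedb-about list-questions`.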

Changes

File(s) Change Summary
CHANGES.md, README.md Updated changelog to mention the new CLI and its subcommands; revised README to show new installation and usage instructions for cratedb-about and removed old manual build instructions.
pyproject.toml Added dependencies: claudette, openai, requests<3; added script entry for cratedb-about CLI; updated mypy configuration with mypy_path and package list.
.github/dependabot.yml, .github/workflows/tests.yml Added Dependabot configuration for automated dependency and GitHub Actions updates; introduced a GitHub Actions workflow for testing on Ubuntu with Python 3.9 and 3.13, including linting, building, and running CLI commands.
.gitignore Added patterns to ignore bdist.* and __pycache__ directories.
docs/backlog.md, docs/sandbox.md Added a backlog document outlining future iterations and a sandbox guide for setting up and using the project in a development environment.
src/cratedb_about/cli.py New CLI module using click, providing ask and list_questions commands for querying and listing CrateDB-related questions via LLM backends.
src/cratedb_about/core.py Introduced CrateDBConversation dataclass to manage conversational interactions with CrateDB knowledge via "openai" or "claude" LLM backends, including methods for asking questions and handling backend selection.
src/cratedb_about/model.py Added Example class with a list of CrateDB-related questions and Settings class for fetching and caching prompt instructions from a remote source for use in LLM interactions.
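The fetch-and-cache behaviour described for src/cratedb_about/model.py might look roughly like this; `make_prompt_loader` and the injected `fetch` callable are assumptions for illustration, not the module's actual API.

```python
import functools


def make_prompt_loader(fetch):
    """Wrap an HTTP fetch callable so the remote prompt text is
    downloaded at most once per process and then served from cache."""

    @functools.lru_cache(maxsize=1)
    def get_prompt(url: str) -> str:
        return fetch(url)

    return get_prompt
```

A Settings class could hold the resulting loader together with the llms-ctx.txt URL; repeated `ask` calls then reuse the cached instructions instead of re-downloading them.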

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as CLI (cratedb-about)
    participant Core as Core (CrateDBConversation)
    participant Model as Model (Settings/Example)
    participant LLM as LLM Backend

    User->>CLI: Run "cratedb-about ask <question>"
    CLI->>Core: ask(question)
    Core->>Model: get_prompt()
    Model->>Model: Fetch and cache prompt from URL
    Core->>LLM: Send prompt and question
    LLM-->>Core: Return answer
    Core-->>CLI: Return answer
    CLI-->>User: Display answer
sequenceDiagram
    participant User
    participant CLI as CLI (cratedb-about)
    participant Model as Model (Example)

    User->>CLI: Run "cratedb-about list-questions"
    CLI->>Model: Get example questions
    Model-->>CLI: Return questions list
    CLI-->>User: Display questions

Poem

🐇
In the meadow of code, a new CLI appears,
With questions and answers for CrateDB peers.
From Claude or OpenAI, wisdom flows out,
Just type and inquire—no need to shout!
Docs and tests now hop along too,
As this rabbit cheers, "Great work, woohoo!"



amotl (Member Author) left a comment

Just sharing a quick example of the possible outcome of this experiment.

examples/ask.py Outdated
Comment on lines +110 to +119
questions = [
"What are the benefits of CrateDB?",
"Tell me about why CrateDB is different.",
"Tell me about CrateDB Cloud.",
"How to use sequences with CrateDB?",
"CrateDB does not seem to provide an AUTOINCREMENT feature?",
"How do I apply sharding properly?",
"How much data can CrateDB store?",
"Please tell me how CrateDB stores data.",
]
amotl (Member Author) commented Apr 16, 2025

Let me share responses to a few questions for demonstration purposes. wdyt?

amotl (Member Author) commented:

$ uv run examples/ask.py "How much data can CrateDB store?"

Question

How much data can CrateDB store?

Answer

CrateDB has no built‑in “upper” data limit – it’s designed to grow horizontally. You simply add more nodes (and disks), and CrateDB will shard and distribute your tables across them.

Things that determine how much you can store:

• Disk capacity on your data‑nodes – each node holds as many shards (and their replicas) as its disks allow.
• Shard count per node – we recommend keeping primary + replica shards under 1 000 per node to preserve stability.
• Cluster size – more nodes means more aggregate storage (and more CPUs for parallel queries).

In practice you can start with a single machine (tens or hundreds of gigabytes) and grow to clusters holding terabytes or even petabytes of data. On CrateDB Cloud you can spin up clusters from an 8 GB free tier all the way up to multi‑TB enterprise clusters.

In short: your capacity is only limited by the hardware (or cloud plan) you choose and how you size your shards and nodes – there’s no hard‑coded CrateDB ceiling.

amotl (Member Author) commented:

$ uv run examples/ask.py "Please tell me how CrateDB stores data."

Question

Please tell me how CrateDB stores data.

Answer

Under the hood CrateDB is really a distributed, column‑oriented Lucene database. Here’s how it works end‑to‑end:

  1. Shared‑nothing cluster
    • Every CrateDB node is equal—there’s no single master for data.
    • Tables are split into shards; each shard is a “mini‐database” (a Lucene index) kept on one node.
    • You configure a shard count per table (and per partition, if you use partitioned tables). CrateDB then automatically places primaries and replicas of those shards across your nodes.

  2. Column‑oriented storage on Lucene
    • Inside each shard, CrateDB stores each column as its own compressed Lucene segment (“columnar store”).
    • Numeric, date, geo, full‑text and even nested‐object fields each get their own inverted‐ or BKD‐style Lucene index.
    • Every field is automatically indexed on write — there’s no manual mapping step.

  3. Ingestion and durability
    • Writes are first appended to a transaction log (translog) for durability.
    • CrateDB periodically “refreshes” shards—flushing memtables into new Lucene segments and making data visible to queries.
    • Under heavy write you’ll see many small segments; Lucene automatically merges them down in the background to optimize reads.

  4. Querying
    • When you run SQL, the query planner breaks it into per‑shard jobs.
    • Each shard scans just the columns it needs, leveraging Lucene’s indexes and columnar compression to minimize I/O.
    • Results are aggregated and joined across shards and nodes in a fully distributed fashion.

  5. Partitioning and sharding
    • Partitioned tables physically create one index (set of shards) per partition key (e.g. by date).
    • Sharding (including hash sharding on your chosen primary key) lets you spread data uniformly across the cluster for parallelism.

  6. On‑disk layout
    • Data lives under each node’s path.data (directory), in subdirectories per Lucene index (table or partition), then per shard.
    • Each shard directory holds its Lucene segment files plus a translog folder for recent writes.

  7. Replication and failover
    • Every shard has a configurable number of replicas.
    • If a node fails, its primaries or replicas are automatically re‑elected or re‑allocated to other nodes, ensuring high availability.

  8. Managed options
    • On CrateDB Cloud the same storage model applies—your data sits in Lucene shards on managed VMs, with automated backups/snapshots to object storage.

In short: your SQL tables are really Lucene‐indexed column stores, split into shards and balanced across a cluster. Every column (and nested JSON field) is indexed for blazing‑fast queries, and Lucene’s segment/merge architecture gives you near‑real‑time visibility on your data.

amotl (Member Author) commented:

$ uv run examples/ask.py "CrateDB does not seem to provide an AUTOINCREMENT feature?"

Question

CrateDB does not seem to provide an AUTOINCREMENT feature?

Answer

You’re right – CrateDB does not have a built‑in SERIAL or AUTO_INCREMENT type. Because it’s a distributed system it avoids the single‑node bottleneck of a coordinated counter. You have a few options instead:

• Use a timestamp as your key

CREATE TABLE events (
  id   BIGINT DEFAULT now() PRIMARY KEY,
  ...);

(Gaps are fine, but you lose true row‑count sequencing.)

• Use UUIDs

CREATE TABLE things (
  id   TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
  ...);

(Or bring in a UDF to generate UUIDv7 for roughly time‑ordered IDs.)

• Use IDs from your source system
If you’re importing from another database you can just keep its primary keys.

• Roll your own sequence table (optimistic concurrency):

  1. Create a tiny one‑shard table to hold your counters:
    CREATE TABLE sequences (
      name       TEXT PRIMARY KEY,
      last_value BIGINT
    ) CLUSTERED INTO 1 SHARDS;
    INSERT INTO sequences (name, last_value) VALUES ('myseq', 0);
  2. When you want a new ID, atomically bump it and grab the value:
    WITH s AS (
      SELECT last_value + 1 AS next,
             _seq_no, _primary_term
      FROM sequences
      WHERE name = 'myseq'
    ), u AS (
      UPDATE sequences
      SET last_value = s.next
      WHERE name = 'myseq'
        AND _seq_no        = s._seq_no
        AND _primary_term  = s._primary_term
      RETURNING s.next AS id
    )
    INSERT INTO mytable (id, …)
    SELECT id, … FROM u;
    If there’s a contention (nobody returned in u), you retry with exponential back‑off.

That last pattern builds a true sequential counter in a distributed‑safe way. Pick the approach that best fits your ingestion rate and consistency needs.
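The retry step of that pattern can be sketched like so; `try_next_id` is a hypothetical callable standing in for the UPDATE … RETURNING round trip, returning None when another writer won the optimistic-concurrency race.

```python
import random
import time


def next_sequence_id(try_next_id, max_attempts: int = 5, base_delay: float = 0.05) -> int:
    """Retry an optimistic-concurrency ID allocation with exponential backoff.

    `try_next_id` should return the new ID, or None when the conditional
    UPDATE matched no row because _seq_no/_primary_term moved underneath it.
    """
    for attempt in range(max_attempts):
        new_id = try_next_id()
        if new_id is not None:
            return new_id
        # Back off: base * 2^attempt, with jitter to de-synchronize writers.
        time.sleep(base_delay * (2 ** attempt) * random.random())
    raise RuntimeError("could not allocate a sequence ID after retries")
```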

amotl (Member Author) commented Apr 16, 2025

$ uv run examples/ask.py "Does CrateDB support SQLAlchemy?"

Yes – CrateDB has first‑class SQLAlchemy support via the “sqlalchemy‑cratedb” dialect.

• Project & docs: https://github.com/crate/sqlalchemy-cratedb / https://cratedb.com/docs/sqlalchemy-cratedb/
• Install:
  pip install sqlalchemy-cratedb
• Connection string (HTTP):
  crate://:@:4200/?schema=doc
• Example:
  from sqlalchemy import create_engine
  engine = create_engine("crate://crate@localhost:4200/?schema=doc")
  with engine.connect() as conn:
     result = conn.execute("SELECT * FROM my_table LIMIT 10")
     for row in result: print(row)

Evaluation: There are certainly errors inside.

@amotl amotl marked this pull request as ready for review April 16, 2025 19:09
coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
examples/ask.py (3)

36-38: Add type hints to improve code quality and maintainability.

Adding type hints would improve code clarity and enable static type checking, which could help catch issues early.

-    backend: t.Literal["claude", "openai"] = "openai"
-    use_knowledge: bool = True
+    backend: t.Literal["claude", "openai"] = "openai"
+    use_knowledge: bool = True
+
+    def ask(self, question: str) -> None:

This relates to the commented-out mypy in pyproject.toml - adding proper type hints would allow you to re-enable type checking.


122-140: Make backend configurable and add error handling in main.

The main function has hardcoded values and minimal error handling.

Consider enhancing the main function to:

  1. Accept backend configuration from environment variables or command line
  2. Add better error handling
  3. Make the interface more user-friendly
 def main():
     """
     Ready.
     """
+    # Get backend from environment variable or use default
+    backend = os.environ.get("CRATEDB_LLM_BACKEND", "openai").lower()
+    if backend not in ["openai", "claude"]:
+        print(f"Warning: Unknown backend '{backend}'. Using 'openai' instead.", file=sys.stderr)  # noqa: T201
+        backend = "openai"
+        
     wizard = CrateDBConversation(
-        backend="openai",
+        backend=backend,
         use_knowledge=True,
     )
     if len(sys.argv) >= 2:
         question = sys.argv[1]
         if question == "list-questions":
             print("A few example questions:\n")  # noqa: T201
-            print("\n\n".join(Example.questions))  # noqa: T201
+            for i, q in enumerate(Example.questions, 1):
+                print(f"{i}. {q}")  # noqa: T201
             return
+        elif question in ["-h", "--help"]:
+            print("Usage: python ask.py [QUESTION|list-questions|--help]")  # noqa: T201
+            print("\nExamples:")  # noqa: T201
+            print("  python ask.py 'What are the benefits of CrateDB?'")  # noqa: T201
+            print("  python ask.py list-questions")  # noqa: T201
+            print("\nEnvironment variables:")  # noqa: T201
+            print("  OPENAI_API_KEY - Required when using the OpenAI backend")  # noqa: T201
+            print("  CRATEDB_LLM_BACKEND - Set to 'openai' or 'claude' (default: openai)")  # noqa: T201
+            return
     else:
         question = Example.questions[4]
     print(f"Question: {question}\nAnswer:")  # noqa: T201
-    wizard.ask(question)
+    try:
+        wizard.ask(question)
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)  # noqa: T201
+        sys.exit(1)

1-11: Enhance script dependency declaration.

The script uses a comment-based format for dependencies, but it could be improved.

Consider converting to a more standardized format like pyproject.toml for the script, or add a requirements.txt file. This would make it easier for users to install dependencies:

# requirements.txt
claudette>=0.1.0
openai>=1.0.0
requests>=2.0.0

Or update the comment to better explain how to install these dependencies manually:

 # /// script
 # requires-python = ">=3.8"
 # dependencies = [
 #     "claudette",
 #     "openai",
 #     "requests",
 # ]
 # ///
+# To install dependencies: pip install claudette openai requests
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6876fed and 0599200.

📒 Files selected for processing (4)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • examples/ask.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🪛 Gitleaks (8.21.2)
README.md

36-36: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🔇 Additional comments (1)
CHANGES.md (1)

7-7: LGTM! Changelog entry is clear and concise.

The changelog entry correctly documents the addition of the new example program and follows the project's existing format.

amotl force-pushed the use branch 3 times, most recently from 1703851 to d7fb2be on April 16, 2025 at 22:30
amotl changed the title from "Add example program examples/ask.py for conversations about CrateDB" to "Add CLI program cratedb-about ask for conversations about CrateDB" on Apr 16, 2025
coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (7)
docs/backlog.md (1)

6-10: Improve Markdown formatting for URL references.

The Markdown linter identified bare URLs in the document. Consider using proper Markdown link syntax for better readability and maintainability.

-## Iteration +2
-- Unlock Discourse: https://community.cratedb.com/raw/1015
-- Unlock HTML resources: https://www.urltoany.com/url-to-markdown.
-  => Find the best standalone program.
-- Unlock GitHub projects: https://github.com/mattduck/gh2md
+## Iteration +2
+- Unlock Discourse: [Discourse Raw Content](https://community.cratedb.com/raw/1015)
+- Unlock HTML resources: [URL to Markdown Converter](https://www.urltoany.com/url-to-markdown).
+  => Find the best standalone program.
+- Unlock GitHub projects: [gh2md GitHub Repository](https://github.com/mattduck/gh2md)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

7-7: Bare URL used
null

(MD034, no-bare-urls)


8-8: Bare URL used
null

(MD034, no-bare-urls)


10-10: Bare URL used
null

(MD034, no-bare-urls)

src/cratedb_about/cli.py (1)

31-33: Use Click's echo functions instead of direct stdout writes.

Click provides click.echo() which handles Unicode and different output streams better than direct writes to stdout.

-    sys.stdout.write(f"Question: {question}\nAnswer:\n")
-    sys.stdout.write(wizard.ask(question))
-    sys.stdout.write("\n")
+    click.echo(f"Question: {question}")
+    click.echo("Answer:")
+    click.echo(wizard.ask(question))
.github/workflows/tests.yml (1)

1-65: Well-structured GitHub Actions workflow for automated testing

The workflow is well-configured with appropriate triggers for push events, pull requests, and manual dispatch. The job matrix strategy with multiple Python versions is a good practice for ensuring compatibility.

A few suggestions for enhancement:

  1. Consider adding a job timeout to prevent runaway workflows
  2. Add a step to verify the installation by checking the CLI version
  3. Consider creating a separate job for linting to provide faster feedback
# Add timeout to prevent runaway workflows
 jobs:

   test:
+    timeout-minutes: 15
     name: "
     Python ${{ matrix.python-version }}
     "

# Add version check after installation
     - name: Set up project
       run: |
         uv pip install --editable='.[develop,test]'
+        cratedb-about --version
src/cratedb_about/core.py (4)

1-2: Update or remove stale comment reference

The comment appears to be a reference to source inspiration, but it would be more valuable to provide context about what specific patterns or ideas were derived from the referenced URL.

Either expand on what was derived from the referenced website or consider removing if not directly relevant to understanding the code.


18-23: Method implementation looks good but could be more robust

The ask method effectively routes questions to the appropriate backend implementation.

Consider adding error handling to catch and provide helpful messages for potential API failures:

def ask(self, question: str) -> str:
    try:
        if self.backend == "openai":
            return self.ask_gpt(question)
        if self.backend == "claude":
            return self.ask_claude(question)
        raise NotImplementedError("Please select an available LLM backend")
+    except Exception as e:
+        return f"Error querying LLM backend: {str(e)}"

52-53: Optimize imports for better code organization

The import for Message is placed after instantiating the OpenAI client, which disrupts the logical flow of imports.

Move the import to be grouped with the other OpenAI imports:

from openai import OpenAI
from openai.types.responses import ResponseInputTextParam
from openai.types.shared_params import Reasoning
+from openai.types.responses.response_input_param import Message

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
-from openai.types.responses.response_input_param import Message

56-57: Document model selection rationale

The code comments out "gpt-4o" and uses "o4-mini" without explaining the rationale or performance/cost tradeoffs.

Add a comment explaining the model selection reasoning:

- # model="gpt-4o",  # noqa: ERA001
- model="o4-mini",
+ # Using o4-mini as it provides a good balance of performance and cost
+ # Consider upgrading to gpt-4o for more complex queries if needed
+ model="o4-mini",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0599200 and d7fb2be.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • CHANGES.md (1 hunks)
  • README.md (1 hunks)
  • docs/backlog.md (1 hunks)
  • docs/sandbox.md (1 hunks)
  • pyproject.toml (3 hunks)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • .gitignore
  • .github/dependabot.yml
  • docs/sandbox.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGES.md
  • README.md
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/cratedb_about/cli.py (2)
src/cratedb_about/core.py (2)
  • CrateDBConversation (10-79)
  • ask (18-23)
src/cratedb_about/model.py (1)
  • Example (6-21)
src/cratedb_about/core.py (2)
src/cratedb_about/model.py (2)
  • Settings (24-44)
  • get_prompt (35-44)
src/cratedb_about/cli.py (1)
  • ask (20-33)
🪛 markdownlint-cli2 (0.17.2)
docs/backlog.md

7-7: Bare URL used
null

(MD034, no-bare-urls)


8-8: Bare URL used
null

(MD034, no-bare-urls)


10-10: Bare URL used
null

(MD034, no-bare-urls)

🔇 Additional comments (5)
pyproject.toml (3)

71-78: LGTM! Dependencies are appropriate for the new feature.

The added dependencies (claudette, openai, requests) align well with the project's new conversational capabilities about CrateDB.


89-89: LGTM! CLI script entry point properly defined.

The CLI script entry point correctly references the main function in the CLI module.


125-129: LGTM! Proper mypy configuration for the new package.

The mypy configuration correctly specifies the source directory and package for type checking.

src/cratedb_about/model.py (1)

11-21: LGTM! Example questions list is comprehensive.

The example questions cover a good range of CrateDB topics and will be helpful for users exploring the CLI.

src/cratedb_about/core.py (1)

9-17: Add type annotations for return value of initialization methods

The dataclass could benefit from improved type annotations, especially for any post-initialization logic.

The dataclass structure is clean and the type annotations for the backend are appropriate, using a Literal type to restrict to valid options.

Configure the language model to support conversations about CrateDB.
"""

llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"

🛠️ Refactor suggestion

Avoid hardcoding GitHub commit hash in URL.

Using a specific commit hash in the URL creates a tight coupling to that exact version. If the repository changes or the commit becomes unavailable, this will break.

Consider one of these alternatives:

  1. Use a versioned, stable URL without commit hash
  2. Include the context file directly in the package
  3. Make the URL configurable through environment variables or settings
-    llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
+    llms_txt_url = os.getenv(
+        "CRATEDB_CONTEXT_URL",
+        "https://raw.githubusercontent.com/crate/about/main/build/llm/llms-ctx.txt"
+    )

Don't forget to add import os at the top of the file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
# At the top of the file, add:
import os
# …
# Around line 29, replace the hard‑coded URL with an env‑driven one:
- llms_txt_url = "https://raw.githubusercontent.com/crate/about/6876fedee57f59b693f37996e04f53c6446f2ad6/build/llm/llms-ctx.txt"
+ llms_txt_url = os.getenv(
+ "CRATEDB_CONTEXT_URL",
+ "https://raw.githubusercontent.com/crate/about/main/build/llm/llms-ctx.txt"
+ )

amotl (Member Author) commented:

Let's keep it hard-coded like this until v0.0.1.

coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/cratedb_about/core.py (1)

26-34: 🛠️ Refactor suggestion

Add error handling and respect use_knowledge flag in ask_claude method

The Claude implementation has several issues that need to be addressed:

  1. It doesn't respect the use_knowledge flag that controls whether to use CrateDB context
  2. It lacks error handling for API failures
  3. The hardcoded model index issue is acknowledged with a FIXME comment
def ask_claude(self, question: str) -> str:
    from claudette import Chat, contents, models

    # FIXME: The API does not provide name-based lookup.
    model = models[1]  # Sonnet 3.5
    chat = Chat(model, sp=Settings.instructions)
-    chat(Settings.get_prompt())
+    if self.use_knowledge:
+        try:
+            prompt = Settings.get_prompt()
+            if prompt:
+                chat(prompt)
+        except Exception as e:
+            print(f"Warning: Failed to load knowledge context: {e}", file=sys.stderr)  # noqa: T201
+    
+    try:
        result = chat(question)
        return contents(result)
+    except Exception as e:
+        raise RuntimeError(f"Claude API error: {str(e)}") from e
🧹 Nitpick comments (2)
src/cratedb_about/core.py (2)

94-94: Fix the type issue instead of using type ignore

The type ignore suggests there's a type compatibility issue that should be fixed properly.

-            input=input_messages,  # type: ignore[arg-type]
+            input=input_messages,

Consider checking the OpenAI API documentation to ensure the input_messages variable matches the expected type for the input parameter. If necessary, adjust the type annotation on line 60 to match what the API expects.


1-8: Add explicit attribution in file header comment

The header comment mentions the code is derived from a source, but it would be clearer to include more details about the license or permission.

 # Derived from: https://llmstxt.org/domains.html
+# Used with permission under MIT License (or specify the actual license)

This adds clarity regarding the legal usage of the derived code.

🛑 Comments failed to post (1)
src/cratedb_about/core.py (1)

10-18: 🛠️ Refactor suggestion

Add validation for API keys based on the selected backend

The class should validate the required environment variables during initialization to fail fast and provide clear error messages to users.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.
+
+    Requires:
+      - OpenAI backend: Set OPENAI_API_KEY environment variable
+      - Claude backend: Set ANTHROPIC_API_KEY environment variable
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
+
+    def __post_init__(self):
+        """Validate configuration on initialization."""
+        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
+            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
+        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
+            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")

This validation aligns with the docstring in the ask function in cli.py which mentions these requirements.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
      - OpenAI backend: Set OPENAI_API_KEY environment variable
      - Claude backend: Set ANTHROPIC_API_KEY environment variable
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True

    def __post_init__(self):
        """Validate configuration on initialization."""
        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")


@coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/cratedb_about/core.py (1)

26-40: 🛠️ Refactor suggestion

Add docstring and improve Claude implementation

The ask_claude method lacks a docstring and uses a hardcoded model index, which is brittle.

    def ask_claude(self, question: str) -> str:
+        """
+        Ask a question using Anthropic's Claude model.
+        
+        Requires the ANTHROPIC_API_KEY environment variable to be set.
+        Uses Claude Sonnet 3.5 model.
+        
+        Args:
+            question: The question to ask about CrateDB
+            
+        Returns:
+            str: The response from Claude
+            
+        Raises:
+            RuntimeError: If there's an error communicating with the Claude API
+        """
        from claudette import Chat, contents, models

-        model = models[1]  # Sonnet 3.5
+        # Use an explicit model reference with a fallback for library changes
+        model_name = "claude-3-5-sonnet-20240620"  # Sonnet 3.5
+        try:
+            # claudette exposes `models` as a list of model-name strings
+            model = next((m for m in models if model_name in str(m)), models[1])
+        except (IndexError, TypeError) as e:
+            # Fail with a clear error if the models structure changes
+            raise RuntimeError("Claude API: Unable to find a suitable model") from e
        chat = Chat(model, sp=Settings.instructions)
        if self.use_knowledge:
            try:
                chat(Settings.get_prompt())
            except Exception as e:
                print(f"Warning: Failed to load knowledge context: {e}", file=sys.stderr)  # noqa: T201
        try:
            result = chat(question)
            return contents(result)
        except Exception as e:
            raise RuntimeError(f"Claude API error: {e}") from e
🧹 Nitpick comments (7)
src/cratedb_about/core.py (7)

19-24: Add docstring to the ask method

The ask method lacks a docstring that explains its behavior and parameters.

    def ask(self, question: str) -> str:
+        """
+        Ask a question about CrateDB using the configured LLM backend.
+        
+        Args:
+            question: The question to ask about CrateDB
+            
+        Returns:
+            str: The response from the LLM
+            
+        Raises:
+            NotImplementedError: If an unsupported backend is specified
+            ValueError: If required environment variables are missing
+            RuntimeError: If there's an error communicating with the LLM API
+        """
        if self.backend == "openai":
            return self.ask_gpt(question)
        if self.backend == "claude":
            return self.ask_claude(question)
        raise NotImplementedError("Please select an available LLM backend")

66-81: Update type annotation to use built-in list

The code uses t.List, which is deprecated in favor of the built-in list since Python 3.9 (PEP 585).

-        input_messages: t.List[Message] = []
+        input_messages: list[Message] = []

91-101: Make OpenAI model configurable

The OpenAI model is hardcoded, which limits flexibility. Consider making it configurable.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
    - OPENAI_API_KEY environment variable when using "openai" backend
    - ANTHROPIC_API_KEY environment variable when using "claude" backend
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
+    openai_model: str = "o4-mini"

And then update the model usage:

        response = client.responses.create(
            # model="gpt-4o",  # noqa: ERA001
-            model="o4-mini",
+            model=self.openai_model,
            reasoning=Reasoning(
                effort="medium",
                # Your organization must be verified to generate reasoning summaries
                # summary="detailed",  # noqa: ERA001
            ),
            instructions=Settings.instructions,
            input=input_messages,  # type: ignore[arg-type]
        )

100-100: Fix type issue instead of using type ignore comment

The code uses a # type: ignore[arg-type] comment. Consider fixing the type issue properly.

-            input=input_messages,  # type: ignore[arg-type]
+            input=input_messages,

The error might be due to outdated type stubs for the OpenAI library. Consider looking into using the latest version of the library or adding proper type definitions.


92-98: Consider replacing commented code with configurable options

The commented code suggests alternate configurations that might be useful but are not enabled. Consider making these configurable options.

@dataclasses.dataclass
class CrateDBConversation:
    """
    Manage conversations about CrateDB.

    Requires:
    - OPENAI_API_KEY environment variable when using "openai" backend
    - ANTHROPIC_API_KEY environment variable when using "claude" backend
    """

    backend: t.Literal["claude", "openai"] = "openai"
    use_knowledge: bool = True
    openai_model: str = "o4-mini"
+    openai_reasoning_effort: t.Literal["low", "medium", "high"] = "medium"
+    openai_reasoning_summary: t.Optional[t.Literal["auto", "concise", "detailed"]] = None

Then update the reasoning configuration:

        response = client.responses.create(
-            # model="gpt-4o",  # noqa: ERA001
            model=self.openai_model,
            reasoning=Reasoning(
-                effort="medium",
-                # Your organization must be verified to generate reasoning summaries
-                # summary="detailed",  # noqa: ERA001
+                effort=self.openai_reasoning_effort,
+                **({"summary": self.openai_reasoning_summary} if self.openai_reasoning_summary else {})
            ),
            instructions=Settings.instructions,
            input=input_messages,
        )

61-64: Use environment variable validation from __post_init__

The API key validation in ask_gpt method would be redundant if the suggested __post_init__ validation is implemented.

Once the __post_init__ validation is implemented, you could simplify this code:

-        api_key = os.environ.get("OPENAI_API_KEY")
-        if not api_key:
-            raise ValueError("OPENAI_API_KEY environment variable is required")
-        client = OpenAI(api_key=api_key)
+        client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

1-8: Move backend-specific imports to the top level

Backend-specific imports are currently inside methods. Consider moving them to the top level with conditional imports to improve performance when the class is instantiated multiple times.

# Derived from: https://llmstxt.org/domains.html
import dataclasses
import os
import sys
import typing as t

+# Import backends conditionally to avoid errors if dependencies are missing
+CLAUDE_AVAILABLE = False
+OPENAI_AVAILABLE = False
+
+try:
+    from claudette import Chat, contents, models
+    CLAUDE_AVAILABLE = True
+except ImportError:
+    pass
+
+try:
+    from openai import OpenAI
+    from openai.types.responses import ResponseInputTextParam
+    from openai.types.responses.response_input_param import Message
+    from openai.types.shared_params import Reasoning
+    OPENAI_AVAILABLE = True
+except ImportError:
+    pass
+
from cratedb_about.model import Settings

Then update the __post_init__ method to check for backend availability:

    def __post_init__(self):
        """Validate configuration."""
+        if self.backend == "openai" and not OPENAI_AVAILABLE:
+            raise ImportError("The 'openai' package is required when using the OpenAI backend")
+        elif self.backend == "claude" and not CLAUDE_AVAILABLE:
+            raise ImportError("The 'claudette' package is required when using the Claude backend")
+            
        if self.backend == "openai" and not os.environ.get("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY environment variable is required when using 'openai' backend")
        elif self.backend == "claude" and not os.environ.get("ANTHROPIC_API_KEY"):
            raise ValueError("ANTHROPIC_API_KEY environment variable is required when using 'claude' backend")

And then simplify the method implementations to remove the imports:

    def ask_claude(self, question: str) -> str:
-        from claudette import Chat, contents, models
    def ask_gpt(self, question: str) -> str:
-        from openai import OpenAI
-        from openai.types.responses import ResponseInputTextParam
-        from openai.types.responses.response_input_param import Message
-        from openai.types.shared_params import Reasoning
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7df5972 and 0e02435.

📒 Files selected for processing (3)
  • src/cratedb_about/cli.py (1 hunks)
  • src/cratedb_about/core.py (1 hunks)
  • src/cratedb_about/model.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/cratedb_about/cli.py
  • src/cratedb_about/model.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/cratedb_about/core.py (2)
src/cratedb_about/model.py (2)
  • Settings (25-57)
  • get_prompt (44-57)
src/cratedb_about/cli.py (1)
  • ask (20-41)

@amotl force-pushed the use branch 2 times, most recently from c4c7111 to d2de5c7 on April 16, 2025 23:17
Base automatically changed from build to main April 16, 2025 23:31
@amotl changed the title from "Added CLI program cratedb-about ask for conversations about CrateDB" to "Add CLI program cratedb-about ask for conversations about CrateDB" on Apr 16, 2025
amotl added 2 commits April 17, 2025 01:32
... with subcommands `ask` and `list-questions` for ad hoc conversations
about CrateDB.
@amotl merged commit 4051ff7 into main on Apr 16, 2025
3 checks passed
@amotl deleted the use branch on April 16, 2025 23:32