Skip to content

feat(datasets): Add storage-type column to datasets table, and document datasets table.#923

Merged
kirkrodrigues merged 3 commits into
y-scope:mainfrom
kirkrodrigues:datasets-add-storage-type
May 28, 2025
Merged

feat(datasets): Add storage-type column to datasets table, and document datasets table.#923
kirkrodrigues merged 3 commits into
y-scope:mainfrom
kirkrodrigues:datasets-add-storage-type

Conversation

@kirkrodrigues

@kirkrodrigues kirkrodrigues commented May 26, 2025

Copy link
Copy Markdown
Member

Description

As the title says. Adding the storage type column allows us to separate the storage type from the directory---the alternative being to use a URI containing both the type and directory (e.g., s3://abc), which would be harder for external tools to use.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Validated that compression & search work correctly with the clp-text and clp-json packages.
  • Validated that when using the clp-json package, the datasets table is created, with the added column.
  • Validated that the new docs render correctly.

Summary by CodeRabbit

  • New Features
    • Added a new column for archive storage type to the datasets table, enhancing dataset metadata.
  • Documentation
    • Introduced a new section detailing the metadata database and datasets table schema.
    • Updated the developer guide index to include new design documentation topics.

@kirkrodrigues kirkrodrigues requested a review from a team as a code owner May 26, 2025 13:27
@coderabbitai

coderabbitai Bot commented May 26, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

A new non-nullable column, archive_storage_type, has been added to the datasets table schema in the utility responsible for table creation. Documentation has been updated to describe this schema change, and the developer guide's index now includes a link to the new metadata database design document.

Changes

File(s) Change Summary
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py Added archive_storage_type (VARCHAR(64), NOT NULL) to the datasets table schema and documented it.
docs/src/dev-guide/design-metadata-db.md Added new section detailing the metadata database and the updated datasets table schema.
docs/src/dev-guide/index.md Updated the index to include the new metadata database design document.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant MetadataDBUtils

    Caller->>MetadataDBUtils: create_datasets_table()
    MetadataDBUtils->>MetadataDBUtils: Create 'datasets' table with columns:\n- name\n- archive_storage_type\n- archive_storage_directory
    Caller-->>MetadataDBUtils: Table created (with new column)
Loading

Possibly related PRs

  • y-scope/clp#831: Refactors metadata database table creation into utility functions, including dataset-specific table creation, which is directly related to the schema changes in this PR.
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6643d42 and 4c1650e.

📒 Files selected for processing (3)
  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1 hunks)
  • docs/src/dev-guide/design-metadata-db.md (1 hunks)
  • docs/src/dev-guide/index.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/dev-guide/design-metadata-db.md

[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...

(AI_EN_LECTOR_MISSING_PUNCTUATION)

🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-guide/design-metadata-db.md

10-10: Link fragments should be valid
null

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (2)
docs/src/dev-guide/index.md (1)

83-85: Validate the toctree entry addition
The design-metadata-db page has been added under the "Design" section. Confirm that docs/src/dev-guide/design-metadata-db.md exists and that the documentation build succeeds without broken links.

docs/src/dev-guide/design-metadata-db.md (1)

14-20: Confirm table directive and caption syntax
Ensure that the (table-1)= anchor and :::{card} directive correctly render Table 1 in MyST. If the caption does not appear, you may need a blank line between the anchor and directive or switch to a standard MyST figure block (::: figure).

🧰 Tools
🪛 LanguageTool

[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...

(AI_EN_LECTOR_MISSING_PUNCTUATION)

Comment on lines 109 to 112
`name` VARCHAR(255) NOT NULL,
`archive_storage_type` VARCHAR(64) NOT NULL,
`archive_storage_directory` VARCHAR(4096) NOT NULL,
PRIMARY KEY (`name`)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Handle backward compatibility for the new non-nullable column
Adding archive_storage_type as NOT NULL without a DEFAULT will cause CREATE TABLE IF NOT EXISTS to skip on existing tables, leaving the column missing, and any migration scripts may fail. Consider:

  • Adding a DEFAULT '' clause to the column definition for new tables.
  • Running an ALTER TABLE ... ADD COLUMN IF NOT EXISTS with NOT NULL DEFAULT '' before the CREATE TABLE to migrate existing tables.
  • Documenting the migration path.

Also, confirm that archive_storage_type matches the intended name from the PR objectives (storage_type).

 # Before CREATE TABLE
+db_cursor.execute(
+    f"""
+    ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}`
+    ADD COLUMN IF NOT EXISTS `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '';
+    """
+)

🏁 Script executed:

#!/bin/bash
# Show the CREATE TABLE definition around line 109
rg --context 5 "CREATE TABLE IF NOT EXISTS" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py

# Check for any existing ALTER TABLE statements in this file
rg -n "ALTER TABLE" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py

# Locate all mentions of archive_storage_type to see if defaults or migrations exist
rg -n "archive_storage_type" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py

Length of output: 3047


Ensure backward-compatibility for archive_storage_type
Adding a new NOT NULL column without a default means CREATE TABLE IF NOT EXISTS will skip updating existing tables, so older tables won’t get the new column and any inserts/reads will fail. Please:

• Add a default value for new tables:

 db_cursor.execute(
     f"""
-    CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` (
+    CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` (
         `name` VARCHAR(255) NOT NULL,
-        `archive_storage_type` VARCHAR(64) NOT NULL,
+        `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '',
         `archive_storage_directory` VARCHAR(4096) NOT NULL,
         PRIMARY KEY (`name`)
     )

• Pre-migrate existing tables before the CREATE TABLE:

db_cursor.execute(
    f"""
    ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}`
    ADD COLUMN IF NOT EXISTS
      `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '';
    """
)

• Update your documentation or migration guide to cover this step.

• Confirm that the column name (archive_storage_type) matches the intended PR spec (e.g. storage_type).

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py between lines
109 and 112, the new column `archive_storage_type` is defined as NOT NULL
without a default value, which can cause issues with existing tables when using
CREATE TABLE IF NOT EXISTS. To fix this, add a DEFAULT '' clause to the column
definition for new tables, and ensure that existing tables are migrated with an
ALTER TABLE statement that adds the column with DEFAULT '' before creating the
table. Also, verify that the column name matches the intended name from the PR,
such as `storage_type`. Document these migration steps accordingly.

a CLP Presto connector) to locate archives that belong to a dataset. [Table 1](#table-1) describes
the high-level schema of the datasets table. Each column is a string with a length that should
accommodate all expected values.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Minor punctuation fix
Add a period at the end of the sentence on line 13 to complete the statement.

🤖 Prompt for AI Agents
In docs/src/dev-guide/design-metadata-db.md on line 13, the sentence is missing
a period at the end. Add a period to the end of the sentence to complete it.

Comment on lines +1 to +6
# Metadata database

:::{warning}
🚧 This section is still under construction.
:::

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Review "under construction" warning style
The 🚧 emoji within the {warning} admonition is clear, but for consistency with our docs style you may want to use a text-only warning or provide a brief note on what content is pending.

🤖 Prompt for AI Agents
In docs/src/dev-guide/design-metadata-db.md lines 1 to 6, the warning uses a 🚧
emoji within a {warning} admonition, which may not align with our style
guidelines. Replace the emoji with a text-only warning or add a brief note
explaining what content is pending to improve consistency and clarity.

@kirkrodrigues kirkrodrigues requested a review from Copilot May 26, 2025 23:36

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new column, archive_storage_type, to the datasets table to better separate storage type from the directory. It also adds documentation for the metadata database and updates the developer guide index.

  • Adds the archive_storage_type column to the datasets table in clp_metadata_db_utils.py.
  • Introduces a new documentation file, design-metadata-db.md, detailing the updated schema.
  • Updates the developer guide index to include the new design-metadata-db section.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
docs/src/dev-guide/index.md Adds the new design-metadata-db entry in the guide index
docs/src/dev-guide/design-metadata-db.md New documentation for the metadata database and datasets table
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py Updates SQL table creation to add the archive_storage_type column

design-project-structure

design-kv-ir-streams/index
design-metadata-db

Copilot AI May 26, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Ensure that the formatting and ordering of documentation entries in the index file is consistent with other entries to enhance readability.

Copilot uses AI. Check for mistakes.
@kirkrodrigues kirkrodrigues requested a review from haiqi96 May 26, 2025 23:41
@kirkrodrigues kirkrodrigues changed the title feat(datasets): Add storage-type column to datasets table and document datasets table. feat(datasets): Add storage-type column to datasets table, and document datasets table. May 26, 2025
:hidden:

design-project-structure

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why an extra empty line?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For grouping. I wanted to add comments above each group, but it looks like MyST doesn't support comments in the table of contents. I can remove it so it's not confusing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can keep the line.

@haiqi96 haiqi96 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@kirkrodrigues kirkrodrigues merged commit 56906ab into y-scope:main May 28, 2025
5 checks passed
@kirkrodrigues kirkrodrigues deleted the datasets-add-storage-type branch May 28, 2025 20:49
kirkrodrigues added a commit to Bill-hbrhbr/clp that referenced this pull request Jun 9, 2025
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants