feat(datasets): Add storage-type column to datasets table, and document datasets table. by kirkrodrigues · Pull Request #923 · y-scope/clp

kirkrodrigues · 2025-05-26T13:27:28Z

Description

As the title says. Adding the storage type column allows us to separate the storage type from the directory---the alternative being to use a URI containing both the type and directory (e.g., s3://abc), which would be harder for external tools to use.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Validated that compression & search work correctly with the clp-text and clp-json packages.
Validated that when using the clp-json package, the datasets table is created, with the added column.
Validated that the new docs render correctly.

Summary by CodeRabbit

New Features
- Added a new column for archive storage type to the datasets table, enhancing dataset metadata.
Documentation
- Introduced a new section detailing the metadata database and datasets table schema.
- Updated the developer guide index to include new design documentation topics.

…t datasets table.

coderabbitai · 2025-05-26T13:27:35Z

Walkthrough

A new non-nullable column, archive_storage_type, has been added to the datasets table schema in the utility responsible for table creation. Documentation has been updated to describe this schema change, and the developer guide's index now includes a link to the new metadata database design document.

Changes

File(s)	Change Summary
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py	Added `archive_storage_type` (VARCHAR(64), NOT NULL) to the `datasets` table schema and documented it.
docs/src/dev-guide/design-metadata-db.md	Added new section detailing the metadata database and the updated `datasets` table schema.
docs/src/dev-guide/index.md	Updated the index to include the new metadata database design document.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant MetadataDBUtils

    Caller->>MetadataDBUtils: create_datasets_table()
    MetadataDBUtils->>MetadataDBUtils: Create 'datasets' table with columns:\n- name\n- archive_storage_type\n- archive_storage_directory
    Caller-->>MetadataDBUtils: Table created (with new column)

Possibly related PRs

y-scope/clp#831: Refactors metadata database table creation into utility functions, including dataset-specific table creation, which is directly related to the schema changes in this PR.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6643d42 and 4c1650e.

📒 Files selected for processing (3)

components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1 hunks)
docs/src/dev-guide/design-metadata-db.md (1 hunks)
docs/src/dev-guide/index.md (1 hunks)

🧰 Additional context used

🪛 LanguageTool

docs/src/dev-guide/design-metadata-db.md

[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...

(AI_EN_LECTOR_MISSING_PUNCTUATION)

🪛 markdownlint-cli2 (0.17.2)

docs/src/dev-guide/design-metadata-db.md

10-10: Link fragments should be valid
null

(MD051, link-fragments)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: lint-check (ubuntu-latest)

🔇 Additional comments (2)

docs/src/dev-guide/index.md (1)

83-85: Validate the toctree entry addition
The design-metadata-db page has been added under the "Design" section. Confirm that docs/src/dev-guide/design-metadata-db.md exists and that the documentation build succeeds without broken links.

docs/src/dev-guide/design-metadata-db.md (1)

14-20: Confirm table directive and caption syntax
Ensure that the (table-1)= anchor and :::{card} directive correctly render Table 1 in MyST. If the caption does not appear, you may need a blank line between the anchor and directive or switch to a standard MyST figure block (::: figure).

🧰 Tools

🪛 LanguageTool

[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...

(AI_EN_LECTOR_MISSING_PUNCTUATION)

coderabbitai · 2025-05-26T13:30:52Z

            `name` VARCHAR(255) NOT NULL,
+            `archive_storage_type` VARCHAR(64) NOT NULL,
            `archive_storage_directory` VARCHAR(4096) NOT NULL,
            PRIMARY KEY (`name`)


💡 Verification agent

🧩 Analysis chain

Handle backward compatibility for the new non-nullable column
Adding archive_storage_type as NOT NULL without a DEFAULT will cause CREATE TABLE IF NOT EXISTS to skip on existing tables, leaving the column missing, and any migration scripts may fail. Consider:

Adding a DEFAULT '' clause to the column definition for new tables.

Running an ALTER TABLE ... ADD COLUMN IF NOT EXISTS with NOT NULL DEFAULT '' before the CREATE TABLE to migrate existing tables.

Documenting the migration path.

Also, confirm that archive_storage_type matches the intended name from the PR objectives (storage_type).

# Before CREATE TABLE +db_cursor.execute( + f""" + ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}` + ADD COLUMN IF NOT EXISTS `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT ''; + """ +)

🏁 Script executed:

#!/bin/bash # Show the CREATE TABLE definition around line 109 rg --context 5 "CREATE TABLE IF NOT EXISTS" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py # Check for any existing ALTER TABLE statements in this file rg -n "ALTER TABLE" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py # Locate all mentions of archive_storage_type to see if defaults or migrations exist rg -n "archive_storage_type" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py

Length of output: 3047

Ensure backward-compatibility for archive_storage_type
Adding a new NOT NULL column without a default means CREATE TABLE IF NOT EXISTS will skip updating existing tables, so older tables won’t get the new column and any inserts/reads will fail. Please:

• Add a default value for new tables:

db_cursor.execute( f""" - CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` ( + CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` ( `name` VARCHAR(255) NOT NULL, - `archive_storage_type` VARCHAR(64) NOT NULL, + `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '', `archive_storage_directory` VARCHAR(4096) NOT NULL, PRIMARY KEY (`name`) )

• Pre-migrate existing tables before the CREATE TABLE:

db_cursor.execute( f""" ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}` ADD COLUMN IF NOT EXISTS `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT ''; """ )

• Update your documentation or migration guide to cover this step.

• Confirm that the column name (archive_storage_type) matches the intended PR spec (e.g. storage_type).

🤖 Prompt for AI Agents

In components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py between lines 109 and 112, the new column `archive_storage_type` is defined as NOT NULL without a default value, which can cause issues with existing tables when using CREATE TABLE IF NOT EXISTS. To fix this, add a DEFAULT '' clause to the column definition for new tables, and ensure that existing tables are migrated with an ALTER TABLE statement that adds the column with DEFAULT '' before creating the table. Also, verify that the column name matches the intended name from the PR, such as `storage_type`. Document these migration steps accordingly.

coderabbitai · 2025-05-26T13:30:52Z

+a CLP Presto connector) to locate archives that belong to a dataset. [Table 1](#table-1) describes
+the high-level schema of the datasets table. Each column is a string with a length that should
+accommodate all expected values.
+


🧹 Nitpick (assertive)

Minor punctuation fix
Add a period at the end of the sentence on line 13 to complete the statement.

🤖 Prompt for AI Agents

In docs/src/dev-guide/design-metadata-db.md on line 13, the sentence is missing a period at the end. Add a period to the end of the sentence to complete it.

coderabbitai · 2025-05-26T13:30:52Z

+# Metadata database
+
+:::{warning}
+🚧 This section is still under construction.
+:::
+


🧹 Nitpick (assertive)

Review "under construction" warning style
The 🚧 emoji within the {warning} admonition is clear, but for consistency with our docs style you may want to use a text-only warning or provide a brief note on what content is pending.

🤖 Prompt for AI Agents

In docs/src/dev-guide/design-metadata-db.md lines 1 to 6, the warning uses a 🚧 emoji within a {warning} admonition, which may not align with our style guidelines. Replace the emoji with a text-only warning or add a brief note explaining what content is pending to improve consistency and clarity.

Copilot

Pull Request Overview

This PR introduces a new column, archive_storage_type, to the datasets table to better separate storage type from the directory. It also adds documentation for the metadata database and updates the developer guide index.

Adds the archive_storage_type column to the datasets table in clp_metadata_db_utils.py.
Introduces a new documentation file, design-metadata-db.md, detailing the updated schema.
Updates the developer guide index to include the new design-metadata-db section.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
docs/src/dev-guide/index.md	Adds the new design-metadata-db entry in the guide index
docs/src/dev-guide/design-metadata-db.md	New documentation for the metadata database and datasets table
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py	Updates SQL table creation to add the archive_storage_type column

Copilot · 2025-05-26T23:36:57Z

 design-project-structure
+
 design-kv-ir-streams/index
+design-metadata-db


[nitpick] Ensure that the formatting and ordering of documentation entries in the index file is consistent with other entries to enhance readability.

haiqi96 · 2025-05-27T18:20:13Z

 :hidden:

 design-project-structure
+


why an extra empty line?

For grouping. I wanted to add comments above each group, but it looks like MyST doesn't support comments in the table of contents. I can remove it so it's not confusing.

I see, we can keep the line.

haiqi96

looks good

…nt datasets table. (y-scope#923)

feat(datasets): Add storage-type column to datasets table and documen…

4c1650e

…t datasets table.

kirkrodrigues requested a review from a team as a code owner May 26, 2025 13:27

coderabbitai Bot reviewed May 26, 2025

View reviewed changes

Merge branch 'main' into datasets-add-storage-type

b936f0b

kirkrodrigues requested a review from Copilot May 26, 2025 23:36

Copilot AI reviewed May 26, 2025

View reviewed changes

kirkrodrigues requested a review from haiqi96 May 26, 2025 23:41

kirkrodrigues changed the title ~~feat(datasets): Add storage-type column to datasets table and document datasets table.~~ feat(datasets): Add storage-type column to datasets table, and document datasets table. May 26, 2025

haiqi96 reviewed May 27, 2025

View reviewed changes

haiqi96 approved these changes May 27, 2025

View reviewed changes

Merge branch 'main' into datasets-add-storage-type

f9bf209

kirkrodrigues merged commit 56906ab into y-scope:main May 28, 2025
5 checks passed

kirkrodrigues deleted the datasets-add-storage-type branch May 28, 2025 20:49

kirkrodrigues added a commit to Bill-hbrhbr/clp that referenced this pull request Jun 9, 2025

Update to reflect changes from y-scope#923.

2e0ba43

coderabbitai Bot mentioned this pull request Jun 23, 2025

fix(package): Remove archive_storage_type column from the datasets table since it should be a cluster-level setting. #1029

Merged

3 tasks

junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026

feat(datasets): Add storage-type column to datasets table, and docume…

693a2e2

…nt datasets table. (y-scope#923)

Conversation

kirkrodrigues commented May 26, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 26, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 26, 2025

Choose a reason for hiding this comment

Uh oh!

haiqi96 May 27, 2025

Choose a reason for hiding this comment

Uh oh!

kirkrodrigues May 27, 2025

Choose a reason for hiding this comment

Uh oh!

haiqi96 May 27, 2025

Choose a reason for hiding this comment

Uh oh!

haiqi96 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kirkrodrigues commented May 26, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)