feat(datasets): Add storage-type column to datasets table, and document datasets table.#923
Conversation
…t datasets table.
WalkthroughA new non-nullable column, Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant MetadataDBUtils
Caller->>MetadataDBUtils: create_datasets_table()
MetadataDBUtils->>MetadataDBUtils: Create 'datasets' table with columns:\n- name\n- archive_storage_type\n- archive_storage_directory
Caller-->>MetadataDBUtils: Table created (with new column)
Possibly related PRs
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py(1 hunks)docs/src/dev-guide/design-metadata-db.md(1 hunks)docs/src/dev-guide/index.md(1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/dev-guide/design-metadata-db.md
[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
🪛 markdownlint-cli2 (0.17.2)
docs/src/dev-guide/design-metadata-db.md
10-10: Link fragments should be valid
null
(MD051, link-fragments)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (2)
docs/src/dev-guide/index.md (1)
83-85: Validate the toctree entry addition
Thedesign-metadata-dbpage has been added under the "Design" section. Confirm thatdocs/src/dev-guide/design-metadata-db.mdexists and that the documentation build succeeds without broken links.docs/src/dev-guide/design-metadata-db.md (1)
14-20: Confirm table directive and caption syntax
Ensure that the(table-1)=anchor and:::{card}directive correctly render Table 1 in MyST. If the caption does not appear, you may need a blank line between the anchor and directive or switch to a standard MyST figure block (::: figure).🧰 Tools
🪛 LanguageTool
[uncategorized] ~14-~14: A punctuation mark might be missing here.
Context: ...mmodate all expected values. (table-1)= :::{card} | Column name |...(AI_EN_LECTOR_MISSING_PUNCTUATION)
| `name` VARCHAR(255) NOT NULL, | ||
| `archive_storage_type` VARCHAR(64) NOT NULL, | ||
| `archive_storage_directory` VARCHAR(4096) NOT NULL, | ||
| PRIMARY KEY (`name`) |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Handle backward compatibility for the new non-nullable column
Adding archive_storage_type as NOT NULL without a DEFAULT will cause CREATE TABLE IF NOT EXISTS to skip on existing tables, leaving the column missing, and any migration scripts may fail. Consider:
- Adding a
DEFAULT ''clause to the column definition for new tables. - Running an
ALTER TABLE ... ADD COLUMN IF NOT EXISTSwithNOT NULL DEFAULT ''before theCREATE TABLEto migrate existing tables. - Documenting the migration path.
Also, confirm that archive_storage_type matches the intended name from the PR objectives (storage_type).
# Before CREATE TABLE
+db_cursor.execute(
+ f"""
+ ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}`
+ ADD COLUMN IF NOT EXISTS `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '';
+ """
+)🏁 Script executed:
#!/bin/bash
# Show the CREATE TABLE definition around line 109
rg --context 5 "CREATE TABLE IF NOT EXISTS" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
# Check for any existing ALTER TABLE statements in this file
rg -n "ALTER TABLE" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
# Locate all mentions of archive_storage_type to see if defaults or migrations exist
rg -n "archive_storage_type" -n components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.pyLength of output: 3047
Ensure backward-compatibility for archive_storage_type
Adding a new NOT NULL column without a default means CREATE TABLE IF NOT EXISTS will skip updating existing tables, so older tables won’t get the new column and any inserts/reads will fail. Please:
• Add a default value for new tables:
db_cursor.execute(
f"""
- CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` (
+ CREATE TABLE IF NOT EXISTS `{table_prefix}{DATASETS_TABLE_SUFFIX}` (
`name` VARCHAR(255) NOT NULL,
- `archive_storage_type` VARCHAR(64) NOT NULL,
+ `archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '',
`archive_storage_directory` VARCHAR(4096) NOT NULL,
PRIMARY KEY (`name`)
)• Pre-migrate existing tables before the CREATE TABLE:
db_cursor.execute(
f"""
ALTER TABLE `{table_prefix}{DATASETS_TABLE_SUFFIX}`
ADD COLUMN IF NOT EXISTS
`archive_storage_type` VARCHAR(64) NOT NULL DEFAULT '';
"""
)• Update your documentation or migration guide to cover this step.
• Confirm that the column name (archive_storage_type) matches the intended PR spec (e.g. storage_type).
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py between lines
109 and 112, the new column `archive_storage_type` is defined as NOT NULL
without a default value, which can cause issues with existing tables when using
CREATE TABLE IF NOT EXISTS. To fix this, add a DEFAULT '' clause to the column
definition for new tables, and ensure that existing tables are migrated with an
ALTER TABLE statement that adds the column with DEFAULT '' before creating the
table. Also, verify that the column name matches the intended name from the PR,
such as `storage_type`. Document these migration steps accordingly.
| a CLP Presto connector) to locate archives that belong to a dataset. [Table 1](#table-1) describes | ||
| the high-level schema of the datasets table. Each column is a string with a length that should | ||
| accommodate all expected values. | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Minor punctuation fix
Add a period at the end of the sentence on line 13 to complete the statement.
🤖 Prompt for AI Agents
In docs/src/dev-guide/design-metadata-db.md on line 13, the sentence is missing
a period at the end. Add a period to the end of the sentence to complete it.
| # Metadata database | ||
|
|
||
| :::{warning} | ||
| 🚧 This section is still under construction. | ||
| ::: | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Review "under construction" warning style
The 🚧 emoji within the {warning} admonition is clear, but for consistency with our docs style you may want to use a text-only warning or provide a brief note on what content is pending.
🤖 Prompt for AI Agents
In docs/src/dev-guide/design-metadata-db.md lines 1 to 6, the warning uses a 🚧
emoji within a {warning} admonition, which may not align with our style
guidelines. Replace the emoji with a text-only warning or add a brief note
explaining what content is pending to improve consistency and clarity.
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a new column, archive_storage_type, to the datasets table to better separate storage type from the directory. It also adds documentation for the metadata database and updates the developer guide index.
- Adds the archive_storage_type column to the datasets table in clp_metadata_db_utils.py.
- Introduces a new documentation file, design-metadata-db.md, detailing the updated schema.
- Updates the developer guide index to include the new design-metadata-db section.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/src/dev-guide/index.md | Adds the new design-metadata-db entry in the guide index |
| docs/src/dev-guide/design-metadata-db.md | New documentation for the metadata database and datasets table |
| components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py | Updates SQL table creation to add the archive_storage_type column |
| design-project-structure | ||
|
|
||
| design-kv-ir-streams/index | ||
| design-metadata-db |
There was a problem hiding this comment.
[nitpick] Ensure that the formatting and ordering of documentation entries in the index file is consistent with other entries to enhance readability.
| :hidden: | ||
|
|
||
| design-project-structure | ||
|
|
There was a problem hiding this comment.
For grouping. I wanted to add comments above each group, but it looks like MyST doesn't support comments in the table of contents. I can remove it so it's not confusing.
There was a problem hiding this comment.
I see, we can keep the line.
…nt datasets table. (y-scope#923)
Description
As the title says. Adding the storage type column allows us to separate the storage type from the directory---the alternative being to use a URI containing both the type and directory (e.g.,
s3://abc), which would be harder for external tools to use.Checklist
breaking change.
Validation performed
Summary by CodeRabbit