fix(clp-package): Update stale `validate_dataset_exists` references and add default dataset fallback in `sbin` scripts (fixes #2059). by junhaoliao · Pull Request #2060 · y-scope/clp

junhaoliao · 2026-03-03T19:40:24Z

Description

#1992 renamed `validate_dataset_exists` to `validate_datasets_exist` (singular to plural)
in `native/utils.py` and updated its signature to accept `list[str]`, but two callers were
not updated:

- `native/decompress.py`: import (line 39) and call site (line 151)
- `native/archive_manager.py`: import (line 31) and call site (line 203)

Since these are module-level imports, the resulting `ImportError` prevents the entire
`decompress` module from loading, making all `decompress.sh` subcommands (`x`, `i`, `j`)
non-functional.
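
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the CLP code: `fake_native_utils` and the caller source are hypothetical stand-ins, and the only point is that a stale name in a module-level `from ... import ...` aborts loading before any function in the importing module can run.

```python
import sys
import types

# Hypothetical stand-in for the utils module after the rename:
# only the new, plural name is defined.
fake_utils = types.ModuleType("fake_native_utils")


def validate_datasets_exist(datasets: list[str]) -> None:
    """Placeholder body; the real function's behavior is not modeled here."""


fake_utils.validate_datasets_exist = validate_datasets_exist
sys.modules["fake_native_utils"] = fake_utils

# Source of a hypothetical caller that still imports the old, singular name.
STALE_CALLER_SOURCE = """
from fake_native_utils import validate_dataset_exists

def main():
    return "never reached"
"""


def try_load(source: str) -> str:
    """Execute module source and report whether it loaded."""
    namespace: dict = {}
    try:
        exec(source, namespace)
    except ImportError as err:
        return f"ImportError: {err}"
    return "loaded"


print(try_load(STALE_CALLER_SOURCE))
print(try_load("from fake_native_utils import validate_datasets_exist"))
```

The first call reports an `ImportError` and `main` is never defined; the second call, using the renamed function, loads normally.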

This PR:

- Updates both files to import and call `validate_datasets_exist`, wrapping the single
  dataset string in a list to match the new signature.
- Adds a default dataset fallback in `native/decompress.py` for the `j` (`extract-json`)
  subcommand, matching the pattern used by every other script (`compress.py`,
  the non-native `decompress.py`, `search.py`, `archive_manager.py`):
```
dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
```
Previously, `native/decompress.py` errored when `--dataset` was omitted, unlike all
other scripts, which fall back to the "default" dataset.
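
The two changes can be sketched together as follows. This is an illustration under stated assumptions rather than the code in the PR: the value of `CLP_DEFAULT_DATASET_NAME`, the `existing` parameter, the body of `validate_datasets_exist`, and the helper `resolve_and_validate` are all hypothetical; only the fallback expression and the list-wrapping come from the description.

```python
from typing import Optional

# Assumed value: the PR text refers to the fallback as the "default" dataset.
CLP_DEFAULT_DATASET_NAME = "default"


def validate_datasets_exist(datasets: list[str], existing: set[str]) -> None:
    """Hypothetical stand-in: raise if any requested dataset is unknown."""
    missing = [name for name in datasets if name not in existing]
    if missing:
        raise ValueError(f"Datasets not found: {missing}")


def resolve_and_validate(dataset: Optional[str], existing: set[str]) -> str:
    # Fallback pattern quoted in the PR description.
    dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
    # The validator accepts a list, so the single name is wrapped.
    validate_datasets_exist([dataset], existing)
    return dataset


existing = {"default", "myds"}
print(resolve_and_validate(None, existing))    # default
print(resolve_and_validate("myds", existing))  # myds
```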

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Tested on the built package (built with `task`, then started with `sbin/start-clp.sh`)
with both the clp-JSON (`clp-s`) and clp-TEXT (`clp`) storage engines.

Part A: clp-JSON (clp-s engine)

Setup

Task: Build package, start CLP, and compress test data into both the default and a
named dataset.

Commands:

$ cd build/clp-package
$ ./sbin/start-clp.sh
$ ./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl
$ ./sbin/compress.sh --timestamp-key timestamp --dataset myds ~/samples/postgresql.jsonl

Output:

2026-03-03T19:24:03.148 INFO [controller] Setting up environment for bundling database...
...
2026-03-03T19:24:16.496 INFO [controller] Started CLP.

2026-03-03T19:24:21.503 INFO [compress] Compression job 1 submitted.
2026-03-03T19:24:24.007 INFO [compress] Compression finished.
2026-03-03T19:24:24.007 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 177.04MB/s.

2026-03-03T19:24:25.143 INFO [compress] Compression job 2 submitted.
2026-03-03T19:24:27.649 INFO [compress] Compression finished.
2026-03-03T19:24:27.649 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 182.56MB/s.

Archive IDs:

$ ls var/data/archives/default/
b39f5ee0-b0ae-4a39-be88-8651c3e0df19

$ ls var/data/archives/myds/
f0ac92ae-eeec-4245-b742-54b013933574

Scenario 1: `extract-json` without `--dataset` (default dataset fallback)

Task: Verify that j without --dataset falls back to the "default" dataset
instead of erroring with "Dataset unspecified".

Command:

$ ./sbin/decompress.sh j b39f5ee0-b0ae-4a39-be88-8651c3e0df19

Output:

2026-03-03T19:24:38.532 INFO [decompress] Finished extraction job 1.

Explanation: Before this fix, this command would fail with an ImportError. Even after
fixing the import, the old code would reject this call with "Dataset unspecified, but must
be specified for command j". Now it correctly falls back to the "default" dataset,
matching the behavior of all other scripts.

Scenario 2: `extract-json` with `--dataset`

Task: Verify j with an explicit dataset filter works.

Command:

$ ./sbin/decompress.sh j f0ac92ae-eeec-4245-b742-54b013933574 --dataset myds

Output:

2026-03-03T19:24:41.830 INFO [decompress] Finished extraction job 2.

Scenario 3: `extract-json` with `--target-chunk-size`

Task: Verify j with a custom chunk size works.

Command:

$ ./sbin/decompress.sh j f0ac92ae-eeec-4245-b742-54b013933574 --dataset myds --target-chunk-size 1048576

Output:

2026-03-03T19:24:44.801 INFO [decompress] Finished extraction job 3.

Scenario 4: `extract-ir` by `--orig-file-id`

Task: Verify the i subcommand loads without ImportError when using
--orig-file-id.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-id test-id

Output:

2026-03-03T19:25:15.524 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Explanation: The error is expected — IR extraction is only supported for the clp
storage engine, and the test data was compressed with clp-s. The important thing is that
the module loaded successfully (no ImportError).

Scenario 5: `extract-ir` by `--orig-file-path`

Task: Verify the i subcommand loads without ImportError when using
--orig-file-path.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-path /some/path

Output:

2026-03-03T19:25:21.271 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Scenario 6: `extract-ir` with `--target-uncompressed-size`

Task: Verify the i subcommand loads with the optional --target-uncompressed-size
argument.

Command:

$ ./sbin/decompress.sh i 0 --target-uncompressed-size 134217728 --orig-file-id test-id

Output:

2026-03-03T19:25:25.875 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Scenario 7: `--orig-file-id` and `--orig-file-path` are mutually exclusive

Task: Verify that providing both --orig-file-id and --orig-file-path is rejected
by argparse.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-id test-id --orig-file-path /some/path

Output:

usage: decompress.py i [-h]
                       [--target-uncompressed-size TARGET_UNCOMPRESSED_SIZE]
                       (--orig-file-id ORIG_FILE_ID | --orig-file-path ORIG_FILE_PATH)
                       msg_ix
decompress.py i: error: argument --orig-file-path: not allowed with argument --orig-file-id

Explanation: argparse correctly enforces the mutually exclusive group defined in
decompress.py.
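
For readers unfamiliar with this argparse feature, the sketch below builds a parser whose options mirror the usage text above. It is a reconstruction from that usage output, not the parser in `decompress.py`; in particular, the `int` types and the `prog` string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="decompress.py i")
    # Assumed types; the usage text does not show them.
    parser.add_argument("msg_ix", type=int)
    parser.add_argument("--target-uncompressed-size", type=int)

    # required=True matches the parenthesized "(A | B)" form in the usage text:
    # exactly one of the two options must be supplied.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--orig-file-id")
    group.add_argument("--orig-file-path")
    return parser


def parses_ok(argv: list[str]) -> bool:
    """Return True if argparse accepts argv, False if it exits with an error."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True


print(parses_ok(["0", "--orig-file-id", "test-id"]))
print(parses_ok(["0", "--orig-file-id", "test-id", "--orig-file-path", "/some/path"]))
```

The first call is accepted; the second is rejected because both members of the group were given, and argparse prints its usage message to stderr before exiting.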

Scenario 8: `extract-file` (`x`) loads without ImportError

Task: Verify the x subcommand is not broken by the import issue.

Command:

$ ./sbin/decompress.sh x /home/junhao/samples/postgresql.jsonl

Output:

2026-03-03T19:25:34.574 ERROR [decompress] File extraction is not supported for archive storage type `fs` with storage engine `clp-s`.

Explanation: The error is expected for clp-s data. The module loaded successfully
without ImportError.

Part B: clp-TEXT (clp engine)

Switched to the clp (text) storage engine to test full end-to-end decompression,
including extract-file and extract-ir which are only supported by the clp engine.

Setup (clp-TEXT)

Task: Stop CLP-S, clean data, start with text config, and compress hive-24hr text
logs.

Commands:

$ ./sbin/stop-clp.sh
$ rm -rf var/data var/log
$ cp etc/clp-config.template.text.yaml etc/clp-config-text.yaml
$ ./sbin/start-clp.sh -c etc/clp-config-text.yaml
$ ./sbin/compress.sh -c etc/clp-config-text.yaml ~/samples/hive-24hr

Output:

2026-03-03T19:29:07.345 INFO [controller] Setting up environment for bundling database...
...
2026-03-03T19:29:20.537 INFO [controller] Started CLP.

2026-03-03T19:29:26.708 INFO [compress] Compression job 1 submitted.
2026-03-03T19:29:31.718 INFO [compress] Compression finished.
2026-03-03T19:29:31.718 INFO [compress] Compressed 1.99GB into 44.17MB (46.10x). Speed: 648.14MB/s.

File info (from DB):

orig_file_id: 47280a95-b8d5-4adb-a878-2463da3a94c0
path: /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Scenario 9: `extract-file` by path (clp-TEXT)

Task: Verify file extraction works with the clp engine.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml x /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Output: Command completed successfully (exit code 0, no output).

Scenario 10: `extract-file` with `--extraction-dir` (clp-TEXT)

Task: Verify file extraction to a custom directory.

Command:

$ mkdir -p /tmp/clp-extract
$ ./sbin/decompress.sh -c etc/clp-config-text.yaml x --extraction-dir /tmp/clp-extract /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Verification:

$ find /tmp/clp-extract -type f
/tmp/clp-extract/home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Explanation: File was extracted to the specified directory with the original path
structure preserved.
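
The layout observed above (extraction directory followed by the full original path) can be expressed as a small path computation. This sketch only illustrates the resulting layout; `extraction_target` is a hypothetical helper and says nothing about how CLP computes the path internally.

```python
from pathlib import PurePosixPath


def extraction_target(extraction_dir: str, orig_path: str) -> PurePosixPath:
    """Nest the original path under the extraction directory."""
    orig = PurePosixPath(orig_path)
    if orig.is_absolute():
        # Drop the leading "/" so the path can be appended to extraction_dir.
        orig = orig.relative_to(orig.anchor)
    return PurePosixPath(extraction_dir) / orig


print(
    extraction_target(
        "/tmp/clp-extract",
        "/home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/"
        "container_1427088391284_0024_01_000044/syslog",
    )
)
```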

Scenario 11: `extract-ir` by `--orig-file-id` (clp-TEXT)

Task: Verify IR extraction works end-to-end with a real file ID.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0

Output:

2026-03-03T19:30:15.649 INFO [decompress] Finished extraction job 1.

Scenario 12: `extract-ir` by `--orig-file-path` (clp-TEXT)

Task: Verify IR extraction works with a file path lookup.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-path /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Output:

2026-03-03T19:30:21.682 INFO [decompress] Finished extraction job 2.

Scenario 13: `extract-ir` with `--target-uncompressed-size` (clp-TEXT)

Task: Verify IR extraction with custom target uncompressed size.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --target-uncompressed-size 134217728 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0

Output:

2026-03-03T19:30:29.178 INFO [decompress] Finished extraction job 3.

Scenario 14: `--orig-file-id` and `--orig-file-path` mutually exclusive (clp-TEXT)

Task: Confirm mutual exclusivity is enforced under the clp engine as well.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0 --orig-file-path /some/path

Output:

usage: decompress.py i [-h]
                       [--target-uncompressed-size TARGET_UNCOMPRESSED_SIZE]
                       (--orig-file-id ORIG_FILE_ID | --orig-file-path ORIG_FILE_PATH)
                       msg_ix
decompress.py i: error: argument --orig-file-path: not allowed with argument --orig-file-id

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved default dataset handling when no dataset is specified
Chores
- Enhanced internal validation system to support multiple datasets more efficiently

coderabbitai · 2026-03-03T19:40:56Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c48968d and 4ff5e9a.

📒 Files selected for processing (2)

components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

Walkthrough

The changes replace a single-dataset validation function with a multi-dataset variant across two files. The function rename from validate_dataset_exists to validate_datasets_exist is applied, with call-sites wrapping single datasets in lists. Additionally, default dataset handling is introduced in the JSON extraction path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Function API Update `components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py` | Import and usage of validation function updated from `validate_dataset_exists` to `validate_datasets_exist`, with single dataset wrapped in list at call-site. |
| Configuration and Validation Updates `components/clp-package-utils/clp_package_utils/scripts/native/decompress.py` | Added `CLP_DEFAULT_DATASET_NAME` constant import; updated validation function from `validate_dataset_exists` to `validate_datasets_exist`; dataset now defaults to `CLP_DEFAULT_DATASET_NAME` when not provided in JSON extraction path. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

bug(decompress): All decompress subcommands fail with ImportError due to renamed function #2059: The function rename from validate_dataset_exists to validate_datasets_exist directly addresses the ImportError and API change described in this issue.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: updating stale validate_dataset_exists references and adding a default dataset fallback in sbin scripts. |

fix(decompress): Update stale validate_dataset_exists references an…

4ff5e9a

…d add default dataset fallback (fixes y-scope#2059).

junhaoliao requested a review from a team as a code owner March 3, 2026 19:40

junhaoliao requested a review from sitaowang1998 March 3, 2026 19:40

junhaoliao changed the title ~~fix(decompress): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059).~~ fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059). Mar 3, 2026

junhaoliao changed the title ~~fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059).~~ fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback in sbin scripts (fixes #2059). Mar 3, 2026

junhaoliao requested a review from hoophalab March 3, 2026 19:49

sitaowang1998 approved these changes Mar 3, 2026

junhaoliao merged commit 35b5ef1 into y-scope:main Mar 3, 2026
25 of 29 checks passed

junhaoliao added this to the February 2026 milestone Mar 7, 2026

junhaoliao deleted the fix-decompression branch May 7, 2026 19:46

junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026

fix(clp-package): Update stale validate_dataset_exists references a…

185e835

…nd add default dataset fallback in `sbin` scripts (fixes y-scope#2059). (y-scope#2060)
