
fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback in sbin scripts (fixes #2059). #2060

Merged
junhaoliao merged 1 commit into y-scope:main from junhaoliao:fix-decompression
Mar 3, 2026

Conversation

@junhaoliao junhaoliao commented Mar 3, 2026

Member

Description

#1992 renamed validate_dataset_exists to validate_datasets_exist (singular to plural)
in native/utils.py and updated its signature to accept list[str], but two callers were
not updated:

  • native/decompress.py — import (line 39) and call site (line 151)
  • native/archive_manager.py — import (line 31) and call site (line 203)

Since these are module-level imports, the ImportError prevents the entire decompress
module from loading, making all decompress.sh subcommands (x, i, j)
non-functional.
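
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual CLP code: `fake_native_utils` is a hypothetical stand-in for native/utils.py after the rename, and the placeholder function body is an assumption. It shows that importing a name that no longer exists fails at import time, before any subcommand logic can run.

```python
import sys
import types

# Hypothetical stand-in for native/utils.py after the rename: only the
# plural function exists. The module name is made up for this sketch.
fake_utils = types.ModuleType("fake_native_utils")


def validate_datasets_exist(datasets: list[str]) -> None:
    """Placeholder; the real function's body is not shown in this PR."""


fake_utils.validate_datasets_exist = validate_datasets_exist
sys.modules["fake_native_utils"] = fake_utils


def try_stale_import() -> str:
    """Attempt the pre-fix import and report what happens."""
    try:
        # Stale (singular) name, as the two callers still used it.
        from fake_native_utils import validate_dataset_exists  # noqa: F401
    except ImportError as e:
        return f"ImportError: {e}"
    return "imported"


print(try_stale_import())
```

Because the real import sits at module level, the same exception would be raised as soon as the script is loaded, regardless of which subcommand was requested.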

This PR:

  1. Updates both files to import and call validate_datasets_exist, wrapping the single
    dataset string in a list to match the new signature.
  2. Adds a default dataset fallback in native/decompress.py for the j (extract-json)
    subcommand, matching the pattern used by every other script (compress.py,
    decompress.py (non-native), search.py, archive_manager.py):
    dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
    Previously, native/decompress.py errored when --dataset was omitted, unlike all
    other scripts which fall back to the "default" dataset.
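
The two changes can be sketched together as follows. This is an illustration under stated assumptions, not the code from the PR: the value of `CLP_DEFAULT_DATASET_NAME` is assumed to be "default" (based on the wording above), `resolve_dataset` is a hypothetical helper, and the stand-in `validate_datasets_exist` takes a `known` set instead of querying a database, so its signature differs from the real one.

```python
from typing import Optional

# Assumed value: the PR text says scripts fall back to the "default" dataset.
CLP_DEFAULT_DATASET_NAME = "default"


def validate_datasets_exist(datasets: list[str], known: set[str]) -> None:
    """Hypothetical stand-in for the renamed validator (real signature differs)."""
    missing = [d for d in datasets if d not in known]
    if missing:
        raise ValueError(f"Datasets do not exist: {missing}")


def resolve_dataset(dataset: Optional[str], known: set[str]) -> str:
    """Hypothetical helper combining the fallback and the validation call."""
    # Same fallback expression as quoted in the PR description.
    dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
    # Wrap the single name in a list to match the list[str] parameter.
    validate_datasets_exist([dataset], known)
    return dataset
```

For example, `resolve_dataset(None, {"default", "myds"})` returns `"default"`, while `resolve_dataset("missing", {"default"})` raises `ValueError`.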

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Tested on the built package (task to build, then sbin/start-clp.sh) with both
clp-JSON (clp-s) and clp-TEXT (clp) storage engines.

Part A: clp-JSON (clp-s engine)

Setup

Task: Build package, start CLP, and compress test data into both the default and a
named dataset.

Commands:

$ cd build/clp-package
$ ./sbin/start-clp.sh
$ ./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl
$ ./sbin/compress.sh --timestamp-key timestamp --dataset myds ~/samples/postgresql.jsonl

Output:

2026-03-03T19:24:03.148 INFO [controller] Setting up environment for bundling database...
...
2026-03-03T19:24:16.496 INFO [controller] Started CLP.

2026-03-03T19:24:21.503 INFO [compress] Compression job 1 submitted.
2026-03-03T19:24:24.007 INFO [compress] Compression finished.
2026-03-03T19:24:24.007 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 177.04MB/s.

2026-03-03T19:24:25.143 INFO [compress] Compression job 2 submitted.
2026-03-03T19:24:27.649 INFO [compress] Compression finished.
2026-03-03T19:24:27.649 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 182.56MB/s.

Archive IDs:

$ ls var/data/archives/default/
b39f5ee0-b0ae-4a39-be88-8651c3e0df19

$ ls var/data/archives/myds/
f0ac92ae-eeec-4245-b742-54b013933574

Scenario 1: extract-json without --dataset (default dataset fallback)

Task: Verify that j without --dataset falls back to the "default" dataset
instead of erroring with "Dataset unspecified".

Command:

$ ./sbin/decompress.sh j b39f5ee0-b0ae-4a39-be88-8651c3e0df19

Output:

2026-03-03T19:24:38.532 INFO [decompress] Finished extraction job 1.

Explanation: Before this fix, this command would fail with an ImportError. Even after
fixing the import, the old code would reject this call with "Dataset unspecified, but must
be specified for command j". Now it correctly falls back to the "default" dataset,
matching the behavior of all other scripts.

Scenario 2: extract-json with --dataset

Task: Verify j with an explicit dataset filter works.

Command:

$ ./sbin/decompress.sh j f0ac92ae-eeec-4245-b742-54b013933574 --dataset myds

Output:

2026-03-03T19:24:41.830 INFO [decompress] Finished extraction job 2.

Scenario 3: extract-json with --target-chunk-size

Task: Verify j with a custom chunk size works.

Command:

$ ./sbin/decompress.sh j f0ac92ae-eeec-4245-b742-54b013933574 --dataset myds --target-chunk-size 1048576

Output:

2026-03-03T19:24:44.801 INFO [decompress] Finished extraction job 3.

Scenario 4: extract-ir by --orig-file-id

Task: Verify the i subcommand loads without ImportError when using
--orig-file-id.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-id test-id

Output:

2026-03-03T19:25:15.524 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Explanation: The error is expected — IR extraction is only supported for the clp
storage engine, and the test data was compressed with clp-s. The important thing is that
the module loaded successfully (no ImportError).

Scenario 5: extract-ir by --orig-file-path

Task: Verify the i subcommand loads without ImportError when using
--orig-file-path.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-path /some/path

Output:

2026-03-03T19:25:21.271 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Scenario 6: extract-ir with --target-uncompressed-size

Task: Verify the i subcommand loads with the optional --target-uncompressed-size
argument.

Command:

$ ./sbin/decompress.sh i 0 --target-uncompressed-size 134217728 --orig-file-id test-id

Output:

2026-03-03T19:25:25.875 ERROR [decompress] IR extraction is not supported for storage engine `clp-s`.

Scenario 7: --orig-file-id and --orig-file-path are mutually exclusive

Task: Verify that providing both --orig-file-id and --orig-file-path is rejected
by argparse.

Command:

$ ./sbin/decompress.sh i 0 --orig-file-id test-id --orig-file-path /some/path

Output:

usage: decompress.py i [-h]
                       [--target-uncompressed-size TARGET_UNCOMPRESSED_SIZE]
                       (--orig-file-id ORIG_FILE_ID | --orig-file-path ORIG_FILE_PATH)
                       msg_ix
decompress.py i: error: argument --orig-file-path: not allowed with argument --orig-file-id

Explanation: argparse correctly enforces the mutually exclusive group defined in
decompress.py.
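
For readers unfamiliar with this argparse feature, here is a minimal sketch of a parser that would produce a usage line of the same shape. It is reconstructed from the usage output above, not copied from decompress.py; the `int` argument types and the `parses_ok` helper are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="decompress.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    ir_parser = subparsers.add_parser("i")
    ir_parser.add_argument("msg_ix", type=int)  # type is an assumption
    ir_parser.add_argument("--target-uncompressed-size", type=int)

    # required=True renders the group in parentheses in the usage line:
    # exactly one of the two options must be given.
    group = ir_parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--orig-file-id")
    group.add_argument("--orig-file-path")
    return parser


def parses_ok(argv: list[str]) -> bool:
    """Return False when argparse rejects argv (it exits via SystemExit)."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return False
    return True
```

With this sketch, `parses_ok(["i", "0", "--orig-file-id", "test-id"])` is true, while passing both `--orig-file-id` and `--orig-file-path` is rejected, as is passing neither.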

Scenario 8: extract-file (x) loads without ImportError

Task: Verify the x subcommand is not broken by the import issue.

Command:

$ ./sbin/decompress.sh x /home/junhao/samples/postgresql.jsonl

Output:

2026-03-03T19:25:34.574 ERROR [decompress] File extraction is not supported for archive storage type `fs` with storage engine `clp-s`.

Explanation: The error is expected for clp-s data. The module loaded successfully
without ImportError.

Part B: clp-TEXT (clp engine)

Switched to the clp (text) storage engine to test full end-to-end decompression,
including extract-file and extract-ir which are only supported by the clp engine.

Setup (clp-TEXT)

Task: Stop the CLP instance running the clp-s engine, clean its data, start CLP with
the text config, and compress the hive-24hr text logs.

Commands:

$ ./sbin/stop-clp.sh
$ rm -rf var/data var/log
$ cp etc/clp-config.template.text.yaml etc/clp-config-text.yaml
$ ./sbin/start-clp.sh -c etc/clp-config-text.yaml
$ ./sbin/compress.sh -c etc/clp-config-text.yaml ~/samples/hive-24hr

Output:

2026-03-03T19:29:07.345 INFO [controller] Setting up environment for bundling database...
...
2026-03-03T19:29:20.537 INFO [controller] Started CLP.

2026-03-03T19:29:26.708 INFO [compress] Compression job 1 submitted.
2026-03-03T19:29:31.718 INFO [compress] Compression finished.
2026-03-03T19:29:31.718 INFO [compress] Compressed 1.99GB into 44.17MB (46.10x). Speed: 648.14MB/s.

File info (from DB):

orig_file_id: 47280a95-b8d5-4adb-a878-2463da3a94c0
path: /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Scenario 9: extract-file by path (clp-TEXT)

Task: Verify file extraction works with the clp engine.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml x /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Output: Command completed successfully (exit code 0, no output).

Scenario 10: extract-file with --extraction-dir (clp-TEXT)

Task: Verify file extraction to a custom directory.

Command:

$ mkdir -p /tmp/clp-extract
$ ./sbin/decompress.sh -c etc/clp-config-text.yaml x --extraction-dir /tmp/clp-extract /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Verification:

$ find /tmp/clp-extract -type f
/tmp/clp-extract/home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Explanation: File was extracted to the specified directory with the original path
structure preserved.
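
The observed layout is what you get by re-rooting the absolute original path under the extraction directory. The sketch below only illustrates that mapping with pathlib; `extracted_path` is a hypothetical helper, and how CLP computes the destination internally is not shown in this PR.

```python
from pathlib import PurePosixPath


def extracted_path(extraction_dir: str, orig_path: str) -> PurePosixPath:
    """Re-root an original file path under the extraction directory."""
    orig = PurePosixPath(orig_path)
    # Drop the leading "/" so that joining does not discard extraction_dir.
    relative = orig.relative_to(orig.anchor) if orig.is_absolute() else orig
    return PurePosixPath(extraction_dir) / relative


print(extracted_path("/tmp/clp-extract", "/home/junhao/samples/postgresql.jsonl"))
```

Stripping the root first matters because joining a directory with an absolute path in pathlib would otherwise return the absolute path unchanged.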

Scenario 11: extract-ir by --orig-file-id (clp-TEXT)

Task: Verify IR extraction works end-to-end with a real file ID.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0

Output:

2026-03-03T19:30:15.649 INFO [decompress] Finished extraction job 1.

Scenario 12: extract-ir by --orig-file-path (clp-TEXT)

Task: Verify IR extraction works with a file path lookup.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-path /home/junhao/samples/hive-24hr/i-53ca095c/application_1427088391284_0024/container_1427088391284_0024_01_000044/syslog

Output:

2026-03-03T19:30:21.682 INFO [decompress] Finished extraction job 2.

Scenario 13: extract-ir with --target-uncompressed-size (clp-TEXT)

Task: Verify IR extraction with custom target uncompressed size.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --target-uncompressed-size 134217728 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0

Output:

2026-03-03T19:30:29.178 INFO [decompress] Finished extraction job 3.

Scenario 14: --orig-file-id and --orig-file-path mutually exclusive (clp-TEXT)

Task: Confirm mutual exclusivity is enforced under the clp engine as well.

Command:

$ ./sbin/decompress.sh -c etc/clp-config-text.yaml i 0 --orig-file-id 47280a95-b8d5-4adb-a878-2463da3a94c0 --orig-file-path /some/path

Output:

usage: decompress.py i [-h]
                       [--target-uncompressed-size TARGET_UNCOMPRESSED_SIZE]
                       (--orig-file-id ORIG_FILE_ID | --orig-file-path ORIG_FILE_PATH)
                       msg_ix
decompress.py i: error: argument --orig-file-path: not allowed with argument --orig-file-id

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved default dataset handling when no dataset is specified
  • Chores
    • Enhanced internal validation system to support multiple datasets more efficiently

@junhaoliao junhaoliao requested a review from a team as a code owner March 3, 2026 19:40
@junhaoliao junhaoliao requested a review from sitaowang1998 March 3, 2026 19:40
coderabbitai Bot commented Mar 3, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c48968d and 4ff5e9a.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py

Walkthrough

The changes replace a single-dataset validation function with a multi-dataset variant across two files. The function rename from validate_dataset_exists to validate_datasets_exist is applied, with call-sites wrapping single datasets in lists. Additionally, default dataset handling is introduced in the JSON extraction path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Function API Update: components/clp-package-utils/clp_package_utils/scripts/native/archive_manager.py | Import and usage of validation function updated from validate_dataset_exists to validate_datasets_exist, with single dataset wrapped in list at call-site. |
| Configuration and Validation Updates: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py | Added CLP_DEFAULT_DATASET_NAME constant import; updated validation function from validate_dataset_exists to validate_datasets_exist; dataset now defaults to CLP_DEFAULT_DATASET_NAME when not provided in JSON extraction path. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: updating stale validate_dataset_exists references and adding a default dataset fallback in sbin scripts. |


@junhaoliao junhaoliao changed the title from "fix(decompress): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059)." to "fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059)." Mar 3, 2026
@junhaoliao junhaoliao changed the title from "fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback (fixes #2059)." to "fix(clp-package): Update stale validate_dataset_exists references and add default dataset fallback in sbin scripts (fixes #2059)." Mar 3, 2026
@junhaoliao junhaoliao requested a review from hoophalab March 3, 2026 19:49
@junhaoliao junhaoliao merged commit 35b5ef1 into y-scope:main Mar 3, 2026
25 of 29 checks passed
@junhaoliao junhaoliao added this to the February 2026 milestone Mar 7, 2026
@junhaoliao junhaoliao deleted the fix-decompression branch May 7, 2026 19:46
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…nd add default dataset fallback in `sbin` scripts (fixes y-scope#2059). (y-scope#2060)
