fix(package): Update `native/decompress.py` to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200). by junhaoliao · Pull Request #1202 · y-scope/clp

junhaoliao · 2025-08-14T05:50:37Z

Description

This PR addresses two issues:

Fixes native/decompress.py still uses deprecated --db-config-file argument instead of command-line database parameters. #1199: The native/decompress.py script was still using the deprecated --db-config-file argument instead of command-line database parameters. This caused runtime errors since the core CLP components no longer accept the --db-config-file argument after the refactoring in feat(core)!: Replace YAML config with CLI args and env vars for metadata DB (resolves #1146). #1148. This PR updated native/decompress.py to pass database connection parameters as command-line arguments and environment variables instead of using a temporary YAML config file.
Fixes IR extraction parser missing --dataset argument causes AttributeError #1200: The IR extraction parser was missing the --dataset argument, causing an AttributeError when trying to access parsed_args.dataset in the handle_extract_stream_cmd function (i.e., sbin/decompress.sh i). The fix uses getattr with a fallback to None when the dataset attribute is not available.
Check if engine type before submitting an IR extraction job. If type is clp-s, error out and do not submit.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Built the CLP package with task package

For CLP-Text

cd build/clp-package/sbin
./compress.sh ~/samples/hive-24hr
./decompress.sh x
./decompress.sh i --orig-file-id <id> 0
(-ve): ./decompress.sh j a8722927-820e-4d04-9b0e-bfb739fb91c9 -> 2025-08-14T15:17:20.637 ERROR [decompress] JSON extraction is not supported for storage engine clp.

For CLP-JSON

Changed clp-config.yml to start the package as CLP-JSON

package:
  storage_engine: "clp-s"
  query_engine: "clp-s"

./compress.sh ~/samples/postgresql/ --timestamp-key=timestamp
(-ve): ./decompress.sh x -> 2025-08-14T15:21:51.715 ERROR [decompress] File extraction is not supported for archive storage type fswith storage engineclp-s.
(-ve): ./decompress.sh i --orig-file-id <id> 0 -> 2025-08-14T15:24:21.453 ERROR [decompress] IR extraction is not supported for storage engine clp-s.
./decompress.sh j <id>

Summary by CodeRabbit

New Features
- Configure DB connection via CLI flags rather than a temporary config file.
- Pass DB credentials to subprocesses through environment variables (CLP_DB_USER, CLP_DB_PASS).
Bug Fixes
- Enforce storage-engine compatibility: IR extraction limited to the CLP engine with clear errors for invalid combos.
- JSON extraction now defaults and validates the dataset before running; dataset argument is restricted for IR extraction.

…ars for DB config (fixes y-scope#1199); Fallback to `dataset=None` when IR extraction CLI parser is used (fixes y-scope#1200).

coderabbitai · 2025-08-14T05:50:44Z

Walkthrough

Enforces storage-engine compatibility and stricter dataset handling in CLP decompress; IR extraction only allowed for CLP storage engine and forbids dataset for IR. Native decompress now passes DB connection params via CLI flags and supplies DB credentials via the subprocess environment instead of writing a temporary YAML config.

Changes

Cohort / File(s)	Summary
CLP decompress control flow & dataset handling `components/clp-package-utils/clp_package_utils/scripts/decompress.py`	Add guard: IR extraction only supported when storage engine is `CLP` (otherwise log error and return -1). Remove prior IR-path/CLP_S dataset logic. For JSON extraction, derive `dataset` from `parsed_args` (default to `CLP_DEFAULT_DATASET_NAME`), validate against CLP `table_prefix` (on failure log error and return -1), and unconditionally append dataset to the command.
Native decompress — DB params via CLI and env `components/clp-package-utils/clp_package_utils/scripts/native/decompress.py`	Remove creation/cleanup of temporary YAML DB config and YAML import. Use `get_clp_connection_params_and_type(True)` to collect DB params and pass `--db-type`, `--db-host`, `--db-port`, `--db-name`, `--db-table-prefix` to the CLP CLI. Inject `CLP_DB_USER`/`CLP_DB_PASS` into subprocess environment (`env=...`) and pass `env` to `subprocess.Popen`; add `os` import.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Decompress as decompress.py
  participant CLPConfig
  User->>Decompress: run decompress (parsed_args, storage_engine)
  Decompress->>Decompress: if storage_engine == IR -> require StorageEngine.CLP
  alt IR & not CLP
    Decompress-->>User: log error + return -1
  else IR & CLP
    Decompress->>Decompress: ensure no dataset present; else error
    Decompress->>CLPConfig: build IR extract command
    Decompress->>User: run command / return status
  end
  alt JSON
    Decompress->>Decompress: dataset = parsed_args.dataset or CLP_DEFAULT_DATASET_NAME
    Decompress->>CLPConfig: validate dataset matches table_prefix
    alt validation fail
      Decompress-->>User: log error + return -1
    else
      Decompress->>Decompress: append dataset to extract command
      Decompress->>User: run command / return status
    end
  end

sequenceDiagram
  participant User
  participant Native as native/decompress.py
  participant CLPTool
  User->>Native: run native decompress
  Native->>Native: clp_db_connection_params = get_clp_connection_params_and_type(True)
  Native->>Native: build CLI args (--db-type/--db-host/--db-port/--db-name/--db-table-prefix)
  Native->>Native: extract_env = os.environ + {CLP_DB_USER, CLP_DB_PASS}
  Native->>CLPTool: invoke CLP CLI with args and env
  CLPTool-->>Native: exit status
  Native-->>User: return result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

native/decompress.py still uses deprecated --db-config-file argument instead of command-line database parameters. #1199 — Replaces --db-config-file/temp YAML usage with CLI DB flags and env-based credentials as described in this PR.
IR extraction parser missing --dataset argument causes AttributeError #1200 — Addresses dataset handling differences between IR and JSON extraction and prevents accessing parsed_args.dataset for IR extraction.

Possibly related PRs

feat(job-orchestration): Add dataset fields to input configs of compression and search jobs. #839 — Introduced CLP_DEFAULT_DATASET_NAME and dataset fields; changes here default and validate dataset against that constant.

Suggested reviewers

kirkrodrigues

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (2)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

188-200: Fix: Safe dataset access prevents AttributeError on IR parser — LGTM; consider scoping validation to JSON only

Using getattr parsed_args, "dataset", None resolves the AttributeError when the IR parser lacks --dataset. To avoid unnecessary DB calls during IR extraction, you can scope dataset defaulting/validation to the JSON command only.

-    dataset = getattr(parsed_args, "dataset", None)
-    if StorageEngine.CLP_S == storage_engine:
-        dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
-        try:
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
-        except Exception as e:
-            logger.error(e)
-            return -1
-    elif dataset is not None:
-        logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
-        return -1
+    dataset = getattr(parsed_args, "dataset", None)
+    # Only resolve/validate dataset for JSON extraction to avoid extra DB calls on IR paths
+    if EXTRACT_JSON_CMD == job_command:
+        if StorageEngine.CLP_S == storage_engine:
+            dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
+            try:
+                clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+                validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+            except Exception as e:
+                logger.error(e)
+                return -1
+        elif dataset is not None:
+            logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
+            return -1

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

265-269: Unify subprocess handling with subprocess.run(check=True) for clearer error paths

Using subprocess.run with check=True simplifies error handling and aligns with the wrapper script.

-    proc = subprocess.Popen(extract_cmd, env=extract_env)
-    return_code = proc.wait()
-    if 0 != return_code:
-        logger.error(f"File extraction failed, return_code={return_code}")
-        return return_code
+    try:
+        subprocess.run(extract_cmd, env=extract_env, check=True)
+    except subprocess.CalledProcessError as e:
+        logger.error(f"File extraction failed, return_code={e.returncode}")
+        return e.returncode

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fe0224 and 2320d64.

📒 Files selected for processing (2)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

📚 Learning: 2025-08-14T05:06:58.977Z

Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2025-08-13T14:48:49.020Z

Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py

🧬 Code Graph Analysis (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

get_clp_connection_params_and_type (209-230)

🔇 Additional comments (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

245-250: Confirmed: CLP reads CLP_DB_USER and CLP_DB_PASS

Repo search shows these env vars are used across docs, tests and runtime code — no change needed.

components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") / getenv("CLP_DB_PASS") and validation (lines ~122–129, ~176–183)

components/core/tools/scripts/db/init-db.py — reads os.environ["CLP_DB_USER"] / ["CLP_DB_PASS"] (lines 35–36)

components/webui/server/src/plugins/external/mysql.ts — uses fastify.config.CLP_DB_USER / CLP_DB_PASS (lines 22–26)

components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets CLP_DB_USER / CLP_DB_PASS (lines 962–963)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets extract_env with CLP_DB_USER / CLP_DB_PASS (lines 245–248)

docs/src/user-guide/core-unstructured/clp.md — documents CLP_DB_USER and CLP_DB_PASS (lines 131–132)

components/core/tests/test-GlobalMetadataDBConfig.cpp — tests set/unset these env vars

Approved — credentials via env are correct and match CLP expectations.

coderabbitai · 2025-08-14T05:58:35Z

+    # Configure CLP metadata DB connection params.
+    clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
    # fmt: off
    extract_cmd = [
        str(clp_home / "bin" / "clp"),
        "x", str(archives_dir), str(extraction_dir),
-        "--db-config-file", str(db_config_file_path),
+        "--db-type", "mysql",
+        "--db-host", clp_db_connection_params["host"],
+        "--db-port", str(clp_db_connection_params["port"]),
+        "--db-name", clp_db_connection_params["name"],
+        "--db-table-prefix", clp_db_connection_params["table_prefix"],
    ]


💡 Verification agent

🧩 Analysis chain

Switch to CLI DB params: verify flag support and consider passing optional SSL cert

The migration away from --db-config-file looks correct. Two follow-ups:

Verify the clp binary accepts these flags.

If SSL is configured, consider passing it via a dedicated flag (e.g., --db-ssl-cert) if supported.

Proposed enhancement (guarded by CLI support):

- "--db-type", "mysql", + "--db-type", clp_db_connection_params["type"], "--db-host", clp_db_connection_params["host"], "--db-port", str(clp_db_connection_params["port"]), "--db-name", clp_db_connection_params["name"], "--db-table-prefix", clp_db_connection_params["table_prefix"], + ] + ( + ["--db-ssl-cert", clp_db_connection_params["ssl_cert"]] + if "ssl_cert" in clp_db_connection_params + else [] )

Run this script to verify flag names in the repo:

🏁 Script executed:

#!/bin/bash # Verify clp binary (core) recognizes new DB flags rg -n --ignore-case -- 'db-type|db-host|db-port|db-name|db-table-prefix|db-ssl-cert' -A 3 -B 3

Length of output: 266

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "==== Show snippet from decompress.py ====" sed -n '1,400p' components/clp-package-utils/clp_package_utils/scripts/native/decompress.py || true echo echo "==== Search for get_clp_connection_params_and_type ====" rg -n --hidden --no-ignore 'get_clp_connection_params_and_type' || true echo echo "==== Search for ssl-related keys/flags in repo ====" rg -n --hidden --no-ignore 'ssl_cert|ssl-cert|ssl_cert_path|ssl_cert_file|--db-ssl-cert' || true echo echo "==== Search for db-config-file and db- flags in repo ====" rg -n --hidden --no-ignore '--db-config-file|--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix' || true echo echo "==== Search for other clp invocations or docs containing db flags ====" rg -n --hidden --no-ignore '\bclp\b' -S || true

Length of output: 269325

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Show clp CommandLineArguments.cpp (first 400 lines) ===" sed -n '1,400p' components/core/src/clp/clp/CommandLineArguments.cpp || true echo echo "=== Search for DB-related CLI flags in core sources ===" rg -n --hidden --no-ignore --no-heading --column '(--db-|db-|db_)' components/core/src || true echo echo "=== Show CommandLineArguments.hpp ===" sed -n '1,300p' components/core/src/clp/clp/CommandLineArguments.hpp || true echo echo "=== Search for db-ssl or ssl_cert in C++ sources ===" rg -n --hidden --no-ignore --no-heading --column 'db-ssl|ssl_cert|ssl-cert|ssl' components/core/src || true

Length of output: 70284

Switch to CLI DB params: clp supports --db-*; handle sqlite and SSL

Quick summary:

I verified the CLP core exposes the CLI flags --db-type, --db-host, --db-port, --db-name and --db-table-prefix (see components/core/src/clp/GlobalMetadataDBConfig.cpp).

I found no evidence of a --db-ssl-cert CLI flag in the repo (no matches for db-ssl / db-ssl-cert / ssl_cert in core sources). The Python config code (components/clp-py-utils/clp_py_utils/clp_config.py) can return ssl_cert, though.

What to change

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (around the snippet you flagged):

Replace the hard-coded "--db-type", "mysql" with the actual type from clp_db_connection_params["type"].

Only append the MySQL-style flags (host/port/name/table-prefix) when the DB type is MySQL/MariaDB — do not pass them for SQLite (CLP will use the local metadata DB file).

If clp_db_connection_params contains an SSL cert, do NOT append a non-existent --db-ssl-cert flag (core doesn't support it). Instead either:

write a temporary DB config file including the ssl_cert and pass it via --db-config-file (only when SSL is present), or

update the CLP core to accept a dedicated --db-ssl-cert flag and then pass it from here.

Suggested code change (concise diff):

- # fmt: off - extract_cmd = [ - str(clp_home / "bin" / "clp"), - "x", str(archives_dir), str(extraction_dir), - "--db-type", "mysql", - "--db-host", clp_db_connection_params["host"], - "--db-port", str(clp_db_connection_params["port"]), - "--db-name", clp_db_connection_params["name"], - "--db-table-prefix", clp_db_connection_params["table_prefix"], - ] - # fmt: on + # fmt: off + extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)] + db_type = clp_db_connection_params["type"] + if db_type.lower() in ("mysql", "mariadb"): + extract_cmd += [ + "--db-type", db_type, + "--db-host", clp_db_connection_params["host"], + "--db-port", str(clp_db_connection_params["port"]), + "--db-name", clp_db_connection_params["name"], + "--db-table-prefix", clp_db_connection_params["table_prefix"], + ] + # fmt: on

If SSL is configured, a minimal approach without changing core:

Serialize the DB connection (including ssl_cert) to a temporary config file and pass --db-config-file for this invocation, then delete the temp file after the process finishes.

Files worth reviewing / touching

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — update as above (lines ~232–243).

components/core/src/clp/GlobalMetadataDBConfig.cpp — confirms supported CLI flags and enforces that MySQL-specific options aren’t used with SQLite (consider adding db-ssl support here if desired).

components/clp-py-utils/clp_py_utils/clp_config.py — contains ssl_cert in returned params (used to decide whether to use the tmp config-file fallback).

Tag:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Configure CLP metadata DB connection params.

clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)

# fmt: off

extract_cmd = [

str(clp_home / "bin" / "clp"),

"x", str(archives_dir), str(extraction_dir),

"--db-config-file", str(db_config_file_path),

"--db-type", "mysql",

"--db-host", clp_db_connection_params["host"],

"--db-port", str(clp_db_connection_params["port"]),

"--db-name", clp_db_connection_params["name"],

"--db-table-prefix", clp_db_connection_params["table_prefix"],

]

# Configure CLP metadata DB connection params.

clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)

# fmt: off

extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]

db_type = clp_db_connection_params["type"]

if db_type.lower() in ("mysql", "mariadb"):

extract_cmd += [

"--db-type", db_type,

"--db-host", clp_db_connection_params["host"],

"--db-port", str(clp_db_connection_params["port"]),

"--db-name", clp_db_connection_params["name"],

"--db-table-prefix", clp_db_connection_params["table_prefix"],

]

# fmt: on

Bill-hbrhbr · 2025-08-14T10:32:45Z

Hi Junhao, I think we currently do not support --dataset with IR extraction. Let me know if I'm wrong. We can confirm with @kirkrodrigues
If we do support --dataset with IR extraction, I'll help review the details of this PR.

junhaoliao · 2025-08-14T10:54:42Z

Hi Junhao, I think we currently do not support --dataset with IR extraction. Let me know if I'm wrong. We can confirm with @kirkrodrigues If we do support --dataset with IR extraction, I'll help review the details of this PR.

Sorry for the confusion. The fix is not to make dataset work with CLP Text. Rather, the ‘parsed_args.dataset’ access yields an exception whenever the command ‘i’ is specified. Maybe I wasn’t using the command in the right way? Still if it throws an exception I think we should either catch-and-handle or avoid

junhaoliao · 2025-08-14T10:57:00Z

        return -1

-    dataset = parsed_args.dataset
+    dataset = getattr(parsed_args, "dataset", None)


Here ‘parsed_args’ may not always have an attribute named « dataset »

junhaoliao · 2025-08-14T15:31:27Z

++ @haiqi96 since the changes are package related

kirkrodrigues · 2025-08-14T16:18:07Z

        return -1

-    dataset = parsed_args.dataset
+    dataset = getattr(parsed_args, "dataset", None)


Perhaps to improve readability and reduce errors like this in the future, we can move any IR-specific stuff into the IR block on line 220. And similarly for the JSON block.

…e review Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

265-275: Ensure temp file cleanup on errors by using try/finally

If Popen raises (e.g., missing binary) or the process crashes, files_to_extract_list_path may leak. Use a finally block to guarantee cleanup.

Apply:

-    proc = subprocess.Popen(extract_cmd, env=extract_env)
-    return_code = proc.wait()
-    if 0 != return_code:
-        logger.error(f"File extraction failed, return_code={return_code}")
-        return return_code
-
-    # Remove generated files
-    if files_to_extract_list_path is not None:
-        files_to_extract_list_path.unlink()
-
-    return 0
+    try:
+        proc = subprocess.Popen(extract_cmd, env=extract_env)
+        return_code = proc.wait()
+        if return_code != 0:
+            logger.error(f"File extraction failed, return_code={return_code}")
+            return return_code
+        return 0
+    finally:
+        # Remove generated files
+        if files_to_extract_list_path is not None and files_to_extract_list_path.exists():
+            files_to_extract_list_path.unlink()

♻️ Duplicate comments (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

232-244: Guard DB flag construction by backend type; account for SSL cert presence

Good move switching to CLI flags and using the actual db type from config. Two follow-ups:

Only pass MySQL-specific flags when the backend is MySQL/MariaDB (future-proofs for other backends).
clp_config may include ssl_cert (see clp_py_utils.clp_config.get_clp_connection_params_and_type). Core doesn’t expose an SSL flag today, so at minimum warn when ssl_cert is configured but won’t be propagated.

Apply this diff to make flag construction backend-aware and warn on SSL cert presence:

@@
-    # Configure CLP metadata DB connection params.
-    clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-    # fmt: off
-    extract_cmd = [
-        str(clp_home / "bin" / "clp"),
-        "x", str(archives_dir), str(extraction_dir),
-        "--db-type", clp_db_connection_params["type"],
-        "--db-host", clp_db_connection_params["host"],
-        "--db-port", str(clp_db_connection_params["port"]),
-        "--db-name", clp_db_connection_params["name"],
-        "--db-table-prefix", clp_db_connection_params["table_prefix"],
-    ]
-    # fmt: on
+    # Configure CLP metadata DB connection params.
+    clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+    db_type = clp_db_connection_params["type"].lower()
+    if "ssl_cert" in clp_db_connection_params:
+        logger.warning(
+            "A DB SSL certificate is configured but is not currently propagated to clp via CLI/env."
+            " Ensure core supports SSL via CLI/env or handle it out-of-band."
+        )
+    # fmt: off
+    extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]
+    if db_type in ("mysql", "mariadb"):
+        extract_cmd += [
+            "--db-type", db_type,
+            "--db-host", clp_db_connection_params["host"],
+            "--db-port", str(clp_db_connection_params["port"]),
+            "--db-name", clp_db_connection_params["name"],
+            "--db-table-prefix", clp_db_connection_params["table_prefix"],
+        ]
+    # fmt: on

If you want us to confirm core flag support and whether any SSL-related flags exist, run:

#!/bin/bash
# Verify core supports these DB flags and lacks SSL flags
rg -n --hidden --no-ignore --no-heading -S -- '
(--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix|db-ssl|--db-ssl|ssl_cert|ssl-cert)
' components/core/src || true

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled
Linear integration is disabled

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2320d64 and b6bcc76.

📒 Files selected for processing (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

🧬 Code Graph Analysis (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

get_clp_connection_params_and_type (209-230)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (1)

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

4-4: LGTM: import usage justified

Importing os is necessary for constructing the child process environment.

coderabbitai · 2025-08-14T18:15:32Z

+    extract_env = {
+        **os.environ,
+        "CLP_DB_USER": clp_db_connection_params["username"],
+        "CLP_DB_PASS": clp_db_connection_params["password"],
+    }


💡 Verification agent

🧩 Analysis chain

Confirm env var names for credentials are honoured by clp core

You’re passing credentials via CLP_DB_USER/CLP_DB_PASS, which is sensible. Please verify clp core actually reads these names; otherwise, extraction will fail silently with auth errors.

Run this to confirm the env variable names are used anywhere in the repo:

Optional hardening: only set env overrides when values are non-empty.

- extract_env = { - **os.environ, - "CLP_DB_USER": clp_db_connection_params["username"], - "CLP_DB_PASS": clp_db_connection_params["password"], - } + extract_env = {**os.environ} + if clp_db_connection_params.get("username"): + extract_env["CLP_DB_USER"] = clp_db_connection_params["username"] + if clp_db_connection_params.get("password"): + extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]

🏁 Script executed:

#!/bin/bash # Search for the exact env var names in core and utils rg -n --hidden --no-ignore -S 'CLP_DB_USER|CLP_DB_PASS' || true # Also look for getenv calls that may read them implicitly rg -n --hidden --no-ignore -S 'getenv\$"CLP_DB_(USER|PASS)"\$' components/core/src || true

Length of output: 5718

Confirmed: core honours CLP_DB_USER / CLP_DB_PASS — optional hardening suggested

Confirmed — the repo reads CLP_DB_USER and CLP_DB_PASS (core uses getenv), so the env assignments in decompress.py are correct. I recommend optionally guarding against empty username/password to avoid overriding existing env vars with empty values.

Files to note:

components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") (line 122), getenv("CLP_DB_PASS") (line 128); throws if not set (lines 178/182).

components/core/tools/scripts/db/init-db.py — username/password via os.environ["CLP_DB_USER"/"CLP_DB_PASS"] (lines 35–36).

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets CLP_DB_USER/CLP_DB_PASS (lines 247–248).

components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets env (lines 962–963).

components/job-orchestration/job_orchestration/executor/compress/compression_task.py — sets CLP_DB_USER/CLP_DB_PASS (lines 214–215).

components/webui/server/src/plugins/external/mysql.ts — reads CLP_DB_PASS / CLP_DB_USER (lines 22, 25).

docs/src/user-guide/core-unstructured/clp.md — documents these env vars (lines 131–132).

components/core/tests/test-GlobalMetadataDBConfig.cpp — tests that set/unset these vars.

Suggested (optional) hardening — only set env overrides when values are non-empty:

- extract_env = { - **os.environ, - "CLP_DB_USER": clp_db_connection_params["username"], - "CLP_DB_PASS": clp_db_connection_params["password"], - } + extract_env = {**os.environ} + if clp_db_connection_params.get("username"): + extract_env["CLP_DB_USER"] = clp_db_connection_params["username"] + if clp_db_connection_params.get("password"): + extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

extract_env = {

**os.environ,

"CLP_DB_USER": clp_db_connection_params["username"],

"CLP_DB_PASS": clp_db_connection_params["password"],

}

extract_env = {**os.environ}

if clp_db_connection_params.get("username"):

extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]

if clp_db_connection_params.get("password"):

extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/scripts/native/decompress.py around lines 245 to 249, the code unconditionally writes CLP_DB_USER/CLP_DB_PASS into the environment which can override existing values with empty strings; change it to only set/override os.environ entries when the corresponding clp_db_connection_params["username"] or ["password"] is non-empty (e.g. check for None/empty after stripping) so existing env vars are preserved if the provided values are empty.

junhaoliao · 2025-08-14T19:28:08Z

        return -1

    job_command = parsed_args.command
+    if EXTRACT_IR_CMD == job_command and StorageEngine.CLP != storage_engine:


This is added to prevent fail after job submission

haiqi96 · 2025-08-14T19:28:35Z

    # fmt: on

    if EXTRACT_IR_CMD == job_command:
+        if hasattr(parsed_args, "dataset"):


Do we need to check this?
My understanding is that since we don;t explicilty add dataset when target is IR, the script will return an error if we try to specify --dataset on the cmdline if target is IR
but I could be wrong.

right no need to check this. the attr won't be added for command i anyways

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
231-235: Drop redundant None-check when appending --dataset

Given you default dataset to CLP_DEFAULT_DATASET_NAME, dataset is never None here. Simplify by appending unconditionally.
-        if dataset is not None:
-            extract_cmd.append("--dataset")
-            extract_cmd.append(dataset)
+        extract_cmd.append("--dataset")
+        extract_cmd.append(dataset)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled
Linear integration is disabled

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 156ab4c and 1dc2f2a.

📒 Files selected for processing (1)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

📚 Learning: 2025-08-14T05:06:58.977Z

Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py

📚 Learning: 2024-10-07T21:16:41.660Z

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py

🧬 Code Graph Analysis (1)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

StorageEngine (79-81)

get_clp_connection_params_and_type (209-230)

components/clp-package-utils/clp_package_utils/general.py (1)

validate_dataset_name (556-582)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (1)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

184-186: Good guard: prevent unsupported IR jobs for non-CLP engines

Early-returning here avoids submitting work that would inevitably fail, matching the stated objective to block IR extraction on CLP-S.

coderabbitai · 2025-08-14T19:36:32Z

+        dataset = parsed_args.dataset
+        dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
+        try:
+            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+        except Exception as e:
+            logger.error(e)
+            return -1


💡 Verification agent

🧩 Analysis chain

Dataset handling now confined to JSON path — tighten error handling/logging

This resolves the prior AttributeError (aligns with the retrieved learning) by keeping dataset usage solely in the JSON branch. Two small improvements:

Scope the exception to ValueError (what validate_dataset_name raises) and log with context for easier troubleshooting.

Optional: add a quick verification to ensure no remaining IR-path references to parsed_args.dataset.

Apply this diff to improve error handling:

- try: - clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) - validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) - except Exception as e: - logger.error(e) - return -1 + try: + clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) + validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) + except ValueError as e: + logger.error(f"Invalid dataset `{dataset}`: {e}") + return -1

Run this script to verify there are no IR-path references to parsed_args.dataset:

🏁 Script executed:

#!/bin/bash # Ensure dataset is only referenced in the JSON branch of decompress.py set -euo pipefail echo "Occurrences of 'parsed_args.dataset' with context:" rg -n -A 3 -B 3 $'parsed_args\\.dataset' components/clp-package-utils/clp_package_utils/scripts/decompress.py echo echo "Check the IR parser does not define --dataset:" rg -n -A 5 -B 5 $'ir_extraction_parser' components/clp-package-utils/clp_package_utils/scripts/decompress.py

Length of output: 1780

Tighten dataset error handling in JSON extraction branch

Verified: rg shows parsed_args.dataset only used in the JSON branch (decompress.py lines ~221–225); ir_extraction_parser does not define --dataset (decompress.py lines ~282–293). Apply the narrower exception handling and improved log message below.

Files to update

components/clp-package-utils/clp_package_utils/scripts/decompress.py — JSON extraction branch (around lines 221–225)

IR parser (around lines 282–293) — no changes required (no --dataset present)

Suggested diff:

- try: - clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) - validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) - except Exception as e: - logger.error(e) - return -1 + try: + clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) + validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) + except ValueError as e: + logger.error(f"Invalid dataset `{dataset}`: {e}") + return -1

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

dataset = parsed_args.dataset

dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset

try:

clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)

validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)

except Exception as e:

logger.error(e)

return -1

dataset = parsed_args.dataset

dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset

try:

clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)

validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)

except ValueError as e:

logger.error(f"Invalid dataset `{dataset}`: {e}")

return -1

kirkrodrigues

For the PR title, how about:

fix(package): Update `native/decompress.py` to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).

…ars for DB config (fixes y-scope#1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes y-scope#1200). (y-scope#1202)

fix(package): Update native/decompress.py to use CLI args and env v…

2320d64

…ars for DB config (fixes y-scope#1199); Fallback to `dataset=None` when IR extraction CLI parser is used (fixes y-scope#1200).

junhaoliao requested a review from a team as a code owner August 14, 2025 05:50

junhaoliao requested review from Bill-hbrhbr and kirkrodrigues August 14, 2025 05:50

coderabbitai Bot reviewed Aug 14, 2025

View reviewed changes

junhaoliao commented Aug 14, 2025

View reviewed changes

junhaoliao requested a review from haiqi96 August 14, 2025 15:31

kirkrodrigues reviewed Aug 14, 2025

View reviewed changes

Use clp_db_connection_params for db type - Apply suggestions from cod…

b6bcc76

…e review Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

coderabbitai Bot reviewed Aug 14, 2025

View reviewed changes

improve dataset handling for different storage engines

156ab4c

junhaoliao requested review from Bill-hbrhbr, haiqi96 and kirkrodrigues and removed request for Bill-hbrhbr and haiqi96 August 14, 2025 19:25

junhaoliao marked this pull request as draft August 14, 2025 19:26

junhaoliao self-assigned this Aug 14, 2025

junhaoliao marked this pull request as ready for review August 14, 2025 19:27

Merge branch 'main' into fix-decompress

6278f29

junhaoliao commented Aug 14, 2025

View reviewed changes

haiqi96 reviewed Aug 14, 2025

View reviewed changes

refactor(clp-package-utils): remove unnecessary dataset selection check

1dc2f2a

junhaoliao requested a review from haiqi96 August 14, 2025 19:31

coderabbitai Bot reviewed Aug 14, 2025

View reviewed changes

kirkrodrigues approved these changes Aug 14, 2025

View reviewed changes

junhaoliao merged commit 135c6bb into y-scope:main Aug 14, 2025
9 checks passed

junhaoliao deleted the fix-decompress branch August 14, 2025 20:21

coderabbitai Bot mentioned this pull request Aug 14, 2025

refactor(package): Standardize how configuration and credentials are passed to containers (resolves #1149). #1152

Merged

3 tasks

-    extract_env = {
-        **os.environ,
-        "CLP_DB_USER": clp_db_connection_params["username"],
-        "CLP_DB_PASS": clp_db_connection_params["password"],
-    }
+    extract_env = {**os.environ}
+    if clp_db_connection_params.get("username"):
+        extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
+    if clp_db_connection_params.get("password"):
+        extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]

Conversation

junhaoliao commented Aug 14, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

For CLP-Text

For CLP-JSON

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Bill-hbrhbr commented Aug 14, 2025

Uh oh!

junhaoliao commented Aug 14, 2025

Uh oh!

junhaoliao Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

junhaoliao commented Aug 14, 2025

Uh oh!

kirkrodrigues Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

junhaoliao Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

haiqi96 Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

junhaoliao Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

kirkrodrigues left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

junhaoliao commented Aug 14, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Aug 14, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

haiqi96 Aug 14, 2025 •

edited

Loading