Skip to content

fix(package): Update native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).#1202

Merged
junhaoliao merged 5 commits into
y-scope:mainfrom
junhaoliao:fix-decompress
Aug 14, 2025

Conversation

@junhaoliao

@junhaoliao junhaoliao commented Aug 14, 2025

Copy link
Copy Markdown
Member

Description

This PR addresses two issues:

  1. Fixes native/decompress.py still uses deprecated --db-config-file argument instead of command-line database parameters. #1199: The native/decompress.py script was still using the deprecated --db-config-file argument instead of command-line database parameters. This caused runtime errors since the core CLP components no longer accept the --db-config-file argument after the refactoring in feat(core)!: Replace YAML config with CLI args and env vars for metadata DB (resolves #1146). #1148. This PR updated native/decompress.py to pass database connection parameters as command-line arguments and environment variables instead of using a temporary YAML config file.

  2. Fixes IR extraction parser missing --dataset argument causes AttributeError #1200: The IR extraction parser was missing the --dataset argument, causing an AttributeError when trying to access parsed_args.dataset in the handle_extract_stream_cmd function (i.e., sbin/decompress.sh i). The fix uses getattr with a fallback to None when the dataset attribute is not available.

  3. Check if engine type before submitting an IR extraction job. If type is clp-s, error out and do not submit.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Built the CLP package with task package

For CLP-Text

  • cd build/clp-package/sbin
  • ./compress.sh ~/samples/hive-24hr
  • ./decompress.sh x
  • ./decompress.sh i --orig-file-id <id> 0
  • (-ve): ./decompress.sh j a8722927-820e-4d04-9b0e-bfb739fb91c9 -> 2025-08-14T15:17:20.637 ERROR [decompress] JSON extraction is not supported for storage engine clp.

For CLP-JSON

  • Changed clp-config.yml to start the package as CLP-JSON
    package:
      storage_engine: "clp-s"
      query_engine: "clp-s"
  • ./compress.sh ~/samples/postgresql/ --timestamp-key=timestamp
  • (-ve): ./decompress.sh x -> 2025-08-14T15:21:51.715 ERROR [decompress] File extraction is not supported for archive storage type fswith storage engineclp-s.
  • (-ve): ./decompress.sh i --orig-file-id <id> 0 -> 2025-08-14T15:24:21.453 ERROR [decompress] IR extraction is not supported for storage engine clp-s.
  • ./decompress.sh j <id>

Summary by CodeRabbit

  • New Features

    • Configure DB connection via CLI flags rather than a temporary config file.
    • Pass DB credentials to subprocesses through environment variables (CLP_DB_USER, CLP_DB_PASS).
  • Bug Fixes

    • Enforce storage-engine compatibility: IR extraction limited to the CLP engine with clear errors for invalid combos.
    • JSON extraction now defaults and validates the dataset before running; dataset argument is restricted for IR extraction.

…ars for DB config (fixes y-scope#1199); Fallback to `dataset=None` when IR extraction CLI parser is used (fixes y-scope#1200).
@junhaoliao junhaoliao requested a review from a team as a code owner August 14, 2025 05:50
@coderabbitai

coderabbitai Bot commented Aug 14, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Enforces storage-engine compatibility and stricter dataset handling in CLP decompress; IR extraction only allowed for CLP storage engine and forbids dataset for IR. Native decompress now passes DB connection params via CLI flags and supplies DB credentials via the subprocess environment instead of writing a temporary YAML config.

Changes

Cohort / File(s) Summary
CLP decompress control flow & dataset handling
components/clp-package-utils/clp_package_utils/scripts/decompress.py
Add guard: IR extraction only supported when storage engine is CLP (otherwise log error and return -1). Remove prior IR-path/CLP_S dataset logic. For JSON extraction, derive dataset from parsed_args (default to CLP_DEFAULT_DATASET_NAME), validate against CLP table_prefix (on failure log error and return -1), and unconditionally append dataset to the command.
Native decompress — DB params via CLI and env
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
Remove creation/cleanup of temporary YAML DB config and YAML import. Use get_clp_connection_params_and_type(True) to collect DB params and pass --db-type, --db-host, --db-port, --db-name, --db-table-prefix to the CLP CLI. Inject CLP_DB_USER/CLP_DB_PASS into subprocess environment (env=...) and pass env to subprocess.Popen; add os import.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Decompress as decompress.py
  participant CLPConfig
  User->>Decompress: run decompress (parsed_args, storage_engine)
  Decompress->>Decompress: if storage_engine == IR -> require StorageEngine.CLP
  alt IR & not CLP
    Decompress-->>User: log error + return -1
  else IR & CLP
    Decompress->>Decompress: ensure no dataset present; else error
    Decompress->>CLPConfig: build IR extract command
    Decompress->>User: run command / return status
  end
  alt JSON
    Decompress->>Decompress: dataset = parsed_args.dataset or CLP_DEFAULT_DATASET_NAME
    Decompress->>CLPConfig: validate dataset matches table_prefix
    alt validation fail
      Decompress-->>User: log error + return -1
    else
      Decompress->>Decompress: append dataset to extract command
      Decompress->>User: run command / return status
    end
  end
Loading
sequenceDiagram
  participant User
  participant Native as native/decompress.py
  participant CLPTool
  User->>Native: run native decompress
  Native->>Native: clp_db_connection_params = get_clp_connection_params_and_type(True)
  Native->>Native: build CLI args (--db-type/--db-host/--db-port/--db-name/--db-table-prefix)
  Native->>Native: extract_env = os.environ + {CLP_DB_USER, CLP_DB_PASS}
  Native->>CLPTool: invoke CLP CLI with args and env
  CLPTool-->>Native: exit status
  Native-->>User: return result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • kirkrodrigues
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

188-200: Fix: Safe dataset access prevents AttributeError on IR parser — LGTM; consider scoping validation to JSON only

Using getattr parsed_args, "dataset", None resolves the AttributeError when the IR parser lacks --dataset. To avoid unnecessary DB calls during IR extraction, you can scope dataset defaulting/validation to the JSON command only.

-    dataset = getattr(parsed_args, "dataset", None)
-    if StorageEngine.CLP_S == storage_engine:
-        dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
-        try:
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
-        except Exception as e:
-            logger.error(e)
-            return -1
-    elif dataset is not None:
-        logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
-        return -1
+    dataset = getattr(parsed_args, "dataset", None)
+    # Only resolve/validate dataset for JSON extraction to avoid extra DB calls on IR paths
+    if EXTRACT_JSON_CMD == job_command:
+        if StorageEngine.CLP_S == storage_engine:
+            dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
+            try:
+                clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+                validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+            except Exception as e:
+                logger.error(e)
+                return -1
+        elif dataset is not None:
+            logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.")
+            return -1
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

265-269: Unify subprocess handling with subprocess.run(check=True) for clearer error paths

Using subprocess.run with check=True simplifies error handling and aligns with the wrapper script.

-    proc = subprocess.Popen(extract_cmd, env=extract_env)
-    return_code = proc.wait()
-    if 0 != return_code:
-        logger.error(f"File extraction failed, return_code={return_code}")
-        return return_code
+    try:
+        subprocess.run(extract_cmd, env=extract_env, check=True)
+    except subprocess.CalledProcessError as e:
+        logger.error(f"File extraction failed, return_code={e.returncode}")
+        return e.returncode
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fe0224 and 2320d64.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
📚 Learning: 2025-08-14T05:06:58.977Z
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • get_clp_connection_params_and_type (209-230)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

245-250: Confirmed: CLP reads CLP_DB_USER and CLP_DB_PASS

Repo search shows these env vars are used across docs, tests and runtime code — no change needed.

  • components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") / getenv("CLP_DB_PASS") and validation (lines ~122–129, ~176–183)
  • components/core/tools/scripts/db/init-db.py — reads os.environ["CLP_DB_USER"] / ["CLP_DB_PASS"] (lines 35–36)
  • components/webui/server/src/plugins/external/mysql.ts — uses fastify.config.CLP_DB_USER / CLP_DB_PASS (lines 22–26)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets CLP_DB_USER / CLP_DB_PASS (lines 962–963)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets extract_env with CLP_DB_USER / CLP_DB_PASS (lines 245–248)
  • docs/src/user-guide/core-unstructured/clp.md — documents CLP_DB_USER and CLP_DB_PASS (lines 131–132)
  • components/core/tests/test-GlobalMetadataDBConfig.cpp — tests set/unset these env vars

Approved — credentials via env are correct and match CLP expectations.

Comment on lines +232 to 243
# Configure CLP metadata DB connection params.
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
# fmt: off
extract_cmd = [
str(clp_home / "bin" / "clp"),
"x", str(archives_dir), str(extraction_dir),
"--db-config-file", str(db_config_file_path),
"--db-type", "mysql",
"--db-host", clp_db_connection_params["host"],
"--db-port", str(clp_db_connection_params["port"]),
"--db-name", clp_db_connection_params["name"],
"--db-table-prefix", clp_db_connection_params["table_prefix"],
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Switch to CLI DB params: verify flag support and consider passing optional SSL cert

The migration away from --db-config-file looks correct. Two follow-ups:

  • Verify the clp binary accepts these flags.
  • If SSL is configured, consider passing it via a dedicated flag (e.g., --db-ssl-cert) if supported.

Proposed enhancement (guarded by CLI support):

-        "--db-type", "mysql",
+        "--db-type", clp_db_connection_params["type"],
         "--db-host", clp_db_connection_params["host"],
         "--db-port", str(clp_db_connection_params["port"]),
         "--db-name", clp_db_connection_params["name"],
         "--db-table-prefix", clp_db_connection_params["table_prefix"],
+    ] + (
+        ["--db-ssl-cert", clp_db_connection_params["ssl_cert"]]
+        if "ssl_cert" in clp_db_connection_params
+        else []
     )

Run this script to verify flag names in the repo:


🏁 Script executed:

#!/bin/bash
# Verify clp binary (core) recognizes new DB flags
rg -n --ignore-case -- 'db-type|db-host|db-port|db-name|db-table-prefix|db-ssl-cert' -A 3 -B 3

Length of output: 266


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "==== Show snippet from decompress.py ===="
sed -n '1,400p' components/clp-package-utils/clp_package_utils/scripts/native/decompress.py || true

echo
echo "==== Search for get_clp_connection_params_and_type ===="
rg -n --hidden --no-ignore 'get_clp_connection_params_and_type' || true

echo
echo "==== Search for ssl-related keys/flags in repo ===="
rg -n --hidden --no-ignore 'ssl_cert|ssl-cert|ssl_cert_path|ssl_cert_file|--db-ssl-cert' || true

echo
echo "==== Search for db-config-file and db- flags in repo ===="
rg -n --hidden --no-ignore '--db-config-file|--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix' || true

echo
echo "==== Search for other clp invocations or docs containing db flags ===="
rg -n --hidden --no-ignore '\bclp\b' -S || true

Length of output: 269325


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Show clp CommandLineArguments.cpp (first 400 lines) ==="
sed -n '1,400p' components/core/src/clp/clp/CommandLineArguments.cpp || true

echo
echo "=== Search for DB-related CLI flags in core sources ==="
rg -n --hidden --no-ignore --no-heading --column '(--db-|db-|db_)' components/core/src || true

echo
echo "=== Show CommandLineArguments.hpp ==="
sed -n '1,300p' components/core/src/clp/clp/CommandLineArguments.hpp || true

echo
echo "=== Search for db-ssl or ssl_cert in C++ sources ==="
rg -n --hidden --no-ignore --no-heading --column 'db-ssl|ssl_cert|ssl-cert|ssl' components/core/src || true

Length of output: 70284


Switch to CLI DB params: clp supports --db-*; handle sqlite and SSL

Quick summary:

  • I verified the CLP core exposes the CLI flags --db-type, --db-host, --db-port, --db-name and --db-table-prefix (see components/core/src/clp/GlobalMetadataDBConfig.cpp).
  • I found no evidence of a --db-ssl-cert CLI flag in the repo (no matches for db-ssl / db-ssl-cert / ssl_cert in core sources). The Python config code (components/clp-py-utils/clp_py_utils/clp_config.py) can return ssl_cert, though.

What to change

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (around the snippet you flagged):
    • Replace the hard-coded "--db-type", "mysql" with the actual type from clp_db_connection_params["type"].
    • Only append the MySQL-style flags (host/port/name/table-prefix) when the DB type is MySQL/MariaDB — do not pass them for SQLite (CLP will use the local metadata DB file).
    • If clp_db_connection_params contains an SSL cert, do NOT append a non-existent --db-ssl-cert flag (core doesn't support it). Instead either:
      • write a temporary DB config file including the ssl_cert and pass it via --db-config-file (only when SSL is present), or
      • update the CLP core to accept a dedicated --db-ssl-cert flag and then pass it from here.

Suggested code change (concise diff):

-    # fmt: off
-    extract_cmd = [
-        str(clp_home / "bin" / "clp"),
-        "x", str(archives_dir), str(extraction_dir),
-        "--db-type", "mysql",
-        "--db-host", clp_db_connection_params["host"],
-        "--db-port", str(clp_db_connection_params["port"]),
-        "--db-name", clp_db_connection_params["name"],
-        "--db-table-prefix", clp_db_connection_params["table_prefix"],
-    ]
-    # fmt: on
+    # fmt: off
+    extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]
+    db_type = clp_db_connection_params["type"]
+    if db_type.lower() in ("mysql", "mariadb"):
+        extract_cmd += [
+            "--db-type", db_type,
+            "--db-host", clp_db_connection_params["host"],
+            "--db-port", str(clp_db_connection_params["port"]),
+            "--db-name", clp_db_connection_params["name"],
+            "--db-table-prefix", clp_db_connection_params["table_prefix"],
+        ]
+    # fmt: on

If SSL is configured, a minimal approach without changing core:

  • Serialize the DB connection (including ssl_cert) to a temporary config file and pass --db-config-file for this invocation, then delete the temp file after the process finishes.

Files worth reviewing / touching

  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — update as above (lines ~232–243).
  • components/core/src/clp/GlobalMetadataDBConfig.cpp — confirms supported CLI flags and enforces that MySQL-specific options aren’t used with SQLite (consider adding db-ssl support here if desired).
  • components/clp-py-utils/clp_py_utils/clp_config.py — contains ssl_cert in returned params (used to decide whether to use the tmp config-file fallback).

Tag:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Configure CLP metadata DB connection params.
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
# fmt: off
extract_cmd = [
str(clp_home / "bin" / "clp"),
"x", str(archives_dir), str(extraction_dir),
"--db-config-file", str(db_config_file_path),
"--db-type", "mysql",
"--db-host", clp_db_connection_params["host"],
"--db-port", str(clp_db_connection_params["port"]),
"--db-name", clp_db_connection_params["name"],
"--db-table-prefix", clp_db_connection_params["table_prefix"],
]
# Configure CLP metadata DB connection params.
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
# fmt: off
extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]
db_type = clp_db_connection_params["type"]
if db_type.lower() in ("mysql", "mariadb"):
extract_cmd += [
"--db-type", db_type,
"--db-host", clp_db_connection_params["host"],
"--db-port", str(clp_db_connection_params["port"]),
"--db-name", clp_db_connection_params["name"],
"--db-table-prefix", clp_db_connection_params["table_prefix"],
]
# fmt: on

@Bill-hbrhbr

Copy link
Copy Markdown
Contributor

Hi Junhao, I think we currently do not support --dataset with IR extraction. Let me know if I'm wrong. We can confirm with @kirkrodrigues
If we do support --dataset with IR extraction, I'll help review the details of this PR.

@junhaoliao

Copy link
Copy Markdown
Member Author

Hi Junhao, I think we currently do not support --dataset with IR extraction. Let me know if I'm wrong. We can confirm with @kirkrodrigues If we do support --dataset with IR extraction, I'll help review the details of this PR.

Sorry for the confusion. The fix is not to make dataset work with CLP Text. Rather, the ‘parsed_args.dataset’ access yields an exception whenever the command ‘i’ is specified. Maybe I wasn’t using the command in the right way? Still if it throws an exception I think we should either catch-and-handle or avoid

return -1

dataset = parsed_args.dataset
dataset = getattr(parsed_args, "dataset", None)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here ‘parsed_args’ may not always have an attribute named « dataset »

@junhaoliao junhaoliao requested a review from haiqi96 August 14, 2025 15:31
@junhaoliao

Copy link
Copy Markdown
Member Author

++ @haiqi96 since the changes are package related

return -1

dataset = parsed_args.dataset
dataset = getattr(parsed_args, "dataset", None)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps to improve readability and reduce errors like this in the future, we can move any IR-specific stuff into the IR block on line 220. And similarly for the JSON block.

Comment thread components/clp-package-utils/clp_package_utils/scripts/native/decompress.py Outdated
…e review

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

265-275: Ensure temp file cleanup on errors by using try/finally

If Popen raises (e.g., missing binary) or the process crashes, files_to_extract_list_path may leak. Use a finally block to guarantee cleanup.

Apply:

-    proc = subprocess.Popen(extract_cmd, env=extract_env)
-    return_code = proc.wait()
-    if 0 != return_code:
-        logger.error(f"File extraction failed, return_code={return_code}")
-        return return_code
-
-    # Remove generated files
-    if files_to_extract_list_path is not None:
-        files_to_extract_list_path.unlink()
-
-    return 0
+    try:
+        proc = subprocess.Popen(extract_cmd, env=extract_env)
+        return_code = proc.wait()
+        if return_code != 0:
+            logger.error(f"File extraction failed, return_code={return_code}")
+            return return_code
+        return 0
+    finally:
+        # Remove generated files
+        if files_to_extract_list_path is not None and files_to_extract_list_path.exists():
+            files_to_extract_list_path.unlink()
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

232-244: Guard DB flag construction by backend type; account for SSL cert presence

Good move switching to CLI flags and using the actual db type from config. Two follow-ups:

  • Only pass MySQL-specific flags when the backend is MySQL/MariaDB (future-proofs for other backends).
  • clp_config may include ssl_cert (see clp_py_utils.clp_config.get_clp_connection_params_and_type). Core doesn’t expose an SSL flag today, so at minimum warn when ssl_cert is configured but won’t be propagated.

Apply this diff to make flag construction backend-aware and warn on SSL cert presence:

@@
-    # Configure CLP metadata DB connection params.
-    clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-    # fmt: off
-    extract_cmd = [
-        str(clp_home / "bin" / "clp"),
-        "x", str(archives_dir), str(extraction_dir),
-        "--db-type", clp_db_connection_params["type"],
-        "--db-host", clp_db_connection_params["host"],
-        "--db-port", str(clp_db_connection_params["port"]),
-        "--db-name", clp_db_connection_params["name"],
-        "--db-table-prefix", clp_db_connection_params["table_prefix"],
-    ]
-    # fmt: on
+    # Configure CLP metadata DB connection params.
+    clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+    db_type = clp_db_connection_params["type"].lower()
+    if "ssl_cert" in clp_db_connection_params:
+        logger.warning(
+            "A DB SSL certificate is configured but is not currently propagated to clp via CLI/env."
+            " Ensure core supports SSL via CLI/env or handle it out-of-band."
+        )
+    # fmt: off
+    extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]
+    if db_type in ("mysql", "mariadb"):
+        extract_cmd += [
+            "--db-type", db_type,
+            "--db-host", clp_db_connection_params["host"],
+            "--db-port", str(clp_db_connection_params["port"]),
+            "--db-name", clp_db_connection_params["name"],
+            "--db-table-prefix", clp_db_connection_params["table_prefix"],
+        ]
+    # fmt: on

If you want us to confirm core flag support and whether any SSL-related flags exist, run:

#!/bin/bash
# Verify core supports these DB flags and lacks SSL flags
rg -n --hidden --no-ignore --no-heading -S -- '
(--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix|db-ssl|--db-ssl|ssl_cert|ssl-cert)
' components/core/src || true
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled
  • Linear integration is disabled

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2320d64 and b6bcc76.

📒 Files selected for processing (1)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • get_clp_connection_params_and_type (209-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

4-4: LGTM: import usage justified

Importing os is necessary for constructing the child process environment.

Comment on lines +245 to +249
extract_env = {
**os.environ,
"CLP_DB_USER": clp_db_connection_params["username"],
"CLP_DB_PASS": clp_db_connection_params["password"],
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Confirm env var names for credentials are honoured by clp core

You’re passing credentials via CLP_DB_USER/CLP_DB_PASS, which is sensible. Please verify clp core actually reads these names; otherwise, extraction will fail silently with auth errors.

Run this to confirm the env variable names are used anywhere in the repo:

Optional hardening: only set env overrides when values are non-empty.

-    extract_env = {
-        **os.environ,
-        "CLP_DB_USER": clp_db_connection_params["username"],
-        "CLP_DB_PASS": clp_db_connection_params["password"],
-    }
+    extract_env = {**os.environ}
+    if clp_db_connection_params.get("username"):
+        extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
+    if clp_db_connection_params.get("password"):
+        extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]

🏁 Script executed:

#!/bin/bash
# Search for the exact env var names in core and utils
rg -n --hidden --no-ignore -S 'CLP_DB_USER|CLP_DB_PASS' || true

# Also look for getenv calls that may read them implicitly
rg -n --hidden --no-ignore -S 'getenv\\("CLP_DB_(USER|PASS)"\\)' components/core/src || true

Length of output: 5718


Confirmed: core honours CLP_DB_USER / CLP_DB_PASS — optional hardening suggested

Confirmed — the repo reads CLP_DB_USER and CLP_DB_PASS (core uses getenv), so the env assignments in decompress.py are correct. I recommend optionally guarding against empty username/password to avoid overriding existing env vars with empty values.

Files to note:

  • components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") (line 122), getenv("CLP_DB_PASS") (line 128); throws if not set (lines 178/182).
  • components/core/tools/scripts/db/init-db.py — username/password via os.environ["CLP_DB_USER"/"CLP_DB_PASS"] (lines 35–36).
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets CLP_DB_USER/CLP_DB_PASS (lines 247–248).
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets env (lines 962–963).
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py — sets CLP_DB_USER/CLP_DB_PASS (lines 214–215).
  • components/webui/server/src/plugins/external/mysql.ts — reads CLP_DB_PASS / CLP_DB_USER (lines 22, 25).
  • docs/src/user-guide/core-unstructured/clp.md — documents these env vars (lines 131–132).
  • components/core/tests/test-GlobalMetadataDBConfig.cpp — tests that set/unset these vars.

Suggested (optional) hardening — only set env overrides when values are non-empty:

-    extract_env = {
-        **os.environ,
-        "CLP_DB_USER": clp_db_connection_params["username"],
-        "CLP_DB_PASS": clp_db_connection_params["password"],
-    }
+    extract_env = {**os.environ}
+    if clp_db_connection_params.get("username"):
+        extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
+    if clp_db_connection_params.get("password"):
+        extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
extract_env = {
**os.environ,
"CLP_DB_USER": clp_db_connection_params["username"],
"CLP_DB_PASS": clp_db_connection_params["password"],
}
extract_env = {**os.environ}
if clp_db_connection_params.get("username"):
extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
if clp_db_connection_params.get("password"):
extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
around lines 245 to 249, the code unconditionally writes CLP_DB_USER/CLP_DB_PASS
into the environment which can override existing values with empty strings;
change it to only set/override os.environ entries when the corresponding
clp_db_connection_params["username"] or ["password"] is non-empty (e.g. check
for None/empty after stripping) so existing env vars are preserved if the
provided values are empty.

@junhaoliao junhaoliao requested review from Bill-hbrhbr, haiqi96 and kirkrodrigues and removed request for Bill-hbrhbr and haiqi96 August 14, 2025 19:25
@junhaoliao junhaoliao marked this pull request as draft August 14, 2025 19:26
@junhaoliao junhaoliao self-assigned this Aug 14, 2025
@junhaoliao junhaoliao marked this pull request as ready for review August 14, 2025 19:27
return -1

job_command = parsed_args.command
if EXTRACT_IR_CMD == job_command and StorageEngine.CLP != storage_engine:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added to prevent fail after job submission

# fmt: on

if EXTRACT_IR_CMD == job_command:
if hasattr(parsed_args, "dataset"):

@haiqi96 haiqi96 Aug 14, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check this?
My understanding is that since we don;t explicilty add dataset when target is IR, the script will return an error if we try to specify --dataset on the cmdline if target is IR
but I could be wrong.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right no need to check this. the attr won't be added for command i anyways

@junhaoliao junhaoliao requested a review from haiqi96 August 14, 2025 19:31

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

231-235: Drop redundant None-check when appending --dataset

Given you default dataset to CLP_DEFAULT_DATASET_NAME, dataset is never None here. Simplify by appending unconditionally.

-        if dataset is not None:
-            extract_cmd.append("--dataset")
-            extract_cmd.append(dataset)
+        extract_cmd.append("--dataset")
+        extract_cmd.append(dataset)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled
  • Linear integration is disabled

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 156ab4c and 1dc2f2a.

📒 Files selected for processing (1)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
📚 Learning: 2025-08-14T05:06:58.977Z
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • StorageEngine (79-81)
  • get_clp_connection_params_and_type (209-230)
components/clp-package-utils/clp_package_utils/general.py (1)
  • validate_dataset_name (556-582)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

184-186: Good guard: prevent unsupported IR jobs for non-CLP engines

Early-returning here avoids submitting work that would inevitably fail, matching the stated objective to block IR extraction on CLP-S.

Comment on lines +222 to +229
dataset = parsed_args.dataset
dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
try:
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
except Exception as e:
logger.error(e)
return -1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Dataset handling now confined to JSON path — tighten error handling/logging

This resolves the prior AttributeError (aligns with the retrieved learning) by keeping dataset usage solely in the JSON branch. Two small improvements:

  • Scope the exception to ValueError (what validate_dataset_name raises) and log with context for easier troubleshooting.
  • Optional: add a quick verification to ensure no remaining IR-path references to parsed_args.dataset.

Apply this diff to improve error handling:

-        try:
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
-        except Exception as e:
-            logger.error(e)
-            return -1
+        try:
+            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+        except ValueError as e:
+            logger.error(f"Invalid dataset `{dataset}`: {e}")
+            return -1

Run this script to verify there are no IR-path references to parsed_args.dataset:


🏁 Script executed:

#!/bin/bash
# Ensure dataset is only referenced in the JSON branch of decompress.py
set -euo pipefail

echo "Occurrences of 'parsed_args.dataset' with context:"
rg -n -A 3 -B 3 $'parsed_args\\.dataset' components/clp-package-utils/clp_package_utils/scripts/decompress.py

echo
echo "Check the IR parser does not define --dataset:"
rg -n -A 5 -B 5 $'ir_extraction_parser' components/clp-package-utils/clp_package_utils/scripts/decompress.py

Length of output: 1780


Tighten dataset error handling in JSON extraction branch

Verified: rg shows parsed_args.dataset only used in the JSON branch (decompress.py lines ~221–225); ir_extraction_parser does not define --dataset (decompress.py lines ~282–293). Apply the narrower exception handling and improved log message below.

Files to update

  • components/clp-package-utils/clp_package_utils/scripts/decompress.py — JSON extraction branch (around lines 221–225)
  • IR parser (around lines 282–293) — no changes required (no --dataset present)

Suggested diff:

-        try:
-            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
-            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
-        except Exception as e:
-            logger.error(e)
-            return -1
+        try:
+            clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+            validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+        except ValueError as e:
+            logger.error(f"Invalid dataset `{dataset}`: {e}")
+            return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dataset = parsed_args.dataset
dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
try:
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
except Exception as e:
logger.error(e)
return -1
dataset = parsed_args.dataset
dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset
try:
clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
except ValueError as e:
logger.error(f"Invalid dataset `{dataset}`: {e}")
return -1

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

fix(package): Update `native/decompress.py` to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).

@junhaoliao junhaoliao changed the title fix(package): Update native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Fallback to dataset=None when IR extraction CLI parser is used (fixes #1200). fix(package): Update native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200). Aug 14, 2025
@junhaoliao junhaoliao merged commit 135c6bb into y-scope:main Aug 14, 2025
9 checks passed
@junhaoliao junhaoliao deleted the fix-decompress branch August 14, 2025 20:21
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…ars for DB config (fixes y-scope#1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes y-scope#1200). (y-scope#1202)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants