fix(package): Update native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).#1202
Conversation
…ars for DB config (fixes y-scope#1199); Fallback to `dataset=None` when IR extraction CLI parser is used (fixes y-scope#1200).
WalkthroughEnforces storage-engine compatibility and stricter dataset handling in CLP decompress; IR extraction only allowed for CLP storage engine and forbids dataset for IR. Native decompress now passes DB connection params via CLI flags and supplies DB credentials via the subprocess environment instead of writing a temporary YAML config. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Decompress as decompress.py
participant CLPConfig
User->>Decompress: run decompress (parsed_args, storage_engine)
Decompress->>Decompress: if storage_engine == IR -> require StorageEngine.CLP
alt IR & not CLP
Decompress-->>User: log error + return -1
else IR & CLP
Decompress->>Decompress: ensure no dataset present; else error
Decompress->>CLPConfig: build IR extract command
Decompress->>User: run command / return status
end
alt JSON
Decompress->>Decompress: dataset = parsed_args.dataset or CLP_DEFAULT_DATASET_NAME
Decompress->>CLPConfig: validate dataset matches table_prefix
alt validation fail
Decompress-->>User: log error + return -1
else
Decompress->>Decompress: append dataset to extract command
Decompress->>User: run command / return status
end
end
sequenceDiagram
participant User
participant Native as native/decompress.py
participant CLPTool
User->>Native: run native decompress
Native->>Native: clp_db_connection_params = get_clp_connection_params_and_type(True)
Native->>Native: build CLI args (--db-type/--db-host/--db-port/--db-name/--db-table-prefix)
Native->>Native: extract_env = os.environ + {CLP_DB_USER, CLP_DB_PASS}
Native->>CLPTool: invoke CLP CLI with args and env
CLPTool-->>Native: exit status
Native-->>User: return result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
188-200: Fix: Safe dataset access prevents AttributeError on IR parser — LGTM; consider scoping validation to JSON onlyUsing getattr parsed_args, "dataset", None resolves the AttributeError when the IR parser lacks --dataset. To avoid unnecessary DB calls during IR extraction, you can scope dataset defaulting/validation to the JSON command only.
- dataset = getattr(parsed_args, "dataset", None) - if StorageEngine.CLP_S == storage_engine: - dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset - try: - clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) - validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) - except Exception as e: - logger.error(e) - return -1 - elif dataset is not None: - logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.") - return -1 + dataset = getattr(parsed_args, "dataset", None) + # Only resolve/validate dataset for JSON extraction to avoid extra DB calls on IR paths + if EXTRACT_JSON_CMD == job_command: + if StorageEngine.CLP_S == storage_engine: + dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset + try: + clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) + validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) + except Exception as e: + logger.error(e) + return -1 + elif dataset is not None: + logger.error(f"Dataset selection is not supported for storage engine: {storage_engine}.") + return -1components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
265-269: Unify subprocess handling with subprocess.run(check=True) for clearer error pathsUsing subprocess.run with check=True simplifies error handling and aligns with the wrapper script.
- proc = subprocess.Popen(extract_cmd, env=extract_env) - return_code = proc.wait() - if 0 != return_code: - logger.error(f"File extraction failed, return_code={return_code}") - return return_code + try: + subprocess.run(extract_cmd, env=extract_env, check=True) + except subprocess.CalledProcessError as e: + logger.error(f"File extraction failed, return_code={e.returncode}") + return e.returncode
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/scripts/decompress.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/native/decompress.py(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
📚 Learning: 2025-08-14T05:06:58.977Z
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
get_clp_connection_params_and_type(209-230)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
245-250: Confirmed: CLP reads CLP_DB_USER and CLP_DB_PASSRepo search shows these env vars are used across docs, tests and runtime code — no change needed.
- components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") / getenv("CLP_DB_PASS") and validation (lines ~122–129, ~176–183)
- components/core/tools/scripts/db/init-db.py — reads os.environ["CLP_DB_USER"] / ["CLP_DB_PASS"] (lines 35–36)
- components/webui/server/src/plugins/external/mysql.ts — uses fastify.config.CLP_DB_USER / CLP_DB_PASS (lines 22–26)
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets CLP_DB_USER / CLP_DB_PASS (lines 962–963)
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets extract_env with CLP_DB_USER / CLP_DB_PASS (lines 245–248)
- docs/src/user-guide/core-unstructured/clp.md — documents CLP_DB_USER and CLP_DB_PASS (lines 131–132)
- components/core/tests/test-GlobalMetadataDBConfig.cpp — tests set/unset these env vars
Approved — credentials via env are correct and match CLP expectations.
| # Configure CLP metadata DB connection params. | ||
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | ||
| # fmt: off | ||
| extract_cmd = [ | ||
| str(clp_home / "bin" / "clp"), | ||
| "x", str(archives_dir), str(extraction_dir), | ||
| "--db-config-file", str(db_config_file_path), | ||
| "--db-type", "mysql", | ||
| "--db-host", clp_db_connection_params["host"], | ||
| "--db-port", str(clp_db_connection_params["port"]), | ||
| "--db-name", clp_db_connection_params["name"], | ||
| "--db-table-prefix", clp_db_connection_params["table_prefix"], | ||
| ] |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Switch to CLI DB params: verify flag support and consider passing optional SSL cert
The migration away from --db-config-file looks correct. Two follow-ups:
- Verify the clp binary accepts these flags.
- If SSL is configured, consider passing it via a dedicated flag (e.g., --db-ssl-cert) if supported.
Proposed enhancement (guarded by CLI support):
- "--db-type", "mysql",
+ "--db-type", clp_db_connection_params["type"],
"--db-host", clp_db_connection_params["host"],
"--db-port", str(clp_db_connection_params["port"]),
"--db-name", clp_db_connection_params["name"],
"--db-table-prefix", clp_db_connection_params["table_prefix"],
+ ] + (
+ ["--db-ssl-cert", clp_db_connection_params["ssl_cert"]]
+ if "ssl_cert" in clp_db_connection_params
+ else []
)Run this script to verify flag names in the repo:
🏁 Script executed:
#!/bin/bash
# Verify clp binary (core) recognizes new DB flags
rg -n --ignore-case -- 'db-type|db-host|db-port|db-name|db-table-prefix|db-ssl-cert' -A 3 -B 3Length of output: 266
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "==== Show snippet from decompress.py ===="
sed -n '1,400p' components/clp-package-utils/clp_package_utils/scripts/native/decompress.py || true
echo
echo "==== Search for get_clp_connection_params_and_type ===="
rg -n --hidden --no-ignore 'get_clp_connection_params_and_type' || true
echo
echo "==== Search for ssl-related keys/flags in repo ===="
rg -n --hidden --no-ignore 'ssl_cert|ssl-cert|ssl_cert_path|ssl_cert_file|--db-ssl-cert' || true
echo
echo "==== Search for db-config-file and db- flags in repo ===="
rg -n --hidden --no-ignore '--db-config-file|--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix' || true
echo
echo "==== Search for other clp invocations or docs containing db flags ===="
rg -n --hidden --no-ignore '\bclp\b' -S || trueLength of output: 269325
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Show clp CommandLineArguments.cpp (first 400 lines) ==="
sed -n '1,400p' components/core/src/clp/clp/CommandLineArguments.cpp || true
echo
echo "=== Search for DB-related CLI flags in core sources ==="
rg -n --hidden --no-ignore --no-heading --column '(--db-|db-|db_)' components/core/src || true
echo
echo "=== Show CommandLineArguments.hpp ==="
sed -n '1,300p' components/core/src/clp/clp/CommandLineArguments.hpp || true
echo
echo "=== Search for db-ssl or ssl_cert in C++ sources ==="
rg -n --hidden --no-ignore --no-heading --column 'db-ssl|ssl_cert|ssl-cert|ssl' components/core/src || trueLength of output: 70284
Switch to CLI DB params: clp supports --db-*; handle sqlite and SSL
Quick summary:
- I verified the CLP core exposes the CLI flags --db-type, --db-host, --db-port, --db-name and --db-table-prefix (see components/core/src/clp/GlobalMetadataDBConfig.cpp).
- I found no evidence of a --db-ssl-cert CLI flag in the repo (no matches for db-ssl / db-ssl-cert / ssl_cert in core sources). The Python config code (components/clp-py-utils/clp_py_utils/clp_config.py) can return ssl_cert, though.
What to change
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (around the snippet you flagged):
- Replace the hard-coded "--db-type", "mysql" with the actual type from clp_db_connection_params["type"].
- Only append the MySQL-style flags (host/port/name/table-prefix) when the DB type is MySQL/MariaDB — do not pass them for SQLite (CLP will use the local metadata DB file).
- If clp_db_connection_params contains an SSL cert, do NOT append a non-existent --db-ssl-cert flag (core doesn't support it). Instead either:
- write a temporary DB config file including the ssl_cert and pass it via --db-config-file (only when SSL is present), or
- update the CLP core to accept a dedicated --db-ssl-cert flag and then pass it from here.
Suggested code change (concise diff):
- # fmt: off
- extract_cmd = [
- str(clp_home / "bin" / "clp"),
- "x", str(archives_dir), str(extraction_dir),
- "--db-type", "mysql",
- "--db-host", clp_db_connection_params["host"],
- "--db-port", str(clp_db_connection_params["port"]),
- "--db-name", clp_db_connection_params["name"],
- "--db-table-prefix", clp_db_connection_params["table_prefix"],
- ]
- # fmt: on
+ # fmt: off
+ extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)]
+ db_type = clp_db_connection_params["type"]
+ if db_type.lower() in ("mysql", "mariadb"):
+ extract_cmd += [
+ "--db-type", db_type,
+ "--db-host", clp_db_connection_params["host"],
+ "--db-port", str(clp_db_connection_params["port"]),
+ "--db-name", clp_db_connection_params["name"],
+ "--db-table-prefix", clp_db_connection_params["table_prefix"],
+ ]
+ # fmt: onIf SSL is configured, a minimal approach without changing core:
- Serialize the DB connection (including ssl_cert) to a temporary config file and pass --db-config-file for this invocation, then delete the temp file after the process finishes.
Files worth reviewing / touching
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — update as above (lines ~232–243).
- components/core/src/clp/GlobalMetadataDBConfig.cpp — confirms supported CLI flags and enforces that MySQL-specific options aren’t used with SQLite (consider adding db-ssl support here if desired).
- components/clp-py-utils/clp_py_utils/clp_config.py — contains ssl_cert in returned params (used to decide whether to use the tmp config-file fallback).
Tag:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Configure CLP metadata DB connection params. | |
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | |
| # fmt: off | |
| extract_cmd = [ | |
| str(clp_home / "bin" / "clp"), | |
| "x", str(archives_dir), str(extraction_dir), | |
| "--db-config-file", str(db_config_file_path), | |
| "--db-type", "mysql", | |
| "--db-host", clp_db_connection_params["host"], | |
| "--db-port", str(clp_db_connection_params["port"]), | |
| "--db-name", clp_db_connection_params["name"], | |
| "--db-table-prefix", clp_db_connection_params["table_prefix"], | |
| ] | |
| # Configure CLP metadata DB connection params. | |
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | |
| # fmt: off | |
| extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)] | |
| db_type = clp_db_connection_params["type"] | |
| if db_type.lower() in ("mysql", "mariadb"): | |
| extract_cmd += [ | |
| "--db-type", db_type, | |
| "--db-host", clp_db_connection_params["host"], | |
| "--db-port", str(clp_db_connection_params["port"]), | |
| "--db-name", clp_db_connection_params["name"], | |
| "--db-table-prefix", clp_db_connection_params["table_prefix"], | |
| ] | |
| # fmt: on |
|
Hi Junhao, I think we currently do not support |
Sorry for the confusion. The fix is not to make dataset work with CLP Text. Rather, the ‘parsed_args.dataset’ access yields an exception whenever the command ‘i’ is specified. Maybe I wasn’t using the command in the right way? Still if it throws an exception I think we should either catch-and-handle or avoid |
| return -1 | ||
|
|
||
| dataset = parsed_args.dataset | ||
| dataset = getattr(parsed_args, "dataset", None) |
There was a problem hiding this comment.
Here ‘parsed_args’ may not always have an attribute named « dataset »
|
++ @haiqi96 since the changes are package related |
| return -1 | ||
|
|
||
| dataset = parsed_args.dataset | ||
| dataset = getattr(parsed_args, "dataset", None) |
There was a problem hiding this comment.
Perhaps to improve readability and reduce errors like this in the future, we can move any IR-specific stuff into the IR block on line 220. And similarly for the JSON block.
…e review Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
265-275: Ensure temp file cleanup on errors by using try/finallyIf Popen raises (e.g., missing binary) or the process crashes, files_to_extract_list_path may leak. Use a finally block to guarantee cleanup.
Apply:
- proc = subprocess.Popen(extract_cmd, env=extract_env) - return_code = proc.wait() - if 0 != return_code: - logger.error(f"File extraction failed, return_code={return_code}") - return return_code - - # Remove generated files - if files_to_extract_list_path is not None: - files_to_extract_list_path.unlink() - - return 0 + try: + proc = subprocess.Popen(extract_cmd, env=extract_env) + return_code = proc.wait() + if return_code != 0: + logger.error(f"File extraction failed, return_code={return_code}") + return return_code + return 0 + finally: + # Remove generated files + if files_to_extract_list_path is not None and files_to_extract_list_path.exists(): + files_to_extract_list_path.unlink()
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
232-244: Guard DB flag construction by backend type; account for SSL cert presenceGood move switching to CLI flags and using the actual db type from config. Two follow-ups:
- Only pass MySQL-specific flags when the backend is MySQL/MariaDB (future-proofs for other backends).
- clp_config may include ssl_cert (see clp_py_utils.clp_config.get_clp_connection_params_and_type). Core doesn’t expose an SSL flag today, so at minimum warn when ssl_cert is configured but won’t be propagated.
Apply this diff to make flag construction backend-aware and warn on SSL cert presence:
@@ - # Configure CLP metadata DB connection params. - clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) - # fmt: off - extract_cmd = [ - str(clp_home / "bin" / "clp"), - "x", str(archives_dir), str(extraction_dir), - "--db-type", clp_db_connection_params["type"], - "--db-host", clp_db_connection_params["host"], - "--db-port", str(clp_db_connection_params["port"]), - "--db-name", clp_db_connection_params["name"], - "--db-table-prefix", clp_db_connection_params["table_prefix"], - ] - # fmt: on + # Configure CLP metadata DB connection params. + clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) + db_type = clp_db_connection_params["type"].lower() + if "ssl_cert" in clp_db_connection_params: + logger.warning( + "A DB SSL certificate is configured but is not currently propagated to clp via CLI/env." + " Ensure core supports SSL via CLI/env or handle it out-of-band." + ) + # fmt: off + extract_cmd = [str(clp_home / "bin" / "clp"), "x", str(archives_dir), str(extraction_dir)] + if db_type in ("mysql", "mariadb"): + extract_cmd += [ + "--db-type", db_type, + "--db-host", clp_db_connection_params["host"], + "--db-port", str(clp_db_connection_params["port"]), + "--db-name", clp_db_connection_params["name"], + "--db-table-prefix", clp_db_connection_params["table_prefix"], + ] + # fmt: onIf you want us to confirm core flag support and whether any SSL-related flags exist, run:
#!/bin/bash # Verify core supports these DB flags and lacks SSL flags rg -n --hidden --no-ignore --no-heading -S -- ' (--db-type|--db-host|--db-port|--db-name|--db-table-prefix|db_type|db_host|db_port|db_name|table_prefix|db-ssl|--db-ssl|ssl_cert|ssl-cert) ' components/core/src || true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled
- Linear integration is disabled
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
get_clp_connection_params_and_type(209-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)
4-4: LGTM: import usage justifiedImporting os is necessary for constructing the child process environment.
| extract_env = { | ||
| **os.environ, | ||
| "CLP_DB_USER": clp_db_connection_params["username"], | ||
| "CLP_DB_PASS": clp_db_connection_params["password"], | ||
| } |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Confirm env var names for credentials are honoured by clp core
You’re passing credentials via CLP_DB_USER/CLP_DB_PASS, which is sensible. Please verify clp core actually reads these names; otherwise, extraction will fail silently with auth errors.
Run this to confirm the env variable names are used anywhere in the repo:
Optional hardening: only set env overrides when values are non-empty.
- extract_env = {
- **os.environ,
- "CLP_DB_USER": clp_db_connection_params["username"],
- "CLP_DB_PASS": clp_db_connection_params["password"],
- }
+ extract_env = {**os.environ}
+ if clp_db_connection_params.get("username"):
+ extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
+ if clp_db_connection_params.get("password"):
+ extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]🏁 Script executed:
#!/bin/bash
# Search for the exact env var names in core and utils
rg -n --hidden --no-ignore -S 'CLP_DB_USER|CLP_DB_PASS' || true
# Also look for getenv calls that may read them implicitly
rg -n --hidden --no-ignore -S 'getenv\\("CLP_DB_(USER|PASS)"\\)' components/core/src || trueLength of output: 5718
Confirmed: core honours CLP_DB_USER / CLP_DB_PASS — optional hardening suggested
Confirmed — the repo reads CLP_DB_USER and CLP_DB_PASS (core uses getenv), so the env assignments in decompress.py are correct. I recommend optionally guarding against empty username/password to avoid overriding existing env vars with empty values.
Files to note:
- components/core/src/clp/GlobalMetadataDBConfig.cpp — getenv("CLP_DB_USER") (line 122), getenv("CLP_DB_PASS") (line 128); throws if not set (lines 178/182).
- components/core/tools/scripts/db/init-db.py — username/password via os.environ["CLP_DB_USER"/"CLP_DB_PASS"] (lines 35–36).
- components/clp-package-utils/clp_package_utils/scripts/native/decompress.py — sets CLP_DB_USER/CLP_DB_PASS (lines 247–248).
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py — sets env (lines 962–963).
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py — sets CLP_DB_USER/CLP_DB_PASS (lines 214–215).
- components/webui/server/src/plugins/external/mysql.ts — reads CLP_DB_PASS / CLP_DB_USER (lines 22, 25).
- docs/src/user-guide/core-unstructured/clp.md — documents these env vars (lines 131–132).
- components/core/tests/test-GlobalMetadataDBConfig.cpp — tests that set/unset these vars.
Suggested (optional) hardening — only set env overrides when values are non-empty:
- extract_env = {
- **os.environ,
- "CLP_DB_USER": clp_db_connection_params["username"],
- "CLP_DB_PASS": clp_db_connection_params["password"],
- }
+ extract_env = {**os.environ}
+ if clp_db_connection_params.get("username"):
+ extract_env["CLP_DB_USER"] = clp_db_connection_params["username"]
+ if clp_db_connection_params.get("password"):
+ extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| extract_env = { | |
| **os.environ, | |
| "CLP_DB_USER": clp_db_connection_params["username"], | |
| "CLP_DB_PASS": clp_db_connection_params["password"], | |
| } | |
| extract_env = {**os.environ} | |
| if clp_db_connection_params.get("username"): | |
| extract_env["CLP_DB_USER"] = clp_db_connection_params["username"] | |
| if clp_db_connection_params.get("password"): | |
| extract_env["CLP_DB_PASS"] = clp_db_connection_params["password"] |
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
around lines 245 to 249, the code unconditionally writes CLP_DB_USER/CLP_DB_PASS
into the environment which can override existing values with empty strings;
change it to only set/override os.environ entries when the corresponding
clp_db_connection_params["username"] or ["password"] is non-empty (e.g. check
for None/empty after stripping) so existing env vars are preserved if the
provided values are empty.
| return -1 | ||
|
|
||
| job_command = parsed_args.command | ||
| if EXTRACT_IR_CMD == job_command and StorageEngine.CLP != storage_engine: |
There was a problem hiding this comment.
This is added to prevent fail after job submission
| # fmt: on | ||
|
|
||
| if EXTRACT_IR_CMD == job_command: | ||
| if hasattr(parsed_args, "dataset"): |
There was a problem hiding this comment.
Do we need to check this?
My understanding is that since we don;t explicilty add dataset when target is IR, the script will return an error if we try to specify --dataset on the cmdline if target is IR
but I could be wrong.
There was a problem hiding this comment.
right no need to check this. the attr won't be added for command i anyways
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
231-235: Drop redundant None-check when appending --datasetGiven you default dataset to CLP_DEFAULT_DATASET_NAME, dataset is never None here. Simplify by appending unconditionally.
- if dataset is not None: - extract_cmd.append("--dataset") - extract_cmd.append(dataset) + extract_cmd.append("--dataset") + extract_cmd.append(dataset)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled
- Linear integration is disabled
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
📚 Learning: 2025-08-14T05:06:58.977Z
Learnt from: junhaoliao
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:58.977Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.py
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/decompress.py
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
StorageEngine(79-81)get_clp_connection_params_and_type(209-230)components/clp-package-utils/clp_package_utils/general.py (1)
validate_dataset_name(556-582)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
184-186: Good guard: prevent unsupported IR jobs for non-CLP enginesEarly-returning here avoids submitting work that would inevitably fail, matching the stated objective to block IR extraction on CLP-S.
| dataset = parsed_args.dataset | ||
| dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset | ||
| try: | ||
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | ||
| validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) | ||
| except Exception as e: | ||
| logger.error(e) | ||
| return -1 |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Dataset handling now confined to JSON path — tighten error handling/logging
This resolves the prior AttributeError (aligns with the retrieved learning) by keeping dataset usage solely in the JSON branch. Two small improvements:
- Scope the exception to ValueError (what validate_dataset_name raises) and log with context for easier troubleshooting.
- Optional: add a quick verification to ensure no remaining IR-path references to parsed_args.dataset.
Apply this diff to improve error handling:
- try:
- clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
- validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
- except Exception as e:
- logger.error(e)
- return -1
+ try:
+ clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+ validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+ except ValueError as e:
+ logger.error(f"Invalid dataset `{dataset}`: {e}")
+ return -1Run this script to verify there are no IR-path references to parsed_args.dataset:
🏁 Script executed:
#!/bin/bash
# Ensure dataset is only referenced in the JSON branch of decompress.py
set -euo pipefail
echo "Occurrences of 'parsed_args.dataset' with context:"
rg -n -A 3 -B 3 $'parsed_args\\.dataset' components/clp-package-utils/clp_package_utils/scripts/decompress.py
echo
echo "Check the IR parser does not define --dataset:"
rg -n -A 5 -B 5 $'ir_extraction_parser' components/clp-package-utils/clp_package_utils/scripts/decompress.pyLength of output: 1780
Tighten dataset error handling in JSON extraction branch
Verified: rg shows parsed_args.dataset only used in the JSON branch (decompress.py lines ~221–225); ir_extraction_parser does not define --dataset (decompress.py lines ~282–293). Apply the narrower exception handling and improved log message below.
Files to update
- components/clp-package-utils/clp_package_utils/scripts/decompress.py — JSON extraction branch (around lines 221–225)
- IR parser (around lines 282–293) — no changes required (no --dataset present)
Suggested diff:
- try:
- clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
- validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
- except Exception as e:
- logger.error(e)
- return -1
+ try:
+ clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True)
+ validate_dataset_name(clp_db_connection_params["table_prefix"], dataset)
+ except ValueError as e:
+ logger.error(f"Invalid dataset `{dataset}`: {e}")
+ return -1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| dataset = parsed_args.dataset | |
| dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset | |
| try: | |
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | |
| validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) | |
| except Exception as e: | |
| logger.error(e) | |
| return -1 | |
| dataset = parsed_args.dataset | |
| dataset = CLP_DEFAULT_DATASET_NAME if dataset is None else dataset | |
| try: | |
| clp_db_connection_params = clp_config.database.get_clp_connection_params_and_type(True) | |
| validate_dataset_name(clp_db_connection_params["table_prefix"], dataset) | |
| except ValueError as e: | |
| logger.error(f"Invalid dataset `{dataset}`: {e}") | |
| return -1 |
kirkrodrigues
left a comment
There was a problem hiding this comment.
For the PR title, how about:
fix(package): Update `native/decompress.py` to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).
native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Fallback to dataset=None when IR extraction CLI parser is used (fixes #1200).native/decompress.py to use CLI args and env vars for DB config (fixes #1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes #1200).
…ars for DB config (fixes y-scope#1199); Prevent IR extraction with clp-s storage engine and only use dataset argument for JSON extraction (fixes y-scope#1200). (y-scope#1202)
Description
This PR addresses two issues:
Fixes
native/decompress.pystill uses deprecated --db-config-file argument instead of command-line database parameters. #1199: Thenative/decompress.pyscript was still using the deprecated--db-config-fileargument instead of command-line database parameters. This caused runtime errors since the core CLP components no longer accept the--db-config-fileargument after the refactoring in feat(core)!: Replace YAML config with CLI args and env vars for metadata DB (resolves #1146). #1148. This PR updatednative/decompress.pyto pass database connection parameters as command-line arguments and environment variables instead of using a temporary YAML config file.Fixes IR extraction parser missing --dataset argument causes AttributeError #1200: The IR extraction parser was missing the
--datasetargument, causing anAttributeErrorwhen trying to accessparsed_args.datasetin thehandle_extract_stream_cmdfunction (i.e.,sbin/decompress.sh i). The fix usesgetattrwith a fallback toNonewhen the dataset attribute is not available.Check if engine type before submitting an IR extraction job. If type is clp-s, error out and do not submit.
Checklist
breaking change.
Validation performed
task packageFor CLP-Text
cd build/clp-package/sbin./compress.sh ~/samples/hive-24hr./decompress.sh x./decompress.sh i --orig-file-id <id> 0./decompress.sh j a8722927-820e-4d04-9b0e-bfb739fb91c9->2025-08-14T15:17:20.637 ERROR [decompress] JSON extraction is not supported for storage engineclp.For CLP-JSON
./compress.sh ~/samples/postgresql/ --timestamp-key=timestamp./decompress.sh x->2025-08-14T15:21:51.715 ERROR [decompress] File extraction is not supported for archive storage typefswith storage engineclp-s../decompress.sh i --orig-file-id <id> 0->2025-08-14T15:24:21.453 ERROR [decompress] IR extraction is not supported for storage engineclp-s../decompress.sh j <id>Summary by CodeRabbit
New Features
Bug Fixes