feat(clp-package)!: Add Spider config interfaces; Integrate Spider compression orchestration into the Docker Compose project.#1606
Conversation
…d to wait for all services become healthy before existing
…consistent type-safe environment variable management; group the env vars; replace `Dict` usages with `dict`.
…rocess.run` and update docstrings accordingly
…thods and functions
…n start/stop scripts
…configuration file path
…S_* and CLP_DEFAULT_STREAM_* -> CLP_DEFAULT_STREAMS_*.
…n validation functions.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (5)
components/package-template/src/etc/clp-config.template.text.yaml (2)
28-30: ❌ Add contextual comments to the nested database names structure.The nested
namesfield structure is unclear. Users may not understand why the structure changed or what theclpandspiderkeys represent. Consider adding a brief inline comment explaining that these map logical component names to actual database instances.Apply this diff to clarify the mapping:
# port: 3306 -# names: +# # Database names for different components +# names: # clp: "clp-db" # spider: "spider-db"
35-35: ❌ Expand the type field comment to clarify orchestration options.The inline comment is too brief. Users need guidance on when to use each orchestration option. Consider expanding the comment to briefly describe "celery" and "spider" and their intended use cases.
Apply this diff to provide clearer guidance:
# logging_level: "INFO" -# type: "celery" # "celery" or "spider" +# # Orchestration engine: "celery" (default, single/multi-worker) or "spider" (distributed multi-node compression) +# type: "celery"tools/deployment/package/docker-compose-all.yaml (1)
317-349:⚠️ Résoudre les noms d'hôte en double et l'incohérence de spécification d'hôte.Le service
spider-compression-workerutilise le même nom d'hôte"compression_worker"que le servicecompression-worker(ligne 284). De plus, la commande (ligne 343) spécifie"--host", "compression_worker"en dur, ce qui est le nom d'hôte du worker Celery existant, pas du worker Spider.Cela crée deux problèmes :
- Ambiguïté dans les journaux et les métriques : Les deux services partageant le même nom d'hôte rendront la débogage difficile.
- Identité de service incorrecte : Le worker Spider utilisera l'identité du worker Celery auprès du scheduler Spider, ce qui pourrait causer une confusion de routage et des problèmes de suivi des tâches.
Appliquer ce correctif pour donner au worker Spider un nom d'hôte unique :
spider-compression-worker: <<: *service_defaults - hostname: "compression_worker" + hostname: "spider_compression_worker" environment: CLP_CONFIG_PATH: "/etc/clp-config.yaml" CLP_HOME: "/opt/clp" CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}" CLP_LOGS_DIR: "/var/log/compression_worker" CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log" PYTHONPATH: "/opt/clp/lib/python3/site-packages" SPIDER_LOG_DIR: "/var/log/compression_worker" command: [ "python3", "-u", "/opt/clp/lib/python3/site-packages/clp_py_utils/start-spider-worker.py", - "--host", "compression_worker" + "--host", "spider_compression_worker", "--num-workers", "${CLP_COMPRESSION_WORKER_CONCURRENCY:-1}", "--storage-url", "jdbc:mariadb://database:3306/\ ${SPIDER_DB_NAME:-spider-db}?\ user=${SPIDER_DB_USER:?Please set a value.}\ &password=${SPIDER_DB_PASS:?Please set a value.}", ]components/clp-package-utils/clp_package_utils/controller.py (1)
187-205: Guard Spider DB credential export by orchestration typeLines 201 and 204 unconditionally export
SPIDER_DB_PASS_ENV_VAR_NAMEandSPIDER_DB_USER_ENV_VAR_NAMEby accessingcredentials[ClpDbUserType.SPIDER], whileSPIDER_DB_NAMEat line 194 is only set whencompression_scheduler.type == OrchestrationType.SPIDER.On Celery-only deployments where Spider orchestration is not configured, the
credentials[ClpDbUserType.SPIDER]entry may not exist, which would raise aKeyErrorand breakset_up_env()even though Spider isn't being used.Gate the Spider credential environment variables on the orchestration type, similar to how you gate
SPIDER_DB_NAME:env_vars |= { CLP_DB_PASS_ENV_VAR_NAME: credentials[ClpDbUserType.CLP].password, CLP_DB_ROOT_PASS_ENV_VAR_NAME: credentials[ClpDbUserType.ROOT].password, - SPIDER_DB_PASS_ENV_VAR_NAME: credentials[ClpDbUserType.SPIDER].password, CLP_DB_ROOT_USER_ENV_VAR_NAME: credentials[ClpDbUserType.ROOT].username, CLP_DB_USER_ENV_VAR_NAME: credentials[ClpDbUserType.CLP].username, - SPIDER_DB_USER_ENV_VAR_NAME: credentials[ClpDbUserType.SPIDER].username, } + if self._clp_config.compression_scheduler.type == OrchestrationType.SPIDER: + env_vars[SPIDER_DB_PASS_ENV_VAR_NAME] = credentials[ClpDbUserType.SPIDER].password + env_vars[SPIDER_DB_USER_ENV_VAR_NAME] = credentials[ClpDbUserType.SPIDER].usernameThis keeps Spider credentials required only when Spider orchestration is enabled and avoids breaking Celery-only deployments.
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
268-270: Critical: Parameterize the Spider DB password in CREATE USER statementThe
CREATE USERstatement at line 269 embedsspider_db_passworddirectly into an f-string, which creates two concrete problems:
- Correctness: passwords containing quotes or special characters will break the SQL statement and cause runtime failures
- Security hygiene: this bypasses the project's standard practice of using parameterized SQL with
%splaceholders and tuple arguments to prevent injection (as documented in learnings from prior work on this repository)Database and user names are already validated by
_validate_nameand can remain safely interpolated. The password must be passed as a bound parameter:db_cursor.execute( - f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'""" + f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""", + (spider_db_password,) )This lets the MariaDB driver handle escaping the secret value correctly, consistent with the rest of the codebase. Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (8)
components/clp-package-utils/clp_package_utils/controller.py(15 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(16 hunks)components/clp-py-utils/clp_py_utils/initialize-spider-db.py(1 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(2 hunks)components/job-orchestration/job_orchestration/executor/start-spider-worker.py(1 hunks)components/package-template/src/etc/clp-config.template.json.yaml(1 hunks)components/package-template/src/etc/clp-config.template.text.yaml(1 hunks)tools/deployment/package/docker-compose-all.yaml(5 hunks)
🧰 Additional context used
🧠 Learnings (32)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/package-template/src/etc/clp-config.template.text.yamlcomponents/package-template/src/etc/clp-config.template.json.yamlcomponents/clp-package-utils/clp_package_utils/controller.pycomponents/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/package-template/src/etc/clp-config.template.json.yamlcomponents/job-orchestration/job_orchestration/executor/compress/compression_task.pytools/deployment/package/docker-compose-all.yamlcomponents/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
components/job-orchestration/job_orchestration/executor/start-spider-worker.py
📚 Learning: 2025-07-03T20:10:43.789Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Applied to files:
components/job-orchestration/job_orchestration/executor/start-spider-worker.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/job-orchestration/job_orchestration/executor/start-spider-worker.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Applied to files:
components/job-orchestration/job_orchestration/executor/start-spider-worker.pycomponents/clp-package-utils/clp_package_utils/controller.pycomponents/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.pytools/deployment/package/docker-compose-all.yamlcomponents/clp-package-utils/clp_package_utils/controller.pycomponents/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.pycomponents/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.pytools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-12-04T03:31:55.239Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
tools/deployment/package/docker-compose-all.yamlcomponents/clp-package-utils/clp_package_utils/controller.pycomponents/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-11-28T15:12:53.530Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.530Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
Repo: y-scope/clp PR: 1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2024-11-15T16:22:33.635Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:90-96
Timestamp: 2024-11-15T16:22:33.635Z
Learning: In the CLP codebase, `table_prefix` is a trusted configuration value and can be safely embedded into SQL queries.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-06-24T20:13:46.758Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 939
File: components/package-template/src/etc/clp-config.yml:64-64
Timestamp: 2025-06-24T20:13:46.758Z
Learning: When users ask CodeRabbit to create an issue after providing suggestions, they want a GitHub issue created with the high-level requirements and context, not specific code implementations.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.pycomponents/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-29T14:04:13.769Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
🧬 Code graph analysis (4)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
ClpDbUserType(193-198)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (7)
ClpConfig(779-1060)ClpDbNameType(201-205)OrchestrationType(134-136)ClpDbUserType(193-198)BundledService(116-120)DeploymentType(101-105)get_deployment_type(983-990)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
ClpDbNameType(201-205)ClpDbUserType(193-198)components/clp-py-utils/clp_py_utils/core.py (1)
read_yaml_config_file(58-64)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
get_config_value(28-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: rust-checks
🔇 Additional comments (11)
components/package-template/src/etc/clp-config.template.text.yaml (1)
37-40: ✅ Spider scheduler configuration is now complete.The
spider_schedulerblock now includeslogging_levelfor consistency with other scheduler configurations (compression_schedulerandquery_scheduler). The structure and fields align well with the rest of the template.components/job-orchestration/job_orchestration/executor/start-spider-worker.py (1)
99-117: LGTM!The
wait_exitfunction correctly handles process termination with timeout and SIGKILL fallback. The return of the first non-zero exit code is a sensible approach.components/package-template/src/etc/clp-config.template.json.yaml (2)
34-36: ✓ Database configuration structure correctly reflects breaking change.The nested
namesmapping withclpandspiderkeys properly implements the documented breaking change fromdatabase.nametodatabase.names.clp. Default values are reasonable ("clp-db"and"spider-db"), and the structure maintains consistency with support for multiple database backends.
44-47: ✓ Spider scheduler configuration is complete and consistent.The
spider_schedulerblock includeshost,port(6000), andlogging_levelfields that are consistent with thequery_schedulerstructure. Port selection is appropriate and distinct from other scheduler services. Previous review feedback about field completeness has been properly addressed.tools/deployment/package/docker-compose-all.yaml (5)
42-45: ✓ Volume definition follows established patterns.The new
clp-tmpvolume anchor is correctly defined with appropriate bind-mount configuration and defaults, consistent with existing volume definitions in the file.
96-97: ✓ Spider database credentials properly exposed to table creator.The environment variables follow the established pattern with required-value syntax (
:?), ensuring deployment fails with a clear error if credentials are not provided.
227-247: ✓ Spider-scheduler service is properly configured with consistent port mapping.The port mapping (target
6000matches the hardcoded--port 6000in the command) and storage URL are correct. The service depends ondb-table-creatorwith the appropriate condition.
249-280: ✓ Graceful shutdown period appropriately extended for compression scheduler.The
stop_grace_periodincrease to 300 seconds allows sufficient time for in-flight compression tasks to complete before forced termination, improving system reliability.
282-315: ✓ Temporary volume correctly added to compression worker.The
volume_clp_tmpmount is a necessary addition for worker processes to store intermediate files, and the binding is appropriate.components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
610-624: Good fix: error-path now returns the same serialised shape as the success-pathSwitching the config-load failure case to return
CompressionTaskResult(...).model_dump()aligns the error-path return type with the final return at the end ofcompression_entry_point. This removes an inconsistency where callers previously received a Pydantic model instance only on this early-failure path, and a plain dict otherwise. No further changes needed here.
184-202: Normalise DB config before indexingnamesto avoid enum/string key mismatchesHere you index the raw config as
clp_metadata_db_connection_config["names"][ClpDbUserType.CLP]. If this dict was produced via serialisation (e.g.,Database.model_dump(mode="json")), the keys undernamesare likely plain strings such as"clp", notClpDbUserTypemembers, which would cause aKeyErrorat runtime when building theclp/indexercommands.You are already using
Database.model_validatein_get_db_connection_env_vars_for_clp_cmd; doing the same here would both de-couple this code from the serialisation format and ensure keys are coerced toClpDbUserTypesafely:def _get_db_connection_args_for_clp_cmd( clp_metadata_db_connection_config: dict[str, Any], ) -> list[str]: @@ - :param clp_metadata_db_connection_config: - :return: List of database connection arguments for the clp command. - """ - return [ + :param clp_metadata_db_connection_config: + :return: List of database connection arguments for the clp command. + """ + db_config = Database.model_validate(clp_metadata_db_connection_config) + return [ @@ - "--db-name", - clp_metadata_db_connection_config["names"][ClpDbUserType.CLP], + "--db-name", + db_config.names[ClpDbUserType.CLP],Please also double-check how
clp_metadata_db_connection_configis constructed (Python vs JSON mode) to confirm there is no existing data using string keys that would be incompatible with the current indexing.
Co-authored-by: Junhao Liao <junhao@junhao.ca>
| names: dict[ClpDbNameType, NonEmptyStr] = { | ||
| ClpDbNameType.CLP: "clp-db", |
There was a problem hiding this comment.
note this interface change also needs to be ported to the K8s Helm chart at tools/deployment/package-helm. @coderabbitai submit an issue
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gh issue create \
--title "Port database configuration interface change to K8s Helm chart" \
--body "## Context
In PR #1606, the database configuration interface was updated with a breaking change:
- **Old:** \`database.name\` (single string value)
- **New:** \`database.names\` (dictionary with \`clp\` and \`spider\` keys)
This change allows separate database names for CLP and Spider components.
## Task
Port this interface change to the K8s Helm chart located at \`tools/deployment/package-helm\`.
## References
- PR: https://github.com/y-scope/clp/pull/1606
- Comment: https://github.com/y-scope/clp/pull/1606#discussion_r2633528642
- Requested by: @junhaoliao" \
--assignee junhaoliao
Length of output: 88
🏁 Script executed:
#!/bin/bash
gh issue create \
--title "Port database configuration interface change to K8s Helm chart" \
--body "## Context
In PR #1606, the database configuration interface was updated with a breaking change:
- **Old:** \`database.name\` (single string value)
- **New:** \`database.names\` (dictionary with \`clp\` and \`spider\` keys)
This change allows separate database names for CLP and Spider components.
## Task
Port this interface change to the K8s Helm chart located at \`tools/deployment/package-helm\`.
## References
- PR: https://github.com/y-scope/clp/pull/1606
- Comment: https://github.com/y-scope/clp/pull/1606#discussion_r2633528642
- Requested by: @junhaoliao" \
--assignee junhaoliao
Length of output: 42
@junhaoliao, I've created issue #1811 to track porting the database configuration interface change to the K8s Helm chart. The issue has been assigned to you and includes references to this PR and comment.
Issue link: #1811
🧬 Code Graph Analysis Results
### File: `components/clp-py-utils/clp_py_utils/clp_config.py`<possibly_relevant_code_snippets>
Possibly relevant snippets from components/clp-py-utils/clp_py_utils/serialization_utils.py
At line range 16-23:
def serialize_path(path: pathlib.Path) -> str:
"""
Serializes a `pathlib.Path` to its string representation.
:param path:
:return: The string representation of the path.
"""
return str(path)
Possibly relevant snippets from components/clp-py-utils/clp_py_utils/serialization_utils.py
At line range 6-13:
def serialize_str_enum(member: StrEnum) -> str:
"""
Serializes a `strenum.StrEnum` member to its underlying value.
:param member:
:return: The underlying string value of the enum member.
"""
return member.value
Possibly relevant snippets from components/clp-py-utils/clp_py_utils/core.py
At line range 28-42:
def get_config_value(config, key):
"""
Gets a value from the given dictionary using a dot-separated configuration
key, where each dot represents a deeper dictionary. NOTE: This method does
not support keys that contain dots since that is indistinguishable from a
deeper dictionary.
:param config:
:param key:
"""
singular_keys = key.split(".")
current_config = config
for current_key in singular_keys:
current_config = current_config[current_key]
return current_config
At line range 45-55:
def make_config_path_absolute(default_root: pathlib.Path, config_path: pathlib.Path):
"""
Turns relative paths into absolute paths by prefixing them with the
default_root
:param default_root:
:param config_path:
"""
if config_path.is_absolute():
return config_path
return default_root / config_path
At line range 58-64:
def read_yaml_config_file(yaml_config_file_path: pathlib.Path):
with open(yaml_config_file_path, "r") as yaml_config_file:
try:
config = yaml.load(yaml_config_file, Loader=yaml.CSafeLoader)
except ParserError as ex:
raise ValueError(f"Unable to parse configuration from {yaml_config_file_path}: {ex}")
return config
At line range 82-103:
def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
"""
Resolves a host path and translates it into its container-mount equivalent absolute path. See
`resolve_host_path` for details about the resolution.
:param host_path: The host path.
:return: The translated path (with /mnt/host prefix).
"""
host_path = host_path.expanduser()
if not host_path.is_absolute():
pwd_host = os.getenv("CLP_PWD_HOST")
if pwd_host is None:
raise ValueError("Relative host path provided but CLP_PWD_HOST is not set.")
host_path = pathlib.Path(pwd_host) / host_path
host_path = host_path.absolute()
resolved_path = _resolve_symlinks_in_path(host_path)
if resolved_path is not None:
return resolved_path
return CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")
At line range 106-113:
def validate_path_could_be_dir(path: pathlib.Path):
part = path
while True:
if part.exists():
if not part.is_dir():
raise ValueError(f"{part} is not a directory.")
return
part = part.parent
</possibly_relevant_code_snippets>
🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
There was a problem hiding this comment.
a similar issue is found in the MCP server, which is more severe as a bug that prevents the MCP server from starting:
junhao@ASUS-X870E:~/workspace/clp$ docker logs clp-package-de74-mcp-server-1
2025-12-19 05:20:56,882 mcp_server [ERROR] Failed to start MCP server.
Traceback (most recent call last):
File "/opt/clp/lib/python3/site-packages/clp_mcp_server/clp_mcp_server.py", line 84, in main
mcp = create_mcp_server(clp_config)
File "/opt/clp/lib/python3/site-packages/clp_mcp_server/server/server.py", line 33, in create_mcp_server
connector = ClpConnector(clp_config)
File "/opt/clp/lib/python3/site-packages/clp_mcp_server/clp_connector.py", line 35, in __init__
"db": clp_config.database.name,
File "/opt/clp/lib/python3/site-packages/pydantic/main.py", line 1026, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'Database' object has no attribute 'name'. Did you mean: 'names'?@coderabbitai submit another issue
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gh issue create \
--title "MCP server fails to start due to database.name interface change" \
--body "## Problem
The MCP server fails to start with an AttributeError after the database configuration interface change in PR #1606.
## Error
\`\`\`
2025-12-19 05:20:56,882 mcp_server [ERROR] Failed to start MCP server.
Traceback (most recent call last):
File \"/opt/clp/lib/python3/site-packages/clp_mcp_server/clp_mcp_server.py\", line 84, in main
mcp = create_mcp_server(clp_config)
File \"/opt/clp/lib/python3/site-packages/clp_mcp_server/server/server.py\", line 33, in create_mcp_server
connector = ClpConnector(clp_config)
File \"/opt/clp/lib/python3/site-packages/clp_mcp_server/clp_connector.py\", line 35, in __init__
\"db\": clp_config.database.name,
File \"/opt/clp/lib/python3/site-packages/pydantic/main.py\", line 1026, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'Database' object has no attribute 'name'. Did you mean: 'names'?
\`\`\`
## Root Cause
In PR #1606, the database configuration was changed from:
- **Old:** \`database.name\` (single string)
- **New:** \`database.names\` (dict with \`ClpDbNameType.CLP\` and \`ClpDbNameType.SPIDER\` keys)
The MCP server code at \`clp_mcp_server/clp_connector.py:35\` still uses the old interface.
## Solution
Update the MCP server to use \`clp_config.database.names[ClpDbNameType.CLP]\` instead of \`clp_config.database.name\`.
## References
- PR: https://github.com/y-scope/clp/pull/1606
- Comment: https://github.com/y-scope/clp/pull/1606#discussion_r2633528642
- Reported by: @junhaoliao" \
--assignee junhaoliao \
--label bug
Length of output: 42
@junhaoliao, I've created issue #1820 to track the MCP server bug. This is indeed a critical issue since it prevents the MCP server from starting entirely due to the database.name → database.names interface change.
The fix will be to update clp_mcp_server/clp_connector.py line 35 to use clp_config.database.names[ClpDbNameType.CLP] instead of clp_config.database.name.
Issue link: #1820
🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: 20001020ycx
Repo: y-scope/clp PR: 1368
File: components/clp-mcp-server/clp_mcp_server/__init__.py:11-15
Timestamp: 2025-10-02T15:48:58.961Z
Learning: In the clp-mcp-server component (components/clp-mcp-server/clp_mcp_server/__init__.py), the default host binding of 0.0.0.0 is intentional because the server is designed to be deployed in Docker containers where this binding is necessary to accept external connections.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
…mpression orchestration into the Docker Compose project. (y-scope#1606) Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Description
This PR integrates spider compression orchestration into docker compose package by
ClpConfig. Minor cleanup for unnecessarydump_to_primitive_type.This PR also:
mariadbpython dependency to>=1.1.14.spider-pyversion to0.2.0.Spiderto latest version for fix in starting python workers.This PR is a breaking change because it updates the dependencies versions and and switches the CLP database name configuration in
clp-config.yamlfromdatabase.nametodatabase.names.clp.The doc change will be put in #1647.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.