Skip to content

feat(clp-package): Support selectively enabling 3rd-party services via bundled config (resolves #1623).#1648

Merged
hoophalab merged 12 commits into
y-scope:mainfrom
hoophalab:bundled
Nov 27, 2025
Merged

feat(clp-package): Support selectively enabling 3rd-party services via bundled config (resolves #1623).#1648
hoophalab merged 12 commits into
y-scope:mainfrom
hoophalab:bundled

Conversation

@hoophalab

@hoophalab hoophalab commented Nov 21, 2025

Copy link
Copy Markdown
Contributor

Description

Resolves #1623 using the suggested approach

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  1. enable all bundled services

  2. compress a file

  3. search in webui works.

  4. remove one service

  5. it isn't boot when running start-clp

Summary by CodeRabbit

  • New Features

    • Add configurable "bundled" setting to select which services (database, queue, redis, results_cache) are bundled by default.
    • Docker Compose per-service replica controls via environment flags (0/1) to enable/disable services at deploy time.
    • Environment variables now reflect bundling vs runtime paths and auto-disable services not listed in the bundled set.
  • Bug Fixes

    • Removed duplicate api-server entry in Docker Compose.

✏️ Tip: You can customize this high-level summary in your review settings.

@hoophalab hoophalab requested a review from a team as a code owner November 21, 2025 17:40
@coderabbitai

coderabbitai Bot commented Nov 21, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Splits per-service environment setup into bundling vs runtime paths, adds a BundledService enum and ClpConfig.bundled, updates controller methods to emit CLP_ENABLED flags for non-bundled services and populate runtime host/port/creds when bundled, updates default clp-config, deduplicates api-server, and drives deploy.replicas by CLP*_ENABLED.

Changes

Cohort / File(s) Summary
Config enum & model
components/clp-py-utils/clp_py_utils/clp_config.py
Added BundledService (lowercase string enum) and BundledServiceStr; extended ClpConfig with bundled: list[BundledServiceStr] defaulting to [database, queue, redis, results_cache].
Controller — bundling vs runtime split
components/clp-package-utils/clp_package_utils/controller.py
Imported BundledService; added per-service bundling methods (_set_up_env_for_database_bundling, _set_up_env_for_queue_bundling, _set_up_env_for_redis_bundling, _set_up_env_for_results_cache_bundling) that set CLP_<SERVICE>_ENABLED = "0" when not bundled and path/host vars for bundling path; retained/adjusted runtime methods (_set_up_env_for_database, _set_up_env_for_queue, _set_up_env_for_redis, _set_up_env_for_results_cache) to populate runtime host/port/creds/paths; set_up_env now aggregates bundling envs first then non-bundling envs.
Config template
components/package-template/src/etc/clp-config.yaml
Added top-level bundled key with ["database","queue","redis","results_cache"].
docker-compose files
tools/deployment/package/docker-compose.yaml
tools/deployment/package/docker-compose-all.yaml
docker-compose.yaml: removed duplicate api-server block (single instance retained). docker-compose-all.yaml: added deploy.replicas entries for database, queue, redis, results-cache driven by CLP_DATABASE_ENABLED, CLP_QUEUE_ENABLED, CLP_REDIS_ENABLED, CLP_RESULTS_CACHE_ENABLED (defaults to 1) so replicas respect enabled flags.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant U as User
    participant C as DockerComposeController
    participant CFG as ClpConfig
    participant ENV as Environment
    participant DC as Docker Compose

    U->>C: set_up_env()
    C->>CFG: read bundled list
    CFG-->>C: bundled list

    par Bundling checks
        C->>C: _set_up_env_for_database_bundling()
        alt database NOT in bundled
            C->>ENV: CLP_DATABASE_ENABLED = "0"
        else
            C->>ENV: leave/enable database
        end
        C->>C: _set_up_env_for_queue_bundling()
        alt queue NOT in bundled
            C->>ENV: CLP_QUEUE_ENABLED = "0"
        end
        C->>C: _set_up_env_for_redis_bundling()
        alt redis NOT in bundled
            C->>ENV: CLP_REDIS_ENABLED = "0"
        end
        C->>C: _set_up_env_for_results_cache_bundling()
        alt results_cache NOT in bundled
            C->>ENV: CLP_RESULTS_CACHE_ENABLED = "0"
        end
    and Runtime env population
        C->>C: _set_up_env_for_database() — set host/port/creds/paths when runtime used
        C->>C: _set_up_env_for_queue() — uses DB config when applicable
        C->>C: _set_up_env_for_redis() — set host/port/creds/paths when runtime used
        C->>C: _set_up_env_for_results_cache() — set host/port/db when runtime used
    end

    C->>DC: run compose (deploy.replicas use CLP_*_ENABLED)
    DC-->>U: services started/stopped per flags
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • components/clp-package-utils/clp_package_utils/controller.py: correctness of bundling vs runtime branching, env var names/values (especially "0" for disabled), and consistency across added methods.
    • components/clp-py-utils/clp_py_utils/clp_config.py: enum and serializer usage, default list handling.
    • tools/deployment/package/docker-compose-all.yaml: correctness of deploy.replicas expressions and behaviour when CLP_*_ENABLED = "0".
    • tools/deployment/package/docker-compose.yaml: ensure api-server deduplication retained required configuration.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR implements all core requirements from issue #1623: adds bundled configuration array, creates bundling-specific setup functions for each service, and decouples deployment from environment setup.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing bundled service configuration. Docker Compose modifications to add api-server and enable/disable logic for services are within scope.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title directly and clearly describes the main feature: selective enabling of 3rd-party services via a 'bundled' config option, which matches the comprehensive changes across configuration files and controller logic in the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@hoophalab hoophalab changed the title feat(clp-package): Allow users to specify which 3rd-party services to… feat(clp-package): Allow users to specify which 3rd-party services to start via bundled configuration (resolves #1623). Nov 21, 2025
@hoophalab

hoophalab commented Nov 21, 2025

Copy link
Copy Markdown
Contributor Author

@junhaoliao It looks like there is no way to specify exactly what to boot in docker compose. Say if we remove results_cache from the bundled list, the webui still depends on results_cache and results_cache still starts. Additionally, since webui needs the results_cache env, so we also cannot remove the env from it.

@hoophalab hoophalab requested a review from junhaoliao November 21, 2025 17:41

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)

785-799: Drop or use the local bundled variable in set_up_env

bundled = self._clp_config.bundled is currently unused. If you don’t plan to gate any env setup on bundled, you can remove this line to keep set_up_env leaner; otherwise, consider wiring it into whatever conditional logic you intend.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55044b5 and 6aefd0f.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/controller.py (5 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (2 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • tools/deployment/package/docker-compose.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • BundledService (105-109)
  • get_deployment_type (827-830)
  • DeploymentType (92-94)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

105-112: New BundledService enum and serializer look consistent

The enum values and BundledServiceStr alias integrate cleanly with existing pattern (e.g., StorageEngineStr) and match the lower‑case strings used in YAML and the controller. I don’t see behavioural issues here.

tools/deployment/package/docker-compose.yaml (1)

57-60: api-server service wiring matches controller expectations

Adding api-server here, extending from docker-compose-all.yaml, aligns with the new DOCKER_COMPOSE_BASE_SERVICES list in the controller and should avoid duplicate service definitions. No issues from my side.

Comment on lines +68 to +83
DOCKER_COMPOSE_BASE_SERVICES = [
"database",
"db-table-creator",
"queue",
"redis",
"results-cache",
"results-cache-indices-creator",
"compression-scheduler",
"compression-worker",
"webui",
"garbage-collector",
"api-server"
]

DOCKER_COMPOSE_EXTENDED_SERVICES = ["query-scheduler", "query-worker", "reducer", "mcp-server"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Fix global list mutation and return type in _get_docker_compose_services

A couple of issues here:

  • services = DOCKER_COMPOSE_BASE_SERVICES followed by services += DOCKER_COMPOSE_EXTENDED_SERVICES and multiple services.remove(...) calls mutates the module‑level lists in place. Repeated calls (or different controller instances) can end up with duplicated extended services or missing base services, and further remove calls may raise ValueError if an element was already removed.
  • _get_docker_compose_services is annotated to return str but actually returns a list[str], while start() relies on list semantics (cmd += ...). This will confuse type checkers and could mask future changes where the function accidentally returns a string.

Suggested adjustments:

# Keep these as true constants:
DOCKER_COMPOSE_BASE_SERVICES = [
    "database",
    "db-table-creator",
    "queue",
    "redis",
    "results-cache",
    "results-cache-indices-creator",
    "compression-scheduler",
    "compression-worker",
    "webui",
    "garbage-collector",
    "api-server",
]

DOCKER_COMPOSE_EXTENDED_SERVICES = [
    "query-scheduler",
    "query-worker",
    "reducer",
    "mcp-server",
]

def _get_docker_compose_services(self) -> list[str]:
    """
    :return: The Docker Compose services to start based on the config.
    """
    deployment_type = self._clp_config.get_deployment_type()

    services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES)
    if deployment_type == DeploymentType.FULL:
        services += DOCKER_COMPOSE_EXTENDED_SERVICES

    bundled = self._clp_config.bundled
    if BundledService.DATABASE not in bundled and "database" in services:
        services.remove("database")
    if BundledService.QUEUE not in bundled and "queue" in services:
        services.remove("queue")
    if BundledService.REDIS not in bundled and "redis" in services:
        services.remove("redis")
    if BundledService.RESULTS_CACHE not in bundled:
        services = [
            s for s in services if s not in ("results-cache", "results-cache-indices-creator")
        ]

    return services

This keeps the constants immutable, avoids surprising cross‑call side effects, and aligns the return type with how start() uses the result.

Also applies to: 819-825, 882-903


I need to verify the code issues identified. Let me examine the file and the function in question.


🏁 Script executed:

#!/bin/bash
# Examine the controller.py file to verify the issues
cd components/clp-package-utils/clp_package_utils

# First, show lines 60-90 to see the constants and function context
echo "=== Lines around 68-83 (constants and start of _get_docker_compose_services) ==="
sed -n '60,90p' controller.py

echo ""
echo "=== Look for the _get_docker_compose_services function definition ==="
rg -n "_get_docker_compose_services" controller.py -A 30 -B 2

echo ""
echo "=== Check lines 819-825 ==="
sed -n '815,830p' controller.py

echo ""
echo "=== Check lines 882-903 ==="
sed -n '880,905p' controller.py

Length of output: 5114


🏁 Script executed:

#!/bin/bash
# Search for all usages of the function and globals
cd components/clp-package-utils/clp_package_utils

echo "=== All calls to _get_docker_compose_services ==="
rg "_get_docker_compose_services" controller.py -n

echo ""
echo "=== Any direct accesses to DOCKER_COMPOSE_BASE_SERVICES ==="
rg "DOCKER_COMPOSE_BASE_SERVICES" controller.py -n

echo ""
echo "=== Any direct accesses to DOCKER_COMPOSE_EXTENDED_SERVICES ==="
rg "DOCKER_COMPOSE_EXTENDED_SERVICES" controller.py -n

Length of output: 533


Fix return type annotation and global list mutation in _get_docker_compose_services

Verification confirms both issues:

  1. Return type mismatch: Line 882 declares -> str but line 903 returns a list. Line 825 uses the result with list semantics (cmd += ...). Type checkers will flag this.

  2. Global list mutation: Line 889 assigns services = DOCKER_COMPOSE_BASE_SERVICES (reference to global), then lines 891, 895, 897, 899–902 perform in-place mutations (.remove(), +=). This mutates the module-level lists. Repeated calls or multiple controller instances will hit ValueError when attempting to remove already-removed services.

The suggested refactor is correct:

  • Create a local copy: services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES)
  • Extend conditionally: services += DOCKER_COMPOSE_EXTENDED_SERVICES
  • Filter immutably: use list comprehension for results-cache removal
  • Fix annotation: -> list[str]

Apply changes at lines 882–903 (function definition).

🤖 Prompt for AI Agents
components/clp-package-utils/clp_package_utils/controller.py around lines
882-903, the function currently declares -> str but returns a list and mutates
module-level lists; change the return annotation to -> list[str], create a local
copy services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES) (do not assign the
global), conditionally extend with DOCKER_COMPOSE_EXTENDED_SERVICES via services
+= DOCKER_COMPOSE_EXTENDED_SERVICES, remove the "results-cache" entry immutably
using a list comprehension (e.g. services = [s for s in services if s !=
"results-cache"]) instead of .remove(), and return the local services list so
globals remain unmodified.

Comment thread components/clp-py-utils/clp_py_utils/clp_config.py
Comment on lines +17 to +18
#bundled: ["database", "queue", "redis", "results_cache"]
#

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider documenting what bundled controls in the template

The example list is correct and matches the enum/controller logic, but it may not be obvious that this controls which 3rd‑party backing services CLP will start vs expect externally. Consider adding a short comment above, e.g., “# Which third‑party services CLP should deploy (omit to use external services instead)”, to reduce user confusion.

🤖 Prompt for AI Agents
In components/package-template/src/etc/clp-config.yaml around lines 17 to 18,
add a brief explanatory comment above the bundled key that states this list
controls which third‑party services CLP will deploy versus expect externally
(e.g., “Which third‑party services CLP should deploy; omit to use external
services instead”), so users understand the flag’s effect and when to remove it;
keep the comment short and placed immediately above the example list.

@junhaoliao

Copy link
Copy Markdown
Member

Say if we remove results_cache from the bundled list, the webui still depends on results_cache

right. a similar relationship is between the database and db-table-creator.
is it possible to (come with another docker-compose-xxx.yaml) and extend the webui service and override the depends_on value?

webui needs the results_cache env
not sure if i get this. mind clarifying?

@hoophalab hoophalab marked this pull request as draft November 21, 2025 17:50
@hoophalab

This comment was marked as resolved.

@hoophalab hoophalab marked this pull request as ready for review November 24, 2025 20:08

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)

780-783: Consider conditionally calling setup functions for non-bundled services.

The setup functions for database, queue, redis, and results_cache are called unconditionally (lines 780-783), even when these services are not in the bundled list. These functions create local directories (e.g., data_dir, logs_dir) and perform validation that may not be necessary when using external services (e.g., AWS RDS, external Redis cluster).

While the created directories are currently harmless (unused when replicas=0), this creates unnecessary filesystem artifacts and potentially confusing validation errors for external service configurations.

Consider wrapping these calls in conditional checks:

# Component-specific config
if BundledService.DATABASE in bundled:
    env_vars |= self._set_up_env_for_database()
if BundledService.QUEUE in bundled:
    env_vars |= self._set_up_env_for_queue()
if BundledService.REDIS in bundled:
    env_vars |= self._set_up_env_for_redis()
if BundledService.RESULTS_CACHE in bundled:
    env_vars |= self._set_up_env_for_results_cache()

Note: Connection parameters (host, port, credentials) from the config will still be needed for other services to connect to external instances, so you may need to separate environment setup from directory creation in these functions, or ensure the config still provides connection details for external services.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6aefd0f and 366b027.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/controller.py (2 hunks)
  • tools/deployment/package/docker-compose-all.yaml (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • BundledService (105-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)

Comment thread components/clp-package-utils/clp_package_utils/controller.py Outdated
Comment on lines +48 to +50
deploy:
# Value must be either 0 or 1. Set to 0 to disable the database.
replicas: "${CLP_DATABASE_ENABLED:-1}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: depends_on relationships break when services have replicas: 0.

This change introduces a fundamental issue: when a bundled service is disabled (e.g., CLP_DATABASE_ENABLED=0), the service has replicas: 0 and never starts. However, initialization services still declare hard dependencies on these disabled services:

  • Lines 93-94: db-table-creator depends on database with condition: service_healthy
  • Lines 209-210: results-cache-indices-creator depends on results-cache with condition: service_healthy

When replicas: 0, no containers start, so the health check never succeeds. This causes initialization services to wait indefinitely, preventing the entire stack from starting.

Impact: Users cannot actually disable bundled services as intended—disabling database or results-cache will cause db-table-creator or results-cache-indices-creator to hang, which blocks all dependent services (compression-scheduler, webui, garbage-collector, query-scheduler, reducer, mcp-server, api-server).

Run the following script to confirm which services depend on the bundled services:

Potential solutions:

  1. Make depends_on conditional using Docker Compose profiles or separate compose files for bundled vs. external service configurations
  2. Modify initialization services to skip health checks and connect directly to configured endpoints (bundled or external)
  3. Use a startup script that checks service availability rather than relying on Docker Compose depends_on with service_healthy conditions

Also applies to: 83-94, 107-109, 134-136, 173-175, 203-210


🏁 Script executed:

#!/bin/bash
# Find all services with depends_on relationships to bundled services

echo "=== Services depending on 'database' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'database:'

echo ""
echo "=== Services depending on 'queue' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'queue:'

echo ""
echo "=== Services depending on 'redis' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'redis:'

echo ""
echo "=== Services depending on 'results-cache' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'results-cache:'

Length of output: 234


Merci, je vais approfondir l'analyse. Le résultat précédent est trop minimaliste. Laissez-moi extraire la structure complète des dépendances avec les conditions.

Maintenant je dois examiner le fichier docker-compose-all.yaml pour vérifier la structure exacte des dépendances et les conditions.

Now let me extract the exact structure of the docker-compose file to confirm the dependencies:

Based on my verification, I need to examine the actual docker-compose file to determine the exact Docker Compose version being used and the current configuration. Let me extract the complete depends_on structure from the file:



🏁 Script executed:

#!/bin/bash
# Extract docker-compose version and full depends_on structure with conditions

echo "=== Docker Compose version from file ==="
head -1 tools/deployment/package/docker-compose-all.yaml

echo ""
echo "=== Services with depends_on and their conditions ==="
rg -B 1 'depends_on:' tools/deployment/package/docker-compose-all.yaml | head -80

echo ""
echo "=== Database service with replicas ==="
rg -A 5 '^  database:' tools/deployment/package/docker-compose-all.yaml

echo ""
echo "=== db-table-creator service with full config ==="
rg -A 25 '^  db-table-creator:' tools/deployment/package/docker-compose-all.yaml

Length of output: 1702


Je dois vérifier la structure complète des dépendances problématiques, en particulier pour results-cache-indices-creator:



🏁 Script executed:

#!/bin/bash
# Check results-cache and results-cache-indices-creator configuration

echo "=== Results-cache service with replicas ==="
rg -A 8 '^  results-cache:' tools/deployment/package/docker-compose-all.yaml

echo ""
echo "=== Results-cache-indices-creator service ==="
rg -A 20 '^  results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml

echo ""
echo "=== Check if any depends_on use 'required: false' ==="
rg 'required:' tools/deployment/package/docker-compose-all.yaml

echo ""
echo "=== Check for healthchecks in database service ==="
rg -A 15 '^  database:' tools/deployment/package/docker-compose-all.yaml | grep -i health

Length of output: 1179


Now let me check which other services depend on the initialization services, to see the full impact chain:



🏁 Script executed:

#!/bin/bash
# Find all services that depend on initialization services

echo "=== Services depending on db-table-creator ==="
rg -B 2 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml | grep -A 2 'depends_on'

echo ""
echo "=== Services depending on results-cache-indices-creator ==="
rg -B 2 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml | grep -A 2 'depends_on'

echo ""
echo "=== Complete list of all services ==="
rg '^  [a-z-]+:$' tools/deployment/package/docker-compose-all.yaml | sed 's/^.*:  //' | sed 's/:$//'

echo ""
echo "=== Check which services depend on db-table-creator and results-cache-indices-creator ==="
rg 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml
echo "---"
rg 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml

Length of output: 1139




🏁 Script executed:

#!/bin/bash
# Get specific service names that depend on initialization services

echo "=== Which services depend on db-table-creator ==="
rg -B 5 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml | rg '^  [a-z-]+:$' 

echo ""
echo "=== Which services depend on results-cache-indices-creator ==="
rg -B 5 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml | rg '^  [a-z-]+:$'

Length of output: 166




🏁 Script executed:

#!/bin/bash
# Extract services with their full depends_on sections for better clarity

echo "=== All services with depends_on sections ==="
rg -A 5 'depends_on:' tools/deployment/package/docker-compose-all.yaml | head -150

Length of output: 1574




🏁 Script executed:

#!/bin/bash
# Get the service names that depend on db-table-creator and results-cache-indices-creator

echo "=== Services and their complete depends_on config ==="
python3 << 'EOF'
import re

with open('tools/deployment/package/docker-compose-all.yaml', 'r') as f:
    content = f.read()

# Find all service blocks
service_pattern = r'^  ([a-z-]+):\s*(?:<<:|image:|hostname:)'
services = re.findall(service_pattern, content, re.MULTILINE)

# For each service, check if it depends on db-table-creator or results-cache-indices-creator
lines = content.split('\n')
current_service = None
in_depends_on = False
depends_info = {}

for i, line in enumerate(lines):
    # Check if this is a service definition
    service_match = re.match(r'^  ([a-z-]+):', line)
    if service_match:
        current_service = service_match.group(1)
        in_depends_on = False
        continue
    
    # Check if this is a depends_on section
    if re.match(r'\s+depends_on:', line):
        in_depends_on = True
        if current_service not in depends_info:
            depends_info[current_service] = []
        continue
    
    # Check if we're in depends_on and found a dependency
    if in_depends_on and current_service:
        dep_match = re.match(r'\s+([a-z-]+):', line)
        if dep_match:
            depends_info[current_service].append(dep_match.group(1))
        elif not re.match(r'\s+condition:', line) and line.strip() and not line.startswith('  '):
            in_depends_on = False

# Print services that depend on db-table-creator or results-cache-indices-creator
print("Services depending on db-table-creator or results-cache-indices-creator:")
for service, deps in sorted(depends_info.items()):
    if 'db-table-creator' in deps or 'results-cache-indices-creator' in deps:
        print(f"  {service}: {', '.join(deps)}")
EOF

Length of output: 855


Critical: depends_on relationships break when services have replicas: 0.

When a bundled service is disabled (e.g., CLP_DATABASE_ENABLED=0), that service is configured with replicas: 0 and no containers are launched. However, initialization services declare hard dependencies with health checks on these disabled services:

  • Lines 93-94: db-table-creator depends on database with condition: service_healthy
  • Lines 209-210: results-cache-indices-creator depends on results-cache with condition: service_healthy

When replicas: 0, no containers start, so the health check never succeeds. This causes initialization services to wait indefinitely for a condition that can never be satisfied.

Impact: Disabling database or results-cache causes db-table-creator or results-cache-indices-creator to hang, which blocks all dependent services from starting:

  • db-table-creator blocks: api-server, compression-scheduler, garbage-collector, mcp-server, query-scheduler, webui
  • results-cache-indices-creator blocks: api-server, garbage-collector, mcp-server, webui, reducer (via query-scheduler)

The entire stack fails to start when any bundled service is disabled, preventing the intended configuration flexibility.

Potential solutions:

  1. Use required: false on initialization service dependencies (Docker Compose v2.20.2+) to allow them to proceed when dependencies are unavailable
  2. Implement conditional depends_on logic using profiles or separate compose files for bundled vs. external configurations
  3. Modify initialization services to validate and connect directly to configured endpoints rather than relying on Docker Compose service_healthy conditions

@hoophalab

Copy link
Copy Markdown
Contributor Author

@junhaoliao ready for another review

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rest lgtm

Comment thread components/clp-package-utils/clp_package_utils/controller.py Outdated
@hoophalab hoophalab requested a review from junhaoliao November 25, 2025 21:07
Comment thread components/clp-package-utils/clp_package_utils/controller.py Outdated
@hoophalab hoophalab requested a review from junhaoliao November 26, 2025 19:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/controller.py (2)

238-285: Redis backend database config missing when redis is not bundled.

Lines 271-275 set CLP_REDIS_BACKEND_DB_COMPRESSION and CLP_REDIS_BACKEND_DB_QUERY only when redis is bundled. However, when redis is NOT bundled (line 252 returns early), these env vars are never set. These backend database numbers are required by compression-scheduler, compression-worker, query-scheduler, and query-worker services (referenced in docker-compose-all.yaml lines 231-232, 264-265, 378-379, 417-418) to determine which logical database in Redis to use. Without these vars, services cannot connect to an external redis correctly.

Move the backend DB configuration to the runtime method _set_up_env_for_redis() so it's always set:

     def _set_up_env_for_redis_bundling(self) -> EnvVarsDict:
         """
         Sets up environment variables and directories for bundling the redis component.

         :return: Dictionary of environment variables necessary to bundle the component.
         """
         component_name = REDIS_COMPONENT_NAME

         if BundledService.REDIS not in self._clp_config.bundled:
             logger.info(
                 "%s is not included in the 'bundled' configuration, skipping service bundling...",
                 component_name,
             )
             # Bundling
             return EnvVarsDict({"CLP_REDIS_ENABLED": "0"})

         logger.info("Setting up environment for bundling %s...", component_name)

         conf_file = self._conf_dir / "redis" / "redis.conf"
         data_dir = self._clp_config.data_directory / component_name
         logs_dir = self._clp_config.logs_directory / component_name
         validate_redis_config(self._clp_config, conf_file, data_dir, logs_dir)

         resolved_data_dir = resolve_host_path_in_container(data_dir)
         resolved_logs_dir = resolve_host_path_in_container(logs_dir)

         resolved_data_dir.mkdir(exist_ok=True, parents=True)
         resolved_logs_dir.mkdir(exist_ok=True, parents=True)
         _chown_paths_if_root(resolved_data_dir, resolved_logs_dir)

         env_vars = EnvVarsDict()

-        # Backend databases
-        env_vars |= {
-            "CLP_REDIS_BACKEND_DB_COMPRESSION": str(
-                self._clp_config.redis.compression_backend_database
-            ),
-            "CLP_REDIS_BACKEND_DB_QUERY": str(self._clp_config.redis.query_backend_database),
-        }
-
         # Paths
         env_vars |= {
             "CLP_REDIS_CONF_FILE_HOST": str(conf_file),
             "CLP_REDIS_DATA_DIR_HOST": str(data_dir),
             "CLP_REDIS_LOGS_DIR_HOST": str(logs_dir),
         }

         return env_vars

     def _set_up_env_for_redis(self) -> EnvVarsDict:
         """
         Sets up environment variables for the Redis component.

         :return: Dictionary of environment variables necessary to launch the component.
         """
         component_name = REDIS_COMPONENT_NAME
         logger.info(f"Setting up environment for {component_name}...")

         env_vars = EnvVarsDict()

+        # Backend databases
+        env_vars |= {
+            "CLP_REDIS_BACKEND_DB_COMPRESSION": str(
+                self._clp_config.redis.compression_backend_database
+            ),
+            "CLP_REDIS_BACKEND_DB_QUERY": str(self._clp_config.redis.query_backend_database),
+        }
+
         # Connection config
         env_vars |= {
             "CLP_REDIS_HOST": _get_ip_from_hostname(self._clp_config.redis.host),
             "CLP_REDIS_PORT": str(self._clp_config.redis.port),
         }

         # Credentials
         env_vars |= {
             "CLP_REDIS_PASS": self._clp_config.redis.password,
         }

         return env_vars

311-356: Results cache collection name missing when results_cache is not bundled.

Lines 344-348 set CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME only when results_cache is bundled. When results_cache is NOT bundled (line 325 returns early), this env var is never set. However, this collection name is required by results-cache-indices-creator and webui services (referenced in docker-compose-all.yaml lines 216 and 604) to know which MongoDB collection to use. Without this var, services cannot interact with an external results cache correctly.

Move the collection name configuration to the runtime method _set_up_env_for_results_cache():

     def _set_up_env_for_results_cache_bundling(self) -> EnvVarsDict:
         """
-        Sets up environment variables and directories for bundling the queue component.
+        Sets up environment variables and directories for bundling the results cache component.

         :return: Dictionary of environment variables necessary to bundle the component.
         """
         component_name = RESULTS_CACHE_COMPONENT_NAME

         if BundledService.RESULTS_CACHE not in self._clp_config.bundled:
             logger.info(
                 "%s is not included in the 'bundled' configuration, skipping service bundling...",
                 component_name,
             )
             # Bundling
             return EnvVarsDict({"CLP_RESULTS_CACHE_ENABLED": "0"})

         logger.info("Setting up environment for bundling %s...", component_name)

         conf_file = self._conf_dir / "mongo" / "mongod.conf"
         data_dir = self._clp_config.data_directory / component_name
         logs_dir = self._clp_config.logs_directory / component_name
         validate_results_cache_config(self._clp_config, conf_file, data_dir, logs_dir)

         resolved_data_dir = resolve_host_path_in_container(data_dir)
         resolved_logs_dir = resolve_host_path_in_container(logs_dir)

         resolved_data_dir.mkdir(exist_ok=True, parents=True)
         resolved_logs_dir.mkdir(exist_ok=True, parents=True)
         _chown_paths_if_root(resolved_data_dir, resolved_logs_dir)

         env_vars = EnvVarsDict()

-        # Collections
-        env_vars |= {
-            "CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME": (
-                self._clp_config.results_cache.stream_collection_name
-            ),
-        }
         # Paths
         env_vars |= {
             "CLP_RESULTS_CACHE_CONF_FILE_HOST": str(conf_file),
             "CLP_RESULTS_CACHE_DATA_DIR_HOST": str(data_dir),
             "CLP_RESULTS_CACHE_LOGS_DIR_HOST": str(logs_dir),
         }

         return env_vars

     def _set_up_env_for_results_cache(self) -> EnvVarsDict:
         """
         Sets up environment variables for the results cache (MongoDB) component.

         :return: Dictionary of environment variables necessary to launch the component.
         """
         component_name = RESULTS_CACHE_COMPONENT_NAME
         logger.info(f"Setting up environment for {component_name}...")

         env_vars = EnvVarsDict()

+        # Collections
+        env_vars |= {
+            "CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME": (
+                self._clp_config.results_cache.stream_collection_name
+            ),
+        }
+
         # Connection config
         env_vars |= {
             "CLP_RESULTS_CACHE_DB_NAME": self._clp_config.results_cache.db_name,
             "CLP_RESULTS_CACHE_HOST": _get_ip_from_hostname(self._clp_config.results_cache.host),
             "CLP_RESULTS_CACHE_PORT": str(self._clp_config.results_cache.port),
         }

         return env_vars
♻️ Duplicate comments (2)
tools/deployment/package/docker-compose-all.yaml (1)

48-50: Critical: Hard dependencies with health checks block startup when bundled services are disabled.

The issue identified in previous reviews persists: when a bundled service is set to replicas: 0, initialization services that depend on it with condition: service_healthy will hang indefinitely. Specifically:

  • db-table-creator (lines 93-94) requires database to be healthy
  • results-cache-indices-creator (lines 209-210) requires results-cache to be healthy

When CLP_DATABASE_ENABLED=0 or CLP_RESULTS_CACHE_ENABLED=0, no containers start for these services, the health condition can never be satisfied, and downstream services (api-server, compression-scheduler, garbage-collector, mcp-server, query-scheduler, webui, reducer) remain blocked.

The suggested solutions from the previous review remain valid:

  1. Use required: false on dependencies (Docker Compose v2.20.2+)
  2. Implement conditional depends_on using profiles or separate compose files
  3. Modify initialization services to validate endpoints directly rather than relying on Docker Compose health checks

Also applies to: 83-94, 107-109, 134-136, 173-175, 203-210

components/package-template/src/etc/clp-config.yaml (1)

18-19: Consider adding an explanatory comment for the bundled configuration.

Users may not immediately understand what the bundled list controls. Adding a brief comment would clarify that this determines which third-party services CLP will deploy internally versus expect externally.

Add an explanatory comment:

 #credentials_file_path: "etc/credentials.yaml"
 #
+## Which third-party services CLP should deploy. Omit services to use external instances instead.
 #bundled: ["database", "queue", "redis", "results_cache"]
 #
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41a4cb6 and 1325483.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/controller.py (9 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • tools/deployment/package/docker-compose-all.yaml (4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • BundledService (105-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)

Comment thread components/clp-package-utils/clp_package_utils/controller.py

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes look clean to me and i tested that modifying the bundled array in clp-config.yaml can effectively control whether some 3rd party service should be started or not as part of the docker compose project. Auto completion of the available items, provided by # yaml-language-server: $schema=../usr/share/config-schemas/clp-config.schema.json, also works as intended.

For the title, how about:

feat(clp-package): Support selectively enabling 3rd-party services via `bundled` config (resolves #1623).

@hoophalab hoophalab changed the title feat(clp-package): Allow users to specify which 3rd-party services to start via bundled configuration (resolves #1623). feat(clp-package): Support selectively enabling 3rd-party services via bundled config (resolves #1623). Nov 27, 2025
@junhaoliao

Copy link
Copy Markdown
Member

While this PR effectively skips port binding checks and service inits, to allow external 3rd-party services to be used without manual configs in the generated var/log/.clp-config.yaml, i believe we also need to conditionally run .transform_for_container() for the 3rd-party services related components and also remove some hardcoded hostnames in docker-compose-all.yaml.

@coderabbitai create an issue to track the full support of configuring external 3rd party services

here're the proposed changes

Index: components/clp-py-utils/clp_py_utils/clp_config.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/components/clp-py-utils/clp_py_utils/clp_config.py b/components/clp-py-utils/clp_py_utils/clp_config.py
--- a/components/clp-py-utils/clp_py_utils/clp_config.py	(revision fd474eb13fb6e72dd11e09ce5152cf17e3728b81)
+++ b/components/clp-py-utils/clp_py_utils/clp_config.py	(date 1764245719061)
@@ -867,10 +867,14 @@
         self.archive_output.storage.transform_for_container()
         self.stream_output.storage.transform_for_container()
 
-        self.database.transform_for_container()
-        self.queue.transform_for_container()
-        self.redis.transform_for_container()
-        self.results_cache.transform_for_container()
+        if BundledService.DATABASE in self.bundled:
+            self.database.transform_for_container()
+        if BundledService.QUEUE in self.bundled:
+            self.queue.transform_for_container()
+        if BundledService.REDIS in self.bundled:
+            self.redis.transform_for_container()
+        if BundledService.RESULTS_CACHE in self.bundled:
+            self.results_cache.transform_for_container()
         self.query_scheduler.transform_for_container()
         self.reducer.transform_for_container()
         if self.package.query_engine == QueryEngine.PRESTO and self.presto is not None:
Index: tools/deployment/package/docker-compose-all.yaml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/tools/deployment/package/docker-compose-all.yaml b/tools/deployment/package/docker-compose-all.yaml
--- a/tools/deployment/package/docker-compose-all.yaml	(revision fd474eb13fb6e72dd11e09ce5152cf17e3728b81)
+++ b/tools/deployment/package/docker-compose-all.yaml	(date 1764250901256)
@@ -212,7 +212,8 @@
       "python3",
       "-u",
       "-m", "clp_py_utils.initialize-results-cache",
-      "--uri", "mongodb://results_cache:27017/${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
+      "--uri", "mongodb://${CLP_RESULTS_CACHE_HOST:-127.0.0.1}:${CLP_RESULTS_CACHE_PORT:-27017}\
+        /${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
       "--stream-collection", "${CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME:-stream-files}",
     ]
 
@@ -222,13 +223,15 @@
     stop_grace_period: "300s"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
       CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
       CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_SCHEDULER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/compression_scheduler"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
     volumes:
       - *volume_clp_config_readonly
@@ -254,14 +257,16 @@
     hostname: "compression_worker"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_CONFIG_PATH: "/etc/clp-config.yaml"
       CLP_HOME: "/opt/clp"
       CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/compression_worker"
       CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
     volumes:
       - *volume_clp_config_readonly
@@ -369,13 +374,15 @@
     stop_grace_period: "10s"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
       CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
       CLP_LOGGING_LEVEL: "${CLP_QUERY_SCHEDULER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/query_scheduler"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_QUERY:-0}"
     volumes:
       - *volume_clp_config_readonly
@@ -407,14 +414,16 @@
     hostname: "query_worker"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_CONFIG_PATH: "/etc/clp-config.yaml"
       CLP_HOME: "/opt/clp"
       CLP_LOGGING_LEVEL: "${CLP_QUERY_WORKER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/query_worker"
       CLP_WORKER_LOG_PATH: "/var/log/query_worker/worker.log"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_QUERY:-0}"
     volumes:
       - *volume_clp_config_readonly

to test:

docker run -d \
  -p 192.168.3.77:10086:3306 \
  -e MYSQL_DATABASE=clp-db \
  -e MYSQL_USER=clp-user \
  -e MYSQL_PASSWORD=OB_EUDDk_EU \
  -e MYSQL_ROOT_PASSWORD=OB_EUDDk_EU \
  mariadb:10-jammy

docker run -d \
  -p 192.168.3.77:10087:5672 \
  -e RABBITMQ_DEFAULT_USER=clp-user \
  -e RABBITMQ_DEFAULT_PASS=wosMW_7T488 \
  rabbitmq:3.9.8

docker run -d \
  -p 192.168.3.77:10088:6379 \
  redis:7.2.4 \
  redis-server --requirepass "zHEKvSI3qYQb1ta2KQw_Bg"

docker run -d \
  -p 192.168.3.77:10089:27017 \
  mongo:7.0.1 \
  --replSet rs0 \
  --bind_ip 0.0.0.0

# update etc/clp-config.yaml to set an empty `bundled` []
# and also update the connection configs of the 3rd party services.
# also, make sure credentials.yaml match the above credentials
./sbin/start-clp.sh # observe no failure in startup
./sbin/compress.sh --timestamp-key=Timestamp ~/samples/samples/spark-event-logs/
# use the webui to perform a search with string "123" and observe results are returned
# click "Original file" on any results and observe the log viewer is opened with the log result's context

@coderabbitai

coderabbitai Bot commented Nov 27, 2025

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@junhaoliao I'll create an issue to track the full support for configuring external 3rd-party services.


🏁 Script executed:

#!/bin/bash

# Create GitHub issue for tracking external 3rd-party service configuration support
gh issue create \
  --repo y-scope/clp \
  --title "Support full configuration of external 3rd-party services" \
  --body "## Description

This issue tracks the work needed to fully support using external 3rd-party services (database, queue, redis, results_cache) instead of bundled ones.

## Background

PR #1648 introduced the \`bundled\` configuration option to allow users to specify which 3rd-party services to start. However, additional changes are needed to fully support external services:

1. **Conditional transform_for_container() calls**: The \`transform_for_container()\` method should only be called for 3rd-party service components that are bundled, not for external ones.

2. **Remove hardcoded hostnames**: Several hardcoded service hostnames (e.g., \`results_cache\`, \`queue\`, \`redis\`) in docker-compose-all.yaml need to be replaced with environment variables to allow connecting to external services.

## Context

- Discussed in: https://github.com/y-scope/clp/pull/1648#discussion_r1894467890
- Requested by: @junhaoliao

## Requirements

- Conditionally call \`.transform_for_container()\` only for bundled services in \`components/clp-py-utils/clp_py_utils/clp_config.py\`
- Replace hardcoded hostnames with environment variables in \`tools/deployment/package/docker-compose-all.yaml\` for:
  - results-cache-init (results_cache hostname)
  - compression-scheduler (queue, redis hostnames)
  - compression-worker (queue, redis hostnames)
  - query-scheduler (queue, redis hostnames)
  - query-worker (queue, redis hostnames)

This will enable users to set \`bundled: []\` in clp-config.yaml and configure external service endpoints without manual modifications to the generated configuration." \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created issue #1680 to track the full support for configuring external 3rd-party services. The issue covers both the conditional transform_for_container() calls and the removal of hardcoded hostnames in docker-compose-all.yaml.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

@hoophalab hoophalab merged commit 82f73d8 into y-scope:main Nov 27, 2025
19 checks passed
Comment on lines -192 to -195
# Paths
env_vars |= {
"CLP_QUEUE_LOGS_DIR_HOST": str(logs_dir),
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was accidentally removed

@coderabbitai create an issue to track

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@junhaoliao, I'll create an issue to track this regression.


🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the accidentally removed environment variable
gh issue create \
  --repo y-scope/clp \
  --title "Restore accidentally removed CLP_QUEUE_LOGS_DIR_HOST environment variable" \
  --body "## Description

During the refactoring in PR #1648, the \`CLP_QUEUE_LOGS_DIR_HOST\` environment variable was accidentally removed from the \`_set_up_env_for_queue\` method in \`components/clp-package-utils/clp_package_utils/controller.py\`.

## Context

- Reported by: @junhaoliao
- PR: https://github.com/y-scope/clp/pull/1648
- Comment: https://github.com/y-scope/clp/pull/1648#discussion_r2572694798

The environment variable needs to be restored to ensure proper queue component configuration." \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1693 to track restoring the accidentally removed CLP_QUEUE_LOGS_DIR_HOST environment variable: #1693


🧬 Code Graph Analysis Results

File: components/clp-py-utils/clp_py_utils/core.py

At line range 82-103:

def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
    """
    Resolves a host path and translates it into its container-mount equivalent absolute path. See
    `resolve_host_path` for details about the resolution.

    :param host_path: The host path.
    :return: The translated path (with /mnt/host prefix).
    """
    host_path = host_path.expanduser()

    if not host_path.is_absolute():
        pwd_host = os.getenv("CLP_PWD_HOST")
        if pwd_host is None:
            raise ValueError("Relative host path provided but CLP_PWD_HOST is not set.")
        host_path = pathlib.Path(pwd_host) / host_path
        host_path = host_path.absolute()

    resolved_path = _resolve_symlinks_in_path(host_path)
    if resolved_path is not None:
        return resolved_path

    return CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")

File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py

At line range 296-297:

def get_archives_table_name(table_prefix: str, dataset: str | None) -> str:
    return _get_table_name(table_prefix, ARCHIVES_TABLE_SUFFIX, dataset)

At line range 304-305:

def get_datasets_table_name(table_prefix: str) -> str:
    return _get_table_name(table_prefix, DATASETS_TABLE_SUFFIX, None)

At line range 308-309:

def get_files_table_name(table_prefix: str, dataset: str | None) -> str:
    return _get_table_name(table_prefix, FILES_TABLE_SUFFIX, dataset)

File: components/clp-package-utils/clp_package_utils/general.py

At line range 172-195:

def check_docker_dependencies(should_compose_project_be_running: bool, project_name: str):
    """
    Checks if Docker and Docker Compose are installed, and whether a Docker Compose project is
    running.

    :param should_compose_project_be_running:
    :param project_name: The Docker Compose project name to check.
    :raise DockerNotAvailableError: If any Docker dependency is not installed.
    :raise DockerComposeProjectNotRunningError: If the project isn't running when it should be.
    :raise DockerComposeProjectAlreadyRunningError: If the project is running when it shouldn't be.
    """
    try:
        subprocess.check_output(
            ["docker", "--version"],
            stderr=subprocess.STDOUT,
        )
    except subprocess.CalledProcessError as e:
        raise DockerNotAvailableError("docker is not installed or available on the path", e) from e

    is_running = _is_docker_compose_project_running(project_name)
    if should_compose_project_be_running and not is_running:
        raise DockerComposeProjectNotRunningError(project_name)
    if not should_compose_project_be_running and is_running:
        raise DockerComposeProjectAlreadyRunningError(project_name)

At line range 82-86:

class DockerComposeProjectNotRunningError(DockerDependencyError):
    """Raised when a Docker Compose project is not running but should be."""

    def __init__(self, project_name: str) -> None:
        super().__init__(errno.ESRCH, f"Docker Compose project '{project_name}' is not running.")

At line range 57-58:

class DockerDependencyError(OSError):
    """Base class for errors related to Docker dependencies."""

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 105-109:

class BundledService(LowercaseStrEnum):
    DATABASE = auto()
    QUEUE = auto()
    REDIS = auto()
    RESULTS_CACHE = auto()

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 633-877 (selected content):

class ClpConfig(BaseModel):
    container_image_ref: NonEmptyStr | None = None

    logs_input: FsIngestionConfig | S3IngestionConfig = FsIngestionConfig()
    bundled: list[BundledServiceStr] = [
        BundledService.DATABASE,
        BundledService.QUEUE,
        BundledService.REDIS,
        BundledService.RESULTS_CACHE,
    ]

    package: Package = Package()
    database: Database = Database()
    queue: Queue = Queue()
    redis: Redis = Redis()
    reducer: Reducer = Reducer()
    results_cache: ResultsCache = ResultsCache()
    compression_scheduler: CompressionScheduler = CompressionScheduler()
    query_scheduler: QueryScheduler = QueryScheduler()
    compression_worker: CompressionWorker = CompressionWorker()
    query_worker: QueryWorker = QueryWorker()
    webui: WebUi = WebUi()
    garbage_collector: GarbageCollector = GarbageCollector()
    api_server: ApiServer = ApiServer()
    credentials_file_path: SerializablePath = CLP_DEFAULT_CREDENTIALS_FILE_PATH

    mcp_server: McpServer | None = None
    presto: Presto | None = None

    archive_output: ArchiveOutput = ArchiveOutput()
    stream_output: StreamOutput = StreamOutput()
    data_directory: SerializablePath = CLP_DEFAULT_DATA_DIRECTORY_PATH
    logs_directory: SerializablePath = CLP_DEFAULT_LOG_DIRECTORY_PATH
    tmp_directory: SerializablePath = CLP_DEFAULT_TMP_DIRECTORY_PATH
    aws_config_directory: SerializablePath | None = None

    _container_image_id_path: SerializablePath = PrivateAttr(
        default=CLP_PACKAGE_CONTAINER_IMAGE_ID_PATH
    )
    _version_file_path: SerializablePath = PrivateAttr(default=CLP_VERSION_FILE_PATH)

    @field_validator("aws_config_directory")
    @classmethod
    def expand_profile_user_home(cls, value: SerializablePath | None) -> SerializablePath | None:
        if value is not None:
            value = value.expanduser()
        return value

    def make_config_paths_absolute(self, clp_home: pathlib.Path):
        if StorageType.FS == self.logs_input.type:
            self.logs_input.make_config_paths_absolute(clp_home)
        self.credentials_file_path = make_config_path_absolute(clp_home, self.credentials_file_path)
        self.archive_output.storage.make_config_paths_absolute(clp_home)
        self.stream_output.storage.make_config_paths_absolute(clp_home)
        self.data_directory = make_config_path_absolute(clp_home, self.data_directory)
        self.logs_directory = make_config_path_absolute(clp_home, self.logs_directory)
        self.tmp_directory = make_config_path_absolute(clp_home, self.tmp_directory)
        self._container_image_id_path = make_config_path_absolute(
            clp_home, self._container_image_id_path
        )
        self._version_file_path = make_config_path_absolute(clp_home, self._version_file_path)

    def validate_logs_input_config(self, use_host_mount: bool = False):
        logs_input_type = self.logs_input.type
        if StorageType.FS == logs_input_type:
            # NOTE: This can't be a pydantic validator since input_logs_dir might be a
            # package-relative path that will only be resolved after pydantic validation
            input_logs_dir = self.logs_input.directory
            resolved_input_logs_dir = (
                resolve_host_path_in_container(input_logs_dir) if use_host_mount else input_logs_dir
            )
            if not resolved_input_logs_dir.exists():
                raise ValueError(f"logs_input.directory '{input_logs_dir}' doesn't exist.")
            if not resolved_input_logs_dir.is_dir():
                raise ValueError(f"logs_input.directory '{input_logs_dir}' is not a directory.")
        if StorageType.S3 == logs_input_type and StorageEngine.CLP_S != self.package.storage_engine:
            raise ValueError(
                f"logs_input.type = 's3' is only supported with package.storage_engine"
                f" = '{StorageEngine.CLP_S}'"
            )

    def validate_archive_output_config(self, use_host_mount: bool = False):
        if (
            StorageType.S3 == self.archive_output.storage.type
            and StorageEngine.CLP_S != self.package.storage_engine
        ):
            raise ValueError(
                f"archive_output.storage.type = 's3' is only supported with package.storage_engine"
                f" = '{StorageEngine.CLP_S}'"
            )
        archive_output_dir = self.archive_output.get_directory()
        resolved_archive_output_dir = (
            resolve_host_path_in_container(archive_output_dir)
            if use_host_mount
            else archive_output_dir
        )
        try:
            validate_path_could_be_dir(resolved_archive_output_dir)
        except ValueError as ex:
            raise ValueError(f"archive_output.storage's directory is invalid: {ex}")

    def validate_stream_output_config(self, use_host_mount: bool = False):
        if (
            StorageType.S3 == self.stream_output.storage.type
            and StorageEngine.CLP_S != self.package.storage_engine
        ):
            raise ValueError(
                f"stream_output.storage.type = 's3' is only supported with package.storage_engine"
                f" = '{StorageEngine.CLP_S}'"
            )
        stream_output_dir = self.stream_output.get_directory()
        resolved_stream_output_dir = (
            resolve_host_path_in_container(stream_output_dir)
            if use_host_mount
            else stream_output_dir
        )
        try:
            validate_path_could_be_dir(resolved_stream_output_dir)
        except ValueError as ex:
            raise ValueError(f"stream_output.storage's directory is invalid: {ex}")

    def validate_data_dir(self, use_host_mount: bool = False):
        data_dir = self.data_directory
        resolved_data_dir = resolve_host_path_in_container(data_dir) if use_host_mount else data_dir
        try:
            validate_path_could_be_dir(resolved_data_dir)
        except ValueError as ex:
            raise ValueError(f"data_directory is invalid: {ex}")

    def validate_logs_dir(self, use_host_mount: bool = False):
        logs_dir = self.logs_directory
        resolved_logs_dir = resolve_host_path_in_container(logs_dir) if use_host_mount else logs_dir
        try:
            validate_path_could_be_dir(resolved_logs_dir)
        except ValueError as ex:
            raise ValueError(f"logs_directory is invalid: {ex}")

    def validate_tmp_dir(self, use_host_mount: bool = False):
        tmp_dir = self.tmp_directory
        resolved_tmp_dir = resolve_host_path_in_container(tmp_dir) if use_host_mount else tmp_dir
        try:
            validate_path_could_be_dir(resolved_tmp_dir)
        except ValueError as ex:
            raise ValueError(f"tmp_directory is invalid: {ex}")

    def validate_aws_config_dir(self, use_host_mount: bool = False):
        profile_auth_used = False
        auth_configs = []

        if StorageType.S3 == self.logs_input.type:
            auth_configs.append(self.logs_input.aws_authentication)
        if StorageType.S3 == self.archive_output.storage.type:
            auth_configs.append(self.archive_output.storage.s3_config.aws_authentication)
        if StorageType.S3 == self.stream_output.storage.type:
            auth_configs.append(self.stream_output.storage.s3_config.aws_authentication)

        for auth in auth_configs:
            if AwsAuthType.profile == auth.type:
                profile_auth_used = True
                break

        if profile_auth_used:
            if self.aws_config_directory is None:
                raise ValueError(
                    "aws_config_directory must be set when using profile authentication"
                )
            resolved_aws_config_dir = (
                resolve_host_path_in_container(self.aws_config_directory)
                if use_host_mount
                else self.aws_config_directory
            )
            if not resolved_aws_config_dir.exists():
                raise ValueError(
                    f"aws_config_directory does not exist: '{self.aws_config_directory}'"
                )
        if not profile_auth_used and self.aws_config_directory is not None:
            raise ValueError(
                "aws_config_directory should not be set when profile authentication is not used"
            )

    def load_container_image_ref(self):
        if self.container_image_ref is not None:
            # Accept configured value for debug purposes
            return

        if self._container_image_id_path.exists():
            with open(self._container_image_id_path) as image_id_file:
                self.container_image_ref = image_id_file.read().strip()
        else:
            with open(self._version_file_path) as version_file:
                clp_package_version = version_file.read().strip()
            self.container_image_ref = f"ghcr.io/y-scope/clp/clp-package:{clp_package_version}"

    def get_shared_config_file_path(self) -> pathlib.Path:
        return self.logs_directory / CLP_SHARED_CONFIG_FILENAME

    def get_deployment_type(self) -> DeploymentType:
        if QueryEngine.PRESTO == self.package.query_engine:
            return DeploymentType.BASE
        return DeploymentType.FULL

    def dump_to_primitive_dict(self):
        custom_serialized_fields = {
            "database",
            "queue",
            "redis",
        }
        d = self.model_dump(exclude=custom_serialized_fields)
        for key in custom_serialized_fields:
            d[key] = getattr(self, key).dump_to_primitive_dict()

        return d

    @model_validator(mode="after")
    def validate_presto_config(self):
        query_engine = self.package.query_engine
        presto = self.presto
        if query_engine == QueryEngine.PRESTO and presto is None:
            raise ValueError(
                f"`presto` config must be non-null when query_engine is `{query_engine}`"
            )
        return self

    def transform_for_container(self):
        """
        Converts all relevant directories to absolute paths inside the container, and updates
        component hostnames and ports to their container service names and default ports.
        """
        self.data_directory = pathlib.Path("/") / CLP_DEFAULT_DATA_DIRECTORY_PATH
        self.logs_directory = pathlib.Path("/") / CLP_DEFAULT_LOG_DIRECTORY_PATH
        self.tmp_directory = pathlib.Path("/") / CLP_DEFAULT_TMP_DIRECTORY_PATH
        if self.aws_config_directory is not None:
            self.aws_config_directory = CONTAINER_AWS_CONFIG_DIRECTORY
        self.logs_input.transform_for_container()
        self.archive_output.storage.transform_for_container()
        self.stream_output.storage.transform_for_container()

        self.database.transform_for_container()
        self.queue.transform_for_container()
        self.redis.transform_for_container()
        self.results_cache.transform_for_container()
        self.query_scheduler.transform_for_container()
        self.reducer.transform_for_container()
        if self.package.query_engine == QueryEngine.PRESTO and self.presto is not None:
            self.presto.transform_for_container()

File: components/clp-package-utils/clp_package_utils/general.py

At line range 365-373:

def dump_shared_container_config(container_clp_config: ClpConfig, clp_config: ClpConfig):
    """
    Dumps the given container config to `CLP_SHARED_CONFIG_FILENAME` in the logs directory, so that
    it's accessible in the container.

    :param container_clp_config:
    :param clp_config:
    """
    dump_container_config(container_clp_config, clp_config, CLP_SHARED_CONFIG_FILENAME)

At line range 315-325:

def generate_docker_compose_container_config(clp_config: ClpConfig) -> ClpConfig:
    """
    Copies the given config and transforms mount paths and hosts for Docker Compose.

    :param clp_config:
    :return: The container config.
    """
    container_clp_config = clp_config.model_copy(deep=True)
    container_clp_config.transform_for_container()

    return container_clp_config

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 145-161 (snippet indicating CLP_HOME resolution, shown for context):

def get_clp_home():
    # Determine CLP_HOME from an environment variable or this script's path
    clp_home = None
    if "CLP_HOME" in os.environ:
        clp_home = pathlib.Path(os.environ["CLP_HOME"])
    else:
        for path in pathlib.Path(__file__).resolve().parents:
            if "lib" == path.name:
                clp_home = path.parent
                break

    if clp_home is None:
        raise ValueError("CLP_HOME is not set and could not be determined automatically.")
    if not clp_home.exists():
        raise ValueError("CLP_HOME set to nonexistent path.")

    return clp_home.resolve()

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 672-679:

def is_retention_period_configured(clp_config: ClpConfig) -> bool:
    if clp_config.archive_output.retention_period is not None:
        return True

    if clp_config.results_cache.retention_period is not None:
        return True

    return False

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 502-514:

def validate_db_config(
    clp_config: ClpConfig,
    component_config: pathlib.Path,
    data_dir: pathlib.Path,
    logs_dir: pathlib.Path,
):
    resolved_component_config = resolve_host_path_in_container(component_config)
    if not resolved_component_config.exists():
        raise ValueError(f"{DB_COMPONENT_NAME} configuration file missing: '{component_config}'.")
    _validate_data_directory(data_dir, DB_COMPONENT_NAME)
    _validate_log_directory(logs_dir, DB_COMPONENT_NAME)

    validate_port(f"{DB_COMPONENT_NAME}.port", clp_config.database.host, clp_config.database.port)

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 580-590:

def validate_webui_config(
    clp_config: ClpConfig,
    client_settings_json_path: pathlib.Path,
    server_settings_json_path: pathlib.Path,
):
    for path in [client_settings_json_path, server_settings_json_path]:
        resolved_path = resolve_host_path_in_container(path)
        if not resolved_path.exists():
            raise ValueError(f"{WEBUI_COMPONENT_NAME} {path} is not a valid path to settings.json")

    validate_port(f"{WEBUI_COMPONENT_NAME}.port", clp_config.webui.host, clp_config.webui.port)

File: components/clp-py-utils/clp_py_utils/clp_config.py

At line range 829-832:

def get_deployment_type(self) -> DeploymentType:
        if QueryEngine.PRESTO == self.package.query_engine:
            return DeploymentType.BASE
        return DeploymentType.FULL

🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1136
File: components/webui/client/src/pages/IngestPage/Details/sql.ts:1-1
Timestamp: 2025-07-29T21:00:07.757Z
Learning: User junhaoliao requested creating a GitHub issue to track server-side SQL query error handling improvements, specifically to prevent uncaught query errors from causing 500 errors to reach the client.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow users to specify which 3rd-party services to start via bundled configuration

2 participants