feat(clp-package): Support selectively enabling 3rd-party services via bundled config (resolves #1623).#1648
Conversation
… start via bundled configuration (resolves y-scope#1623).
WalkthroughSplits per-service environment setup into bundling vs runtime paths, adds a BundledService enum and ClpConfig.bundled, updates controller methods to emit CLP_ENABLED flags for non-bundled services and populate runtime host/port/creds when bundled, updates default clp-config, deduplicates api-server, and drives deploy.replicas by CLP*_ENABLED. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User
participant C as DockerComposeController
participant CFG as ClpConfig
participant ENV as Environment
participant DC as Docker Compose
U->>C: set_up_env()
C->>CFG: read bundled list
CFG-->>C: bundled list
par Bundling checks
C->>C: _set_up_env_for_database_bundling()
alt database NOT in bundled
C->>ENV: CLP_DATABASE_ENABLED = "0"
else
C->>ENV: leave/enable database
end
C->>C: _set_up_env_for_queue_bundling()
alt queue NOT in bundled
C->>ENV: CLP_QUEUE_ENABLED = "0"
end
C->>C: _set_up_env_for_redis_bundling()
alt redis NOT in bundled
C->>ENV: CLP_REDIS_ENABLED = "0"
end
C->>C: _set_up_env_for_results_cache_bundling()
alt results_cache NOT in bundled
C->>ENV: CLP_RESULTS_CACHE_ENABLED = "0"
end
and Runtime env population
C->>C: _set_up_env_for_database() — set host/port/creds/paths when runtime used
C->>C: _set_up_env_for_queue() — uses DB config when applicable
C->>C: _set_up_env_for_redis() — set host/port/creds/paths when runtime used
C->>C: _set_up_env_for_results_cache() — set host/port/db when runtime used
end
C->>DC: run compose (deploy.replicas use CLP_*_ENABLED)
DC-->>U: services started/stopped per flags
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@junhaoliao It looks like there is no way to specify exactly what to boot in docker compose. Say if we remove results_cache from the bundled list, the webui still depends on results_cache and results_cache still starts. Additionally, since webui needs the results_cache env, so we also cannot remove the env from it. |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
785-799: Drop or use the localbundledvariable inset_up_env
bundled = self._clp_config.bundledis currently unused. If you don’t plan to gate any env setup onbundled, you can remove this line to keepset_up_envleaner; otherwise, consider wiring it into whatever conditional logic you intend.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/clp-package-utils/clp_package_utils/controller.py(5 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(2 hunks)components/package-template/src/etc/clp-config.yaml(1 hunks)tools/deployment/package/docker-compose.yaml(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
BundledService(105-109)get_deployment_type(827-830)DeploymentType(92-94)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
105-112: NewBundledServiceenum and serializer look consistentThe enum values and
BundledServiceStralias integrate cleanly with existing pattern (e.g.,StorageEngineStr) and match the lower‑case strings used in YAML and the controller. I don’t see behavioural issues here.tools/deployment/package/docker-compose.yaml (1)
57-60:api-serverservice wiring matches controller expectationsAdding
api-serverhere, extending fromdocker-compose-all.yaml, aligns with the newDOCKER_COMPOSE_BASE_SERVICESlist in the controller and should avoid duplicate service definitions. No issues from my side.
| DOCKER_COMPOSE_BASE_SERVICES = [ | ||
| "database", | ||
| "db-table-creator", | ||
| "queue", | ||
| "redis", | ||
| "results-cache", | ||
| "results-cache-indices-creator", | ||
| "compression-scheduler", | ||
| "compression-worker", | ||
| "webui", | ||
| "garbage-collector", | ||
| "api-server" | ||
| ] | ||
|
|
||
| DOCKER_COMPOSE_EXTENDED_SERVICES = ["query-scheduler", "query-worker", "reducer", "mcp-server"] | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
Fix global list mutation and return type in _get_docker_compose_services
A couple of issues here:
services = DOCKER_COMPOSE_BASE_SERVICESfollowed byservices += DOCKER_COMPOSE_EXTENDED_SERVICESand multipleservices.remove(...)calls mutates the module‑level lists in place. Repeated calls (or different controller instances) can end up with duplicated extended services or missing base services, and furtherremovecalls may raiseValueErrorif an element was already removed._get_docker_compose_servicesis annotated to returnstrbut actually returns alist[str], whilestart()relies on list semantics (cmd += ...). This will confuse type checkers and could mask future changes where the function accidentally returns a string.
Suggested adjustments:
# Keep these as true constants:
DOCKER_COMPOSE_BASE_SERVICES = [
"database",
"db-table-creator",
"queue",
"redis",
"results-cache",
"results-cache-indices-creator",
"compression-scheduler",
"compression-worker",
"webui",
"garbage-collector",
"api-server",
]
DOCKER_COMPOSE_EXTENDED_SERVICES = [
"query-scheduler",
"query-worker",
"reducer",
"mcp-server",
]
def _get_docker_compose_services(self) -> list[str]:
"""
:return: The Docker Compose services to start based on the config.
"""
deployment_type = self._clp_config.get_deployment_type()
services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES)
if deployment_type == DeploymentType.FULL:
services += DOCKER_COMPOSE_EXTENDED_SERVICES
bundled = self._clp_config.bundled
if BundledService.DATABASE not in bundled and "database" in services:
services.remove("database")
if BundledService.QUEUE not in bundled and "queue" in services:
services.remove("queue")
if BundledService.REDIS not in bundled and "redis" in services:
services.remove("redis")
if BundledService.RESULTS_CACHE not in bundled:
services = [
s for s in services if s not in ("results-cache", "results-cache-indices-creator")
]
return servicesThis keeps the constants immutable, avoids surprising cross‑call side effects, and aligns the return type with how start() uses the result.
Also applies to: 819-825, 882-903
I need to verify the code issues identified. Let me examine the file and the function in question.
🏁 Script executed:
#!/bin/bash
# Examine the controller.py file to verify the issues
cd components/clp-package-utils/clp_package_utils
# First, show lines 60-90 to see the constants and function context
echo "=== Lines around 68-83 (constants and start of _get_docker_compose_services) ==="
sed -n '60,90p' controller.py
echo ""
echo "=== Look for the _get_docker_compose_services function definition ==="
rg -n "_get_docker_compose_services" controller.py -A 30 -B 2
echo ""
echo "=== Check lines 819-825 ==="
sed -n '815,830p' controller.py
echo ""
echo "=== Check lines 882-903 ==="
sed -n '880,905p' controller.pyLength of output: 5114
🏁 Script executed:
#!/bin/bash
# Search for all usages of the function and globals
cd components/clp-package-utils/clp_package_utils
echo "=== All calls to _get_docker_compose_services ==="
rg "_get_docker_compose_services" controller.py -n
echo ""
echo "=== Any direct accesses to DOCKER_COMPOSE_BASE_SERVICES ==="
rg "DOCKER_COMPOSE_BASE_SERVICES" controller.py -n
echo ""
echo "=== Any direct accesses to DOCKER_COMPOSE_EXTENDED_SERVICES ==="
rg "DOCKER_COMPOSE_EXTENDED_SERVICES" controller.py -nLength of output: 533
Fix return type annotation and global list mutation in _get_docker_compose_services
Verification confirms both issues:
-
Return type mismatch: Line 882 declares
-> strbut line 903 returns a list. Line 825 uses the result with list semantics (cmd += ...). Type checkers will flag this. -
Global list mutation: Line 889 assigns
services = DOCKER_COMPOSE_BASE_SERVICES(reference to global), then lines 891, 895, 897, 899–902 perform in-place mutations (.remove(),+=). This mutates the module-level lists. Repeated calls or multiple controller instances will hitValueErrorwhen attempting to remove already-removed services.
The suggested refactor is correct:
- Create a local copy:
services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES) - Extend conditionally:
services += DOCKER_COMPOSE_EXTENDED_SERVICES - Filter immutably: use list comprehension for results-cache removal
- Fix annotation:
-> list[str]
Apply changes at lines 882–903 (function definition).
🤖 Prompt for AI Agents
components/clp-package-utils/clp_package_utils/controller.py around lines
882-903, the function currently declares -> str but returns a list and mutates
module-level lists; change the return annotation to -> list[str], create a local
copy services: list[str] = list(DOCKER_COMPOSE_BASE_SERVICES) (do not assign the
global), conditionally extend with DOCKER_COMPOSE_EXTENDED_SERVICES via services
+= DOCKER_COMPOSE_EXTENDED_SERVICES, remove the "results-cache" entry immutably
using a list comprehension (e.g. services = [s for s in services if s !=
"results-cache"]) instead of .remove(), and return the local services list so
globals remain unmodified.
| #bundled: ["database", "queue", "redis", "results_cache"] | ||
| # |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider documenting what bundled controls in the template
The example list is correct and matches the enum/controller logic, but it may not be obvious that this controls which 3rd‑party backing services CLP will start vs expect externally. Consider adding a short comment above, e.g., “# Which third‑party services CLP should deploy (omit to use external services instead)”, to reduce user confusion.
🤖 Prompt for AI Agents
In components/package-template/src/etc/clp-config.yaml around lines 17 to 18,
add a brief explanatory comment above the bundled key that states this list
controls which third‑party services CLP will deploy versus expect externally
(e.g., “Which third‑party services CLP should deploy; omit to use external
services instead”), so users understand the flag’s effect and when to remove it;
keep the comment short and placed immediately above the example list.
right. a similar relationship is between the database and db-table-creator.
|
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
780-783: Consider conditionally calling setup functions for non-bundled services.The setup functions for database, queue, redis, and results_cache are called unconditionally (lines 780-783), even when these services are not in the bundled list. These functions create local directories (e.g., data_dir, logs_dir) and perform validation that may not be necessary when using external services (e.g., AWS RDS, external Redis cluster).
While the created directories are currently harmless (unused when replicas=0), this creates unnecessary filesystem artifacts and potentially confusing validation errors for external service configurations.
Consider wrapping these calls in conditional checks:
# Component-specific config if BundledService.DATABASE in bundled: env_vars |= self._set_up_env_for_database() if BundledService.QUEUE in bundled: env_vars |= self._set_up_env_for_queue() if BundledService.REDIS in bundled: env_vars |= self._set_up_env_for_redis() if BundledService.RESULTS_CACHE in bundled: env_vars |= self._set_up_env_for_results_cache()Note: Connection parameters (host, port, credentials) from the config will still be needed for other services to connect to external instances, so you may need to separate environment setup from directory creation in these functions, or ensure the config still provides connection details for external services.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/controller.py(2 hunks)tools/deployment/package/docker-compose-all.yaml(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
BundledService(105-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
| deploy: | ||
| # Value must be either 0 or 1. Set to 0 to disable the database. | ||
| replicas: "${CLP_DATABASE_ENABLED:-1}" |
There was a problem hiding this comment.
🧩 Analysis chain
Critical: depends_on relationships break when services have replicas: 0.
This change introduces a fundamental issue: when a bundled service is disabled (e.g., CLP_DATABASE_ENABLED=0), the service has replicas: 0 and never starts. However, initialization services still declare hard dependencies on these disabled services:
- Lines 93-94:
db-table-creatordepends ondatabasewithcondition: service_healthy - Lines 209-210:
results-cache-indices-creatordepends onresults-cachewithcondition: service_healthy
When replicas: 0, no containers start, so the health check never succeeds. This causes initialization services to wait indefinitely, preventing the entire stack from starting.
Impact: Users cannot actually disable bundled services as intended—disabling database or results-cache will cause db-table-creator or results-cache-indices-creator to hang, which blocks all dependent services (compression-scheduler, webui, garbage-collector, query-scheduler, reducer, mcp-server, api-server).
Run the following script to confirm which services depend on the bundled services:
Potential solutions:
- Make
depends_onconditional using Docker Compose profiles or separate compose files for bundled vs. external service configurations - Modify initialization services to skip health checks and connect directly to configured endpoints (bundled or external)
- Use a startup script that checks service availability rather than relying on Docker Compose
depends_onwithservice_healthyconditions
Also applies to: 83-94, 107-109, 134-136, 173-175, 203-210
🏁 Script executed:
#!/bin/bash
# Find all services with depends_on relationships to bundled services
echo "=== Services depending on 'database' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'database:'
echo ""
echo "=== Services depending on 'queue' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'queue:'
echo ""
echo "=== Services depending on 'redis' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'redis:'
echo ""
echo "=== Services depending on 'results-cache' ==="
rg -A 2 'depends_on:' tools/deployment/package/docker-compose-all.yaml | rg -B 1 'results-cache:'Length of output: 234
Merci, je vais approfondir l'analyse. Le résultat précédent est trop minimaliste. Laissez-moi extraire la structure complète des dépendances avec les conditions.
Maintenant je dois examiner le fichier docker-compose-all.yaml pour vérifier la structure exacte des dépendances et les conditions.
Now let me extract the exact structure of the docker-compose file to confirm the dependencies:
Based on my verification, I need to examine the actual docker-compose file to determine the exact Docker Compose version being used and the current configuration. Let me extract the complete depends_on structure from the file:
🏁 Script executed:
#!/bin/bash
# Extract docker-compose version and full depends_on structure with conditions
echo "=== Docker Compose version from file ==="
head -1 tools/deployment/package/docker-compose-all.yaml
echo ""
echo "=== Services with depends_on and their conditions ==="
rg -B 1 'depends_on:' tools/deployment/package/docker-compose-all.yaml | head -80
echo ""
echo "=== Database service with replicas ==="
rg -A 5 '^ database:' tools/deployment/package/docker-compose-all.yaml
echo ""
echo "=== db-table-creator service with full config ==="
rg -A 25 '^ db-table-creator:' tools/deployment/package/docker-compose-all.yamlLength of output: 1702
Je dois vérifier la structure complète des dépendances problématiques, en particulier pour results-cache-indices-creator:
🏁 Script executed:
#!/bin/bash
# Check results-cache and results-cache-indices-creator configuration
echo "=== Results-cache service with replicas ==="
rg -A 8 '^ results-cache:' tools/deployment/package/docker-compose-all.yaml
echo ""
echo "=== Results-cache-indices-creator service ==="
rg -A 20 '^ results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml
echo ""
echo "=== Check if any depends_on use 'required: false' ==="
rg 'required:' tools/deployment/package/docker-compose-all.yaml
echo ""
echo "=== Check for healthchecks in database service ==="
rg -A 15 '^ database:' tools/deployment/package/docker-compose-all.yaml | grep -i healthLength of output: 1179
Now let me check which other services depend on the initialization services, to see the full impact chain:
🏁 Script executed:
#!/bin/bash
# Find all services that depend on initialization services
echo "=== Services depending on db-table-creator ==="
rg -B 2 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml | grep -A 2 'depends_on'
echo ""
echo "=== Services depending on results-cache-indices-creator ==="
rg -B 2 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml | grep -A 2 'depends_on'
echo ""
echo "=== Complete list of all services ==="
rg '^ [a-z-]+:$' tools/deployment/package/docker-compose-all.yaml | sed 's/^.*: //' | sed 's/:$//'
echo ""
echo "=== Check which services depend on db-table-creator and results-cache-indices-creator ==="
rg 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml
echo "---"
rg 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yamlLength of output: 1139
🏁 Script executed:
#!/bin/bash
# Get specific service names that depend on initialization services
echo "=== Which services depend on db-table-creator ==="
rg -B 5 'db-table-creator:' tools/deployment/package/docker-compose-all.yaml | rg '^ [a-z-]+:$'
echo ""
echo "=== Which services depend on results-cache-indices-creator ==="
rg -B 5 'results-cache-indices-creator:' tools/deployment/package/docker-compose-all.yaml | rg '^ [a-z-]+:$'Length of output: 166
🏁 Script executed:
#!/bin/bash
# Extract services with their full depends_on sections for better clarity
echo "=== All services with depends_on sections ==="
rg -A 5 'depends_on:' tools/deployment/package/docker-compose-all.yaml | head -150Length of output: 1574
🏁 Script executed:
#!/bin/bash
# Get the service names that depend on db-table-creator and results-cache-indices-creator
echo "=== Services and their complete depends_on config ==="
python3 << 'EOF'
import re
with open('tools/deployment/package/docker-compose-all.yaml', 'r') as f:
content = f.read()
# Find all service blocks
service_pattern = r'^ ([a-z-]+):\s*(?:<<:|image:|hostname:)'
services = re.findall(service_pattern, content, re.MULTILINE)
# For each service, check if it depends on db-table-creator or results-cache-indices-creator
lines = content.split('\n')
current_service = None
in_depends_on = False
depends_info = {}
for i, line in enumerate(lines):
# Check if this is a service definition
service_match = re.match(r'^ ([a-z-]+):', line)
if service_match:
current_service = service_match.group(1)
in_depends_on = False
continue
# Check if this is a depends_on section
if re.match(r'\s+depends_on:', line):
in_depends_on = True
if current_service not in depends_info:
depends_info[current_service] = []
continue
# Check if we're in depends_on and found a dependency
if in_depends_on and current_service:
dep_match = re.match(r'\s+([a-z-]+):', line)
if dep_match:
depends_info[current_service].append(dep_match.group(1))
elif not re.match(r'\s+condition:', line) and line.strip() and not line.startswith(' '):
in_depends_on = False
# Print services that depend on db-table-creator or results-cache-indices-creator
print("Services depending on db-table-creator or results-cache-indices-creator:")
for service, deps in sorted(depends_info.items()):
if 'db-table-creator' in deps or 'results-cache-indices-creator' in deps:
print(f" {service}: {', '.join(deps)}")
EOFLength of output: 855
Critical: depends_on relationships break when services have replicas: 0.
When a bundled service is disabled (e.g., CLP_DATABASE_ENABLED=0), that service is configured with replicas: 0 and no containers are launched. However, initialization services declare hard dependencies with health checks on these disabled services:
- Lines 93-94:
db-table-creatordepends ondatabasewithcondition: service_healthy - Lines 209-210:
results-cache-indices-creatordepends onresults-cachewithcondition: service_healthy
When replicas: 0, no containers start, so the health check never succeeds. This causes initialization services to wait indefinitely for a condition that can never be satisfied.
Impact: Disabling database or results-cache causes db-table-creator or results-cache-indices-creator to hang, which blocks all dependent services from starting:
db-table-creatorblocks: api-server, compression-scheduler, garbage-collector, mcp-server, query-scheduler, webuiresults-cache-indices-creatorblocks: api-server, garbage-collector, mcp-server, webui, reducer (via query-scheduler)
The entire stack fails to start when any bundled service is disabled, preventing the intended configuration flexibility.
Potential solutions:
- Use
required: falseon initialization service dependencies (Docker Compose v2.20.2+) to allow them to proceed when dependencies are unavailable - Implement conditional depends_on logic using profiles or separate compose files for bundled vs. external configurations
- Modify initialization services to validate and connect directly to configured endpoints rather than relying on Docker Compose
service_healthyconditions
|
@junhaoliao ready for another review |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/controller.py (2)
238-285: Redis backend database config missing when redis is not bundled.Lines 271-275 set
CLP_REDIS_BACKEND_DB_COMPRESSIONandCLP_REDIS_BACKEND_DB_QUERYonly when redis is bundled. However, when redis is NOT bundled (line 252 returns early), these env vars are never set. These backend database numbers are required by compression-scheduler, compression-worker, query-scheduler, and query-worker services (referenced in docker-compose-all.yaml lines 231-232, 264-265, 378-379, 417-418) to determine which logical database in Redis to use. Without these vars, services cannot connect to an external redis correctly.Move the backend DB configuration to the runtime method
_set_up_env_for_redis()so it's always set:def _set_up_env_for_redis_bundling(self) -> EnvVarsDict: """ Sets up environment variables and directories for bundling the redis component. :return: Dictionary of environment variables necessary to bundle the component. """ component_name = REDIS_COMPONENT_NAME if BundledService.REDIS not in self._clp_config.bundled: logger.info( "%s is not included in the 'bundled' configuration, skipping service bundling...", component_name, ) # Bundling return EnvVarsDict({"CLP_REDIS_ENABLED": "0"}) logger.info("Setting up environment for bundling %s...", component_name) conf_file = self._conf_dir / "redis" / "redis.conf" data_dir = self._clp_config.data_directory / component_name logs_dir = self._clp_config.logs_directory / component_name validate_redis_config(self._clp_config, conf_file, data_dir, logs_dir) resolved_data_dir = resolve_host_path_in_container(data_dir) resolved_logs_dir = resolve_host_path_in_container(logs_dir) resolved_data_dir.mkdir(exist_ok=True, parents=True) resolved_logs_dir.mkdir(exist_ok=True, parents=True) _chown_paths_if_root(resolved_data_dir, resolved_logs_dir) env_vars = EnvVarsDict() - # Backend databases - env_vars |= { - "CLP_REDIS_BACKEND_DB_COMPRESSION": str( - self._clp_config.redis.compression_backend_database - ), - "CLP_REDIS_BACKEND_DB_QUERY": str(self._clp_config.redis.query_backend_database), - } - # Paths env_vars |= { "CLP_REDIS_CONF_FILE_HOST": str(conf_file), "CLP_REDIS_DATA_DIR_HOST": str(data_dir), "CLP_REDIS_LOGS_DIR_HOST": str(logs_dir), } return env_vars def _set_up_env_for_redis(self) -> EnvVarsDict: """ Sets up environment variables for the Redis component. :return: Dictionary of environment variables necessary to launch the component. """ component_name = REDIS_COMPONENT_NAME logger.info(f"Setting up environment for {component_name}...") env_vars = EnvVarsDict() + # Backend databases + env_vars |= { + "CLP_REDIS_BACKEND_DB_COMPRESSION": str( + self._clp_config.redis.compression_backend_database + ), + "CLP_REDIS_BACKEND_DB_QUERY": str(self._clp_config.redis.query_backend_database), + } + # Connection config env_vars |= { "CLP_REDIS_HOST": _get_ip_from_hostname(self._clp_config.redis.host), "CLP_REDIS_PORT": str(self._clp_config.redis.port), } # Credentials env_vars |= { "CLP_REDIS_PASS": self._clp_config.redis.password, } return env_vars
311-356: Results cache collection name missing when results_cache is not bundled.Lines 344-348 set
CLP_RESULTS_CACHE_STREAM_COLLECTION_NAMEonly when results_cache is bundled. When results_cache is NOT bundled (line 325 returns early), this env var is never set. However, this collection name is required by results-cache-indices-creator and webui services (referenced in docker-compose-all.yaml lines 216 and 604) to know which MongoDB collection to use. Without this var, services cannot interact with an external results cache correctly.Move the collection name configuration to the runtime method
_set_up_env_for_results_cache():def _set_up_env_for_results_cache_bundling(self) -> EnvVarsDict: """ - Sets up environment variables and directories for bundling the queue component. + Sets up environment variables and directories for bundling the results cache component. :return: Dictionary of environment variables necessary to bundle the component. """ component_name = RESULTS_CACHE_COMPONENT_NAME if BundledService.RESULTS_CACHE not in self._clp_config.bundled: logger.info( "%s is not included in the 'bundled' configuration, skipping service bundling...", component_name, ) # Bundling return EnvVarsDict({"CLP_RESULTS_CACHE_ENABLED": "0"}) logger.info("Setting up environment for bundling %s...", component_name) conf_file = self._conf_dir / "mongo" / "mongod.conf" data_dir = self._clp_config.data_directory / component_name logs_dir = self._clp_config.logs_directory / component_name validate_results_cache_config(self._clp_config, conf_file, data_dir, logs_dir) resolved_data_dir = resolve_host_path_in_container(data_dir) resolved_logs_dir = resolve_host_path_in_container(logs_dir) resolved_data_dir.mkdir(exist_ok=True, parents=True) resolved_logs_dir.mkdir(exist_ok=True, parents=True) _chown_paths_if_root(resolved_data_dir, resolved_logs_dir) env_vars = EnvVarsDict() - # Collections - env_vars |= { - "CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME": ( - self._clp_config.results_cache.stream_collection_name - ), - } # Paths env_vars |= { "CLP_RESULTS_CACHE_CONF_FILE_HOST": str(conf_file), "CLP_RESULTS_CACHE_DATA_DIR_HOST": str(data_dir), "CLP_RESULTS_CACHE_LOGS_DIR_HOST": str(logs_dir), } return env_vars def _set_up_env_for_results_cache(self) -> EnvVarsDict: """ Sets up environment variables for the results cache (MongoDB) component. :return: Dictionary of environment variables necessary to launch the component. """ component_name = RESULTS_CACHE_COMPONENT_NAME logger.info(f"Setting up environment for {component_name}...") env_vars = EnvVarsDict() + # Collections + env_vars |= { + "CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME": ( + self._clp_config.results_cache.stream_collection_name + ), + } + # Connection config env_vars |= { "CLP_RESULTS_CACHE_DB_NAME": self._clp_config.results_cache.db_name, "CLP_RESULTS_CACHE_HOST": _get_ip_from_hostname(self._clp_config.results_cache.host), "CLP_RESULTS_CACHE_PORT": str(self._clp_config.results_cache.port), } return env_vars
♻️ Duplicate comments (2)
tools/deployment/package/docker-compose-all.yaml (1)
48-50: Critical: Hard dependencies with health checks block startup when bundled services are disabled.The issue identified in previous reviews persists: when a bundled service is set to
replicas: 0, initialization services that depend on it withcondition: service_healthywill hang indefinitely. Specifically:
db-table-creator(lines 93-94) requiresdatabaseto be healthyresults-cache-indices-creator(lines 209-210) requiresresults-cacheto be healthyWhen
CLP_DATABASE_ENABLED=0orCLP_RESULTS_CACHE_ENABLED=0, no containers start for these services, the health condition can never be satisfied, and downstream services (api-server, compression-scheduler, garbage-collector, mcp-server, query-scheduler, webui, reducer) remain blocked.The suggested solutions from the previous review remain valid:
- Use
required: falseon dependencies (Docker Compose v2.20.2+)- Implement conditional
depends_onusing profiles or separate compose files- Modify initialization services to validate endpoints directly rather than relying on Docker Compose health checks
Also applies to: 83-94, 107-109, 134-136, 173-175, 203-210
components/package-template/src/etc/clp-config.yaml (1)
18-19: Consider adding an explanatory comment for thebundledconfiguration.Users may not immediately understand what the
bundledlist controls. Adding a brief comment would clarify that this determines which third-party services CLP will deploy internally versus expect externally.Add an explanatory comment:
#credentials_file_path: "etc/credentials.yaml" # +## Which third-party services CLP should deploy. Omit services to use external instances instead. #bundled: ["database", "queue", "redis", "results_cache"] #
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/controller.py(9 hunks)components/package-template/src/etc/clp-config.yaml(1 hunks)tools/deployment/package/docker-compose-all.yaml(4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
BundledService(105-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
There was a problem hiding this comment.
the changes look clean to me and i tested that modifying the bundled array in clp-config.yaml can effectively control whether some 3rd party service should be started or not as part of the docker compose project. Auto completion of the available items, provided by # yaml-language-server: $schema=../usr/share/config-schemas/clp-config.schema.json, also works as intended.
For the title, how about:
feat(clp-package): Support selectively enabling 3rd-party services via `bundled` config (resolves #1623).
bundled config (resolves #1623).
|
While this PR effectively skips port binding checks and service inits, to allow external 3rd-party services to be used without manual configs in the generated @coderabbitai create an issue to track the full support of configuring external 3rd party services here're the proposed changes Index: components/clp-py-utils/clp_py_utils/clp_config.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/components/clp-py-utils/clp_py_utils/clp_config.py b/components/clp-py-utils/clp_py_utils/clp_config.py
--- a/components/clp-py-utils/clp_py_utils/clp_config.py (revision fd474eb13fb6e72dd11e09ce5152cf17e3728b81)
+++ b/components/clp-py-utils/clp_py_utils/clp_config.py (date 1764245719061)
@@ -867,10 +867,14 @@
self.archive_output.storage.transform_for_container()
self.stream_output.storage.transform_for_container()
- self.database.transform_for_container()
- self.queue.transform_for_container()
- self.redis.transform_for_container()
- self.results_cache.transform_for_container()
+ if BundledService.DATABASE in self.bundled:
+ self.database.transform_for_container()
+ if BundledService.QUEUE in self.bundled:
+ self.queue.transform_for_container()
+ if BundledService.REDIS in self.bundled:
+ self.redis.transform_for_container()
+ if BundledService.RESULTS_CACHE in self.bundled:
+ self.results_cache.transform_for_container()
self.query_scheduler.transform_for_container()
self.reducer.transform_for_container()
if self.package.query_engine == QueryEngine.PRESTO and self.presto is not None:
Index: tools/deployment/package/docker-compose-all.yaml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/tools/deployment/package/docker-compose-all.yaml b/tools/deployment/package/docker-compose-all.yaml
--- a/tools/deployment/package/docker-compose-all.yaml (revision fd474eb13fb6e72dd11e09ce5152cf17e3728b81)
+++ b/tools/deployment/package/docker-compose-all.yaml (date 1764250901256)
@@ -212,7 +212,8 @@
"python3",
"-u",
"-m", "clp_py_utils.initialize-results-cache",
- "--uri", "mongodb://results_cache:27017/${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
+ "--uri", "mongodb://${CLP_RESULTS_CACHE_HOST:-127.0.0.1}:${CLP_RESULTS_CACHE_PORT:-27017}\
+ /${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
"--stream-collection", "${CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME:-stream-files}",
]
@@ -222,13 +223,15 @@
stop_grace_period: "300s"
environment:
BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
- :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+ :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+ :${CLP_QUEUE_PORT:-5672}"
CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_SCHEDULER_LOGGING_LEVEL:-INFO}"
CLP_LOGS_DIR: "/var/log/compression_scheduler"
PYTHONPATH: "/opt/clp/lib/python3/site-packages"
- RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+ RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+ @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
/${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
volumes:
- *volume_clp_config_readonly
@@ -254,14 +257,16 @@
hostname: "compression_worker"
environment:
BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
- :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+ :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+ :${CLP_QUEUE_PORT:-5672}"
CLP_CONFIG_PATH: "/etc/clp-config.yaml"
CLP_HOME: "/opt/clp"
CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
CLP_LOGS_DIR: "/var/log/compression_worker"
CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
PYTHONPATH: "/opt/clp/lib/python3/site-packages"
- RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+ RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+ @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
/${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
volumes:
- *volume_clp_config_readonly
@@ -369,13 +374,15 @@
stop_grace_period: "10s"
environment:
BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
- :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+ :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+ :${CLP_QUEUE_PORT:-5672}"
CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
CLP_LOGGING_LEVEL: "${CLP_QUERY_SCHEDULER_LOGGING_LEVEL:-INFO}"
CLP_LOGS_DIR: "/var/log/query_scheduler"
PYTHONPATH: "/opt/clp/lib/python3/site-packages"
- RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+ RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+ @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
/${CLP_REDIS_BACKEND_DB_QUERY:-0}"
volumes:
- *volume_clp_config_readonly
@@ -407,14 +414,16 @@
hostname: "query_worker"
environment:
BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
- :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+ :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-127.0.0.1}\
+ :${CLP_QUEUE_PORT:-5672}"
CLP_CONFIG_PATH: "/etc/clp-config.yaml"
CLP_HOME: "/opt/clp"
CLP_LOGGING_LEVEL: "${CLP_QUERY_WORKER_LOGGING_LEVEL:-INFO}"
CLP_LOGS_DIR: "/var/log/query_worker"
CLP_WORKER_LOG_PATH: "/var/log/query_worker/worker.log"
PYTHONPATH: "/opt/clp/lib/python3/site-packages"
- RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+ RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+ @${CLP_REDIS_HOST:-127.0.0.1}:${CLP_REDIS_PORT:-6379}\
/${CLP_REDIS_BACKEND_DB_QUERY:-0}"
volumes:
- *volume_clp_config_readonlyto test: |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain@junhaoliao I'll create an issue to track the full support for configuring external 3rd-party services. 🏁 Script executed: Length of output: 42 @junhaoliao I've created issue #1680 to track the full support for configuring external 3rd-party services. The issue covers both the conditional 🧠 Learnings used |
| # Paths | ||
| env_vars |= { | ||
| "CLP_QUEUE_LOGS_DIR_HOST": str(logs_dir), | ||
| } |
There was a problem hiding this comment.
this was accidentally removed
@coderabbitai create an issue to track
There was a problem hiding this comment.
🧩 Analysis chain
@junhaoliao, I'll create an issue to track this regression.
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue to track the accidentally removed environment variable
gh issue create \
--repo y-scope/clp \
--title "Restore accidentally removed CLP_QUEUE_LOGS_DIR_HOST environment variable" \
--body "## Description
During the refactoring in PR #1648, the \`CLP_QUEUE_LOGS_DIR_HOST\` environment variable was accidentally removed from the \`_set_up_env_for_queue\` method in \`components/clp-package-utils/clp_package_utils/controller.py\`.
## Context
- Reported by: @junhaoliao
- PR: https://github.com/y-scope/clp/pull/1648
- Comment: https://github.com/y-scope/clp/pull/1648#discussion_r2572694798
The environment variable needs to be restored to ensure proper queue component configuration." \
--assignee junhaoliao
Length of output: 42
@junhaoliao, I've created issue #1693 to track restoring the accidentally removed CLP_QUEUE_LOGS_DIR_HOST environment variable: #1693
🧬 Code Graph Analysis Results
File: components/clp-py-utils/clp_py_utils/core.py
At line range 82-103:
def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
"""
Resolves a host path and translates it into its container-mount equivalent absolute path. See
`resolve_host_path` for details about the resolution.
:param host_path: The host path.
:return: The translated path (with /mnt/host prefix).
"""
host_path = host_path.expanduser()
if not host_path.is_absolute():
pwd_host = os.getenv("CLP_PWD_HOST")
if pwd_host is None:
raise ValueError("Relative host path provided but CLP_PWD_HOST is not set.")
host_path = pathlib.Path(pwd_host) / host_path
host_path = host_path.absolute()
resolved_path = _resolve_symlinks_in_path(host_path)
if resolved_path is not None:
return resolved_path
return CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
At line range 296-297:
def get_archives_table_name(table_prefix: str, dataset: str | None) -> str:
return _get_table_name(table_prefix, ARCHIVES_TABLE_SUFFIX, dataset)
At line range 304-305:
def get_datasets_table_name(table_prefix: str) -> str:
return _get_table_name(table_prefix, DATASETS_TABLE_SUFFIX, None)
At line range 308-309:
def get_files_table_name(table_prefix: str, dataset: str | None) -> str:
return _get_table_name(table_prefix, FILES_TABLE_SUFFIX, dataset)
File: components/clp-package-utils/clp_package_utils/general.py
At line range 172-195:
def check_docker_dependencies(should_compose_project_be_running: bool, project_name: str):
"""
Checks if Docker and Docker Compose are installed, and whether a Docker Compose project is
running.
:param should_compose_project_be_running:
:param project_name: The Docker Compose project name to check.
:raise DockerNotAvailableError: If any Docker dependency is not installed.
:raise DockerComposeProjectNotRunningError: If the project isn't running when it should be.
:raise DockerComposeProjectAlreadyRunningError: If the project is running when it shouldn't be.
"""
try:
subprocess.check_output(
["docker", "--version"],
stderr=subprocess.STDOUT,
)
except subprocess.CalledProcessError as e:
raise DockerNotAvailableError("docker is not installed or available on the path", e) from e
is_running = _is_docker_compose_project_running(project_name)
if should_compose_project_be_running and not is_running:
raise DockerComposeProjectNotRunningError(project_name)
if not should_compose_project_be_running and is_running:
raise DockerComposeProjectAlreadyRunningError(project_name)
At line range 82-86:
class DockerComposeProjectNotRunningError(DockerDependencyError):
"""Raised when a Docker Compose project is not running but should be."""
def __init__(self, project_name: str) -> None:
super().__init__(errno.ESRCH, f"Docker Compose project '{project_name}' is not running.")
At line range 57-58:
class DockerDependencyError(OSError):
"""Base class for errors related to Docker dependencies."""
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 105-109:
class BundledService(LowercaseStrEnum):
DATABASE = auto()
QUEUE = auto()
REDIS = auto()
RESULTS_CACHE = auto()
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 633-877 (selected content):
class ClpConfig(BaseModel):
container_image_ref: NonEmptyStr | None = None
logs_input: FsIngestionConfig | S3IngestionConfig = FsIngestionConfig()
bundled: list[BundledServiceStr] = [
BundledService.DATABASE,
BundledService.QUEUE,
BundledService.REDIS,
BundledService.RESULTS_CACHE,
]
package: Package = Package()
database: Database = Database()
queue: Queue = Queue()
redis: Redis = Redis()
reducer: Reducer = Reducer()
results_cache: ResultsCache = ResultsCache()
compression_scheduler: CompressionScheduler = CompressionScheduler()
query_scheduler: QueryScheduler = QueryScheduler()
compression_worker: CompressionWorker = CompressionWorker()
query_worker: QueryWorker = QueryWorker()
webui: WebUi = WebUi()
garbage_collector: GarbageCollector = GarbageCollector()
api_server: ApiServer = ApiServer()
credentials_file_path: SerializablePath = CLP_DEFAULT_CREDENTIALS_FILE_PATH
mcp_server: McpServer | None = None
presto: Presto | None = None
archive_output: ArchiveOutput = ArchiveOutput()
stream_output: StreamOutput = StreamOutput()
data_directory: SerializablePath = CLP_DEFAULT_DATA_DIRECTORY_PATH
logs_directory: SerializablePath = CLP_DEFAULT_LOG_DIRECTORY_PATH
tmp_directory: SerializablePath = CLP_DEFAULT_TMP_DIRECTORY_PATH
aws_config_directory: SerializablePath | None = None
_container_image_id_path: SerializablePath = PrivateAttr(
default=CLP_PACKAGE_CONTAINER_IMAGE_ID_PATH
)
_version_file_path: SerializablePath = PrivateAttr(default=CLP_VERSION_FILE_PATH)
@field_validator("aws_config_directory")
@classmethod
def expand_profile_user_home(cls, value: SerializablePath | None) -> SerializablePath | None:
if value is not None:
value = value.expanduser()
return value
def make_config_paths_absolute(self, clp_home: pathlib.Path):
if StorageType.FS == self.logs_input.type:
self.logs_input.make_config_paths_absolute(clp_home)
self.credentials_file_path = make_config_path_absolute(clp_home, self.credentials_file_path)
self.archive_output.storage.make_config_paths_absolute(clp_home)
self.stream_output.storage.make_config_paths_absolute(clp_home)
self.data_directory = make_config_path_absolute(clp_home, self.data_directory)
self.logs_directory = make_config_path_absolute(clp_home, self.logs_directory)
self.tmp_directory = make_config_path_absolute(clp_home, self.tmp_directory)
self._container_image_id_path = make_config_path_absolute(
clp_home, self._container_image_id_path
)
self._version_file_path = make_config_path_absolute(clp_home, self._version_file_path)
def validate_logs_input_config(self, use_host_mount: bool = False):
logs_input_type = self.logs_input.type
if StorageType.FS == logs_input_type:
# NOTE: This can't be a pydantic validator since input_logs_dir might be a
# package-relative path that will only be resolved after pydantic validation
input_logs_dir = self.logs_input.directory
resolved_input_logs_dir = (
resolve_host_path_in_container(input_logs_dir) if use_host_mount else input_logs_dir
)
if not resolved_input_logs_dir.exists():
raise ValueError(f"logs_input.directory '{input_logs_dir}' doesn't exist.")
if not resolved_input_logs_dir.is_dir():
raise ValueError(f"logs_input.directory '{input_logs_dir}' is not a directory.")
if StorageType.S3 == logs_input_type and StorageEngine.CLP_S != self.package.storage_engine:
raise ValueError(
f"logs_input.type = 's3' is only supported with package.storage_engine"
f" = '{StorageEngine.CLP_S}'"
)
def validate_archive_output_config(self, use_host_mount: bool = False):
if (
StorageType.S3 == self.archive_output.storage.type
and StorageEngine.CLP_S != self.package.storage_engine
):
raise ValueError(
f"archive_output.storage.type = 's3' is only supported with package.storage_engine"
f" = '{StorageEngine.CLP_S}'"
)
archive_output_dir = self.archive_output.get_directory()
resolved_archive_output_dir = (
resolve_host_path_in_container(archive_output_dir)
if use_host_mount
else archive_output_dir
)
try:
validate_path_could_be_dir(resolved_archive_output_dir)
except ValueError as ex:
raise ValueError(f"archive_output.storage's directory is invalid: {ex}")
def validate_stream_output_config(self, use_host_mount: bool = False):
if (
StorageType.S3 == self.stream_output.storage.type
and StorageEngine.CLP_S != self.package.storage_engine
):
raise ValueError(
f"stream_output.storage.type = 's3' is only supported with package.storage_engine"
f" = '{StorageEngine.CLP_S}'"
)
stream_output_dir = self.stream_output.get_directory()
resolved_stream_output_dir = (
resolve_host_path_in_container(stream_output_dir)
if use_host_mount
else stream_output_dir
)
try:
validate_path_could_be_dir(resolved_stream_output_dir)
except ValueError as ex:
raise ValueError(f"stream_output.storage's directory is invalid: {ex}")
def validate_data_dir(self, use_host_mount: bool = False):
data_dir = self.data_directory
resolved_data_dir = resolve_host_path_in_container(data_dir) if use_host_mount else data_dir
try:
validate_path_could_be_dir(resolved_data_dir)
except ValueError as ex:
raise ValueError(f"data_directory is invalid: {ex}")
def validate_logs_dir(self, use_host_mount: bool = False):
logs_dir = self.logs_directory
resolved_logs_dir = resolve_host_path_in_container(logs_dir) if use_host_mount else logs_dir
try:
validate_path_could_be_dir(resolved_logs_dir)
except ValueError as ex:
raise ValueError(f"logs_directory is invalid: {ex}")
def validate_tmp_dir(self, use_host_mount: bool = False):
tmp_dir = self.tmp_directory
resolved_tmp_dir = resolve_host_path_in_container(tmp_dir) if use_host_mount else tmp_dir
try:
validate_path_could_be_dir(resolved_tmp_dir)
except ValueError as ex:
raise ValueError(f"tmp_directory is invalid: {ex}")
def validate_aws_config_dir(self, use_host_mount: bool = False):
profile_auth_used = False
auth_configs = []
if StorageType.S3 == self.logs_input.type:
auth_configs.append(self.logs_input.aws_authentication)
if StorageType.S3 == self.archive_output.storage.type:
auth_configs.append(self.archive_output.storage.s3_config.aws_authentication)
if StorageType.S3 == self.stream_output.storage.type:
auth_configs.append(self.stream_output.storage.s3_config.aws_authentication)
for auth in auth_configs:
if AwsAuthType.profile == auth.type:
profile_auth_used = True
break
if profile_auth_used:
if self.aws_config_directory is None:
raise ValueError(
"aws_config_directory must be set when using profile authentication"
)
resolved_aws_config_dir = (
resolve_host_path_in_container(self.aws_config_directory)
if use_host_mount
else self.aws_config_directory
)
if not resolved_aws_config_dir.exists():
raise ValueError(
f"aws_config_directory does not exist: '{self.aws_config_directory}'"
)
if not profile_auth_used and self.aws_config_directory is not None:
raise ValueError(
"aws_config_directory should not be set when profile authentication is not used"
)
def load_container_image_ref(self):
if self.container_image_ref is not None:
# Accept configured value for debug purposes
return
if self._container_image_id_path.exists():
with open(self._container_image_id_path) as image_id_file:
self.container_image_ref = image_id_file.read().strip()
else:
with open(self._version_file_path) as version_file:
clp_package_version = version_file.read().strip()
self.container_image_ref = f"ghcr.io/y-scope/clp/clp-package:{clp_package_version}"
def get_shared_config_file_path(self) -> pathlib.Path:
return self.logs_directory / CLP_SHARED_CONFIG_FILENAME
def get_deployment_type(self) -> DeploymentType:
if QueryEngine.PRESTO == self.package.query_engine:
return DeploymentType.BASE
return DeploymentType.FULL
def dump_to_primitive_dict(self):
custom_serialized_fields = {
"database",
"queue",
"redis",
}
d = self.model_dump(exclude=custom_serialized_fields)
for key in custom_serialized_fields:
d[key] = getattr(self, key).dump_to_primitive_dict()
return d
@model_validator(mode="after")
def validate_presto_config(self):
query_engine = self.package.query_engine
presto = self.presto
if query_engine == QueryEngine.PRESTO and presto is None:
raise ValueError(
f"`presto` config must be non-null when query_engine is `{query_engine}`"
)
return self
def transform_for_container(self):
"""
Converts all relevant directories to absolute paths inside the container, and updates
component hostnames and ports to their container service names and default ports.
"""
self.data_directory = pathlib.Path("/") / CLP_DEFAULT_DATA_DIRECTORY_PATH
self.logs_directory = pathlib.Path("/") / CLP_DEFAULT_LOG_DIRECTORY_PATH
self.tmp_directory = pathlib.Path("/") / CLP_DEFAULT_TMP_DIRECTORY_PATH
if self.aws_config_directory is not None:
self.aws_config_directory = CONTAINER_AWS_CONFIG_DIRECTORY
self.logs_input.transform_for_container()
self.archive_output.storage.transform_for_container()
self.stream_output.storage.transform_for_container()
self.database.transform_for_container()
self.queue.transform_for_container()
self.redis.transform_for_container()
self.results_cache.transform_for_container()
self.query_scheduler.transform_for_container()
self.reducer.transform_for_container()
if self.package.query_engine == QueryEngine.PRESTO and self.presto is not None:
self.presto.transform_for_container()
File: components/clp-package-utils/clp_package_utils/general.py
At line range 365-373:
def dump_shared_container_config(container_clp_config: ClpConfig, clp_config: ClpConfig):
"""
Dumps the given container config to `CLP_SHARED_CONFIG_FILENAME` in the logs directory, so that
it's accessible in the container.
:param container_clp_config:
:param clp_config:
"""
dump_container_config(container_clp_config, clp_config, CLP_SHARED_CONFIG_FILENAME)
At line range 315-325:
def generate_docker_compose_container_config(clp_config: ClpConfig) -> ClpConfig:
"""
Copies the given config and transforms mount paths and hosts for Docker Compose.
:param clp_config:
:return: The container config.
"""
container_clp_config = clp_config.model_copy(deep=True)
container_clp_config.transform_for_container()
return container_clp_config
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 145-161 (snippet indicating CLP_HOME resolution, shown for context):
def get_clp_home():
# Determine CLP_HOME from an environment variable or this script's path
clp_home = None
if "CLP_HOME" in os.environ:
clp_home = pathlib.Path(os.environ["CLP_HOME"])
else:
for path in pathlib.Path(__file__).resolve().parents:
if "lib" == path.name:
clp_home = path.parent
break
if clp_home is None:
raise ValueError("CLP_HOME is not set and could not be determined automatically.")
if not clp_home.exists():
raise ValueError("CLP_HOME set to nonexistent path.")
return clp_home.resolve()
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 672-679:
def is_retention_period_configured(clp_config: ClpConfig) -> bool:
if clp_config.archive_output.retention_period is not None:
return True
if clp_config.results_cache.retention_period is not None:
return True
return False
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 502-514:
def validate_db_config(
clp_config: ClpConfig,
component_config: pathlib.Path,
data_dir: pathlib.Path,
logs_dir: pathlib.Path,
):
resolved_component_config = resolve_host_path_in_container(component_config)
if not resolved_component_config.exists():
raise ValueError(f"{DB_COMPONENT_NAME} configuration file missing: '{component_config}'.")
_validate_data_directory(data_dir, DB_COMPONENT_NAME)
_validate_log_directory(logs_dir, DB_COMPONENT_NAME)
validate_port(f"{DB_COMPONENT_NAME}.port", clp_config.database.host, clp_config.database.port)
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 580-590:
def validate_webui_config(
clp_config: ClpConfig,
client_settings_json_path: pathlib.Path,
server_settings_json_path: pathlib.Path,
):
for path in [client_settings_json_path, server_settings_json_path]:
resolved_path = resolve_host_path_in_container(path)
if not resolved_path.exists():
raise ValueError(f"{WEBUI_COMPONENT_NAME} {path} is not a valid path to settings.json")
validate_port(f"{WEBUI_COMPONENT_NAME}.port", clp_config.webui.host, clp_config.webui.port)
File: components/clp-py-utils/clp_py_utils/clp_config.py
At line range 829-832:
def get_deployment_type(self) -> DeploymentType:
if QueryEngine.PRESTO == self.package.query_engine:
return DeploymentType.BASE
return DeploymentType.FULL
🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1136
File: components/webui/client/src/pages/IngestPage/Details/sql.ts:1-1
Timestamp: 2025-07-29T21:00:07.757Z
Learning: User junhaoliao requested creating a GitHub issue to track server-side SQL query error handling improvements, specifically to prevent uncaught query errors from causing 500 errors to reach the client.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
…a `bundled` config (resolves y-scope#1623). (y-scope#1648)
Description
Resolves #1623 using the suggested approach
Checklist
breaking change.
Validation performed
enable all bundled services
compress a file
search in webui works.
remove one service
it isn't boot when running start-clp
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.