feat(deployment): Support configuration of external third-party services (resolves #1680).#1681
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughContainer-specific host/port mappings and JDBC URL construction were parameterized and conditionalized: transform_for_container methods now accept an is_bundled flag and container host/port/env provisioning is only applied when a service is marked bundled. Deployment templates and docs updated accordingly. Changes
Sequence Diagram(s)sequenceDiagram
participant Operator
participant Controller
participant CLPConfig
participant Database
participant Queue
participant Redis
participant ResultsCache
participant DockerCompose
Operator->>Controller: request containerized deployment
Controller->>CLPConfig: load clp-config (bundled list)
CLPConfig->>Controller: bundled membership for services
Controller->>Database: transform_for_container(is_bundled?)
Controller->>Queue: transform_for_container(is_bundled?)
Controller->>Redis: transform_for_container(is_bundled?)
Controller->>ResultsCache: transform_for_container(is_bundled?)
Database-->>Controller: return host/port/env only if bundled
Queue-->>Controller: return host/port/env only if bundled
Redis-->>Controller: return host/port/env only if bundled
ResultsCache-->>Controller: return host/port/env only if bundled
Controller->>DockerCompose: emit env vars / extra_hosts accordingly
DockerCompose->>Operator: start containers with configured endpoints
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/src/user-docs/guides-external-database.md (1)
176-215: Clarify how managed-service “connection strings” map onto the new host/port configThe new bundling and external
database/results_cachehost/port examples look good and line up with the code. One small docs mismatch: the earlier “AWS DocumentDB or MongoDB Atlas” section tells users to “use the provided connection string when configuring CLP (see below)”, but the new config steps only show host/port fields.To avoid confusion for users coming from a full MongoDB/Atlas URI, consider adding a short clarification here, for example:
- Explicitly state that CLP currently expects separate
host/port(plus optionaldb_name), not a rawmongodb+srv://...URI; and/or- Briefly describe how to extract the host and port from a typical managed-service connection string.
That would align the step-by-step config with the earlier managed-service guidance.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 82f73d8 and f6be19af0140a79bcd170147451f64a674bacfb5.
📒 Files selected for processing (6)
components/clp-py-utils/clp_py_utils/clp_config.py(1 hunks)docs/src/user-docs/guides-external-database.md(1 hunks)docs/src/user-docs/guides-multi-host.md(2 hunks)docs/src/user-docs/quick-start/clp-json.md(1 hunks)docs/src/user-docs/quick-start/clp-text.md(1 hunks)tools/deployment/package/docker-compose-all.yaml(5 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-06-18T20:39:05.899Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 968
File: docs/src/user-guide/quick-start/overview.md:73-109
Timestamp: 2025-06-18T20:39:05.899Z
Learning: The CLP project team prefers to use video content to demonstrate detailed procedural steps (like tarball extraction) rather than including every step in the written documentation, keeping the docs focused on conceptual guidance.
Applied to files:
docs/src/user-docs/guides-multi-host.mddocs/src/user-docs/quick-start/clp-text.md
📚 Learning: 2025-07-05T03:38:16.779Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1069
File: docs/src/user-guide/quick-start/clp-json.md:135-152
Timestamp: 2025-07-05T03:38:16.779Z
Learning: The CLP project team has decided to refrain from using include directives in their documentation at present, preferring to maintain duplicated content rather than using shared includes or partials for de-duplication.
Applied to files:
docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-06-18T20:48:48.990Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.
Applied to files:
docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
Repo: y-scope/clp PR: 1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.
Applied to files:
docs/src/user-docs/guides-multi-host.mdtools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-07-01T14:52:15.217Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across different platforms (e.g., using separate update and install commands like `apk update && apk add`, `apt update && apt install`, `yum update && yum install`) rather than platform-specific optimizations, to ensure uniform script structure and readability.
Applied to files:
docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
870-877: Bundled checks correctly avoid clobbering external service endpointsConditioning the
transform_for_container()calls onBundledServicemembership cleanly prevents overwriting host/port when services are external while preserving existing behaviour when they’re bundled. This aligns with the new docs and bundling model.docs/src/user-docs/quick-start/clp-json.md (1)
21-24: Tip text and external-database cross-link read clearlyThe expanded tip nicely combines
--setup-onlyguidance with a clear pointer to the external-database guide, and the relative link path looks correct for this location.docs/src/user-docs/quick-start/clp-text.md (1)
23-26: Consistent external-database guidance across quick-startsThe updated tip matches the clp-json quick-start and clearly directs users who want external databases/third-party services to the dedicated guide. Consistent and easy to follow.
docs/src/user-docs/guides-multi-host.md (1)
40-41: Bundled vs external service guidance is aligned and clearThe multi-host steps now cleanly explain configuring the
bundledlist and explicitly tie external-database usage to skippingdatabase/results-cacheservices while still running the init jobs. This matches the config model and should make multi-host + external DB setups much easier to reason about.Also applies to: 55-65, 123-127
| @@ -37,7 +37,7 @@ rules: | |||
|
|
|||
| line-length: | |||
| max: 100 | |||
| allow-non-breakable-words: false | |||
| allow-non-breakable-words: true | |||
There was a problem hiding this comment.
this has to be set as true; otherwise, the clp-config.yaml with the long URL inside would fail the rule.
@coderabbitai create an issue to track this lint rule change, and i will submit the lint rule change in a separate PR
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue to track the yamllint rule change
gh issue create \
--repo y-scope/clp \
--title "Update yamllint configuration to allow non-breakable words in long lines" \
--body "## Context
This issue tracks the yamllint configuration change to set \`allow-non-breakable-words: true\` in the \`line-length\` rule. This change is necessary because \`clp-config.yaml\` contains long URLs that would fail the yamllint check with the current setting.
## Background
While working on PR #1681 (support for external third-party services configuration), it was identified that the yamllint \`allow-non-breakable-words\` setting needs to be changed from \`false\` to \`true\` to accommodate long URLs in configuration files.
## Related
- PR: https://github.com/y-scope/clp/pull/1681
- Comment: https://github.com/y-scope/clp/pull/1681#discussion_r2569846542
- Requested by: @junhaoliao
## Proposed Change
In \`.yamllint.yaml\`:
\`\`\`yaml
line-length:
max: 100
allow-non-breakable-words: true # Changed from false
\`\`\`" \
--assignee junhaoliao
Length of output: 42
@junhaoliao, I've created issue #1683 to track this yamllint configuration change. The issue includes the context about why this change is needed (to accommodate long URLs in clp-config.yaml) and references this PR and comment as backlinks. The issue has been assigned to you for the follow-up PR.
🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/src/user-docs/guides-external-database.md (1)
218-223: Multi-host deployment note is helpful but could be more explicit.The note clarifies that
db-table-creatorandresults-cache-indices-creatorjobs are still needed, but it would be helpful to explicitly state that these initialization jobs should be run before starting other services. Consider adding a reference or cross-link to the initialization sequence in the multi-host guide.Suggested refinement:
When using external databases in a multi-host deployment, you do **not** need to start the `database` and `results-cache` Docker Compose services. Skip those services when following the -[multi-host deployment guide][multi-host-guide]. However, you still need to run the database -initialization jobs (`db-table-creator` and `results-cache-indices-creator`). +[multi-host deployment guide][multi-host-guide]. However, you still need to run the database +initialization jobs (`db-table-creator` and `results-cache-indices-creator`) **on one host before** +starting the other CLP services.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between c6d50515a0b2f9782fdd6a0c230e3a09db81583c and cc2aa722850e767e7c374445772df84e6f4f48d7.
📒 Files selected for processing (3)
.yamllint.yaml(1 hunks)components/package-template/src/etc/clp-config.yaml(2 hunks)docs/src/user-docs/guides-external-database.md(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-08T17:20:02.560Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1078
File: components/webui/server/src/plugins/MongoSocketIoServer/index.ts:105-108
Timestamp: 2025-07-08T17:20:02.560Z
Learning: The y-scope/clp codebase has a 100-character line length limit that takes precedence over other style preferences like template literals when they conflict.
Applied to files:
.yamllint.yaml
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.
Applied to files:
components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Applied to files:
components/package-template/src/etc/clp-config.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
.yamllint.yaml (1)
40-40: Verify that non-breakable-words relaxation is necessary.The change allows non-breakable words to exceed the 100-character line-length limit. Confirm that the new environment-variable-based URIs and multi-service configurations in the accompanying changes (ClpConfig transformations, docker-compose-all.yaml) actually introduce non-breakable tokens that exceed this threshold and justify relaxing the constraint.
docs/src/user-docs/guides-external-database.md (2)
178-189: YAML syntax and bundled services example look good.The structured step-by-step configuration flow is clear and provides helpful YAML examples. The commented-out services demonstrate which services can be removed from the bundled list to use external instances.
210-216: I'm unable to access the repository to verify the credentials.yaml structure against the codebase configuration expectations due to cloning failures. However, I can provide a rewritten review comment that reflects this limitation:
Unable to verify credentials.yaml structure without codebase access.
The documentation shows credentials nested under
database:withusernameandpasswordfields. This structure should be verified against the actual configuration parser in the codebase to confirm it matches expectations for external database credential handling. Due to technical limitations preventing repository access, this verification could not be completed.components/package-template/src/etc/clp-config.yaml (1)
19-20: Documentation URL reference is correct and accessible.The URL
https://docs.yscope.com/clp/main/user-docs/guides-external-database.htmlhas been verified to exist and contains the "External database setup" guide with sections covering MariaDB/MySQL and MongoDB configuration for CLP. The anchor fragment#configuring-clp-to-use-external-databasesfollows standard markdown heading conventions and appropriately references a configuration section on that page.
| :::{important} | ||
| If your external service is running on the host machine, **do not use `127.0.0.1`** as the host. | ||
| The CLP components run in containers, so `127.0.0.1` would refer to the container itself. | ||
| Instead, use: | ||
| - `host.docker.internal` to refer to the host machine. | ||
| - The host's actual network IP address (e.g., `192.168.1.10`) | ||
| ::: |
There was a problem hiding this comment.
Host networking guidance needs platform-specific clarification.
The note about 127.0.0.1 and host.docker.internal is helpful, but host.docker.internal is primarily available on Docker Desktop (Mac/Windows). For Linux Docker hosts without Docker Desktop, users need different guidance—typically using the actual host IP or DNS hostname. Consider adding a platform-specific note.
Suggested addition (after line 207):
- `host.docker.internal` to refer to the host machine.
- The host's actual network IP address (e.g., `192.168.1.10`)
+
+ ::::{note}
+ **For Linux Docker hosts**: `host.docker.internal` is not available. Use the actual host IP address or hostname instead.
+ ::::Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
not sure whether this coderabbit comment is valid or not
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
host.docker.internal support Linux Docker 2024 2025
💡 Result:
Short answer: Yes — but with caveats.
- Docker Desktop (macOS, Windows — and Docker Desktop for Linux) provides host.docker.internal by default so containers can resolve the host localhost. [1][2][6]
- On plain Docker Engine on Linux you must create the mapping yourself (common options):
- docker run --add-host=host.docker.internal:host-gateway (or in Compose: extra_hosts: - "host.docker.internal:host-gateway"). The dockerd daemon exposes the special host-gateway value and can be configured with host-gateway-ip. [1][3][5]
- Or map the host gateway/docker0 IP directly (e.g. 172.17.0.1) in /etc/hosts or via extra_hosts. [4][2]
References
- Docker docs — dockerd / host-gateway and --add-host examples. [1]
- Docker docs — host.docker.internal behavior / Networking how‑tos. [2]
- Docker Engine 20.10 release notes (adds Linux support hooks). [3]
- Community notes / common workarounds (docker0 / extra_hosts). [4]
If you want, I can show the exact docker run or docker‑compose lines to use for your setup.
@hoophalab The comment is valid. On Docker Desktop (Mac/Windows/Docker Desktop for Linux), host.docker.internal is available by default, but on plain Docker Engine on Linux, you must create the mapping yourself.
For Linux Docker Engine users, they need to use docker run --add-host=host.docker.internal:host-gateway or configure it in Docker Compose with extra_hosts: - "host.docker.internal:host-gateway".
Suggested documentation improvement:
Update the important block in the documentation (around line 203-207) to include platform-specific guidance:
:::{important}
If your external service is running on the host machine, **do not use `127.0.0.1`** as the host.
The CLP components run in containers, so `127.0.0.1` would refer to the container itself.
Instead, use:
- `host.docker.internal` to refer to the host machine (available by default on Docker Desktop)
- The host's actual network IP address (e.g., `192.168.1.10`)
**For Linux Docker Engine users** (not Docker Desktop): `host.docker.internal` requires manual
configuration. Add the following to your `docker-compose-all.yaml` services that need to
connect to the host:
```yaml
extra_hosts:
- "host.docker.internal:host-gateway"Alternatively, use the host's actual IP address instead.
:::
This will help Linux users avoid confusion and provide them with a clear path forward.
---
<details>
<summary>🧠 Learnings used</summary>
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
</details>
<!-- This is an auto-generated reply by CodeRabbit -->
There was a problem hiding this comment.
i think this requires more thoughts / investigations. Let's put this review on hold
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| @@ -1,4 +1,5 @@ | |||
| # yaml-language-server: $schema=../usr/share/config-schemas/clp-config.schema.json | |||
| # | |||
| @@ -15,6 +16,8 @@ | |||
| ## File containing credentials for services | |||
| #credentials_file_path: "etc/credentials.yaml" | |||
| # | |||
| ## Remove any bundled services below if you wish to use your own. For more information, see | |||
| ## https://docs.yscope.com/clp/main/user-docs/guides-external-database.html#configuring-clp-to-use-external-databases | |||
| #bundled: ["database", "queue", "redis", "results_cache"] | |||
There was a problem hiding this comment.
to keep consistent with the doc
| #bundled: ["database", "queue", "redis", "results_cache"] | |
| #bundled: | |
| # - "database" | |
| # - "queue" | |
| # - "redis" | |
| # - "results_cache" |
| :::{important} | ||
| If your external service is running on the host machine, **do not use `127.0.0.1`** as the host. | ||
| The CLP components run in containers, so `127.0.0.1` would refer to the container itself. | ||
| Instead, use: | ||
| - `host.docker.internal` to refer to the host machine. | ||
| - The host's actual network IP address (e.g., `192.168.1.10`) | ||
| ::: |
There was a problem hiding this comment.
not sure whether this coderabbit comment is valid or not
9e19374 to
723df8b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/clp-package-utils/clp_package_utils/controller.py`:
- Around line 193-200: The code eagerly resolves the DB hostname via
_get_ip_from_hostname when building env vars (BundledService.DATABASE branch),
which can produce stale extra_hosts for dynamic DNS; update the controller to
log or document this limitation and avoid surprising operators by: add a brief
warning message in the same setup routine (e.g., set_up_env or the surrounding
method) noting that hostname is resolved at setup time and may become stale for
cloud-managed/floating IP DB endpoints, include the raw hostname in the log (not
just the IP) and recommend using stable static IPs or service discovery, and
add/update the controller class docstring or README to mention this behavior so
operators are aware.
In `@components/clp-py-utils/clp_py_utils/clp_config.py`:
- Around line 404-407: Change the boolean positional parameter is_bundled on
each transform_for_container method to be keyword-only to satisfy FBT001: update
the signatures (e.g., transform_for_container(self, *, is_bundled: bool)) for
the four implementations found in clp_config.py and keep the type hint and
behavior unchanged; update any internal docstring or comments if present and
ensure call sites use is_bundled=... (e.g., is_bundled=BundledService.DATABASE
in self.bundled) so calls remain explicit and linting passes.
| if BundledService.DATABASE not in self._clp_config.bundled: | ||
| env_vars |= { | ||
| "CLP_DB_PORT": str(self._clp_config.database.port), | ||
| "CLP_EXTRA_HOST_DATABASE_NAME": DB_COMPONENT_NAME, | ||
| "CLP_EXTRA_HOST_DATABASE_ADDR": _get_ip_from_hostname( | ||
| self._clp_config.database.host | ||
| ), | ||
| } |
There was a problem hiding this comment.
DNS resolution at setup time could become stale for external services.
_get_ip_from_hostname resolves the hostname to an IP eagerly during set_up_env. If the external database's IP changes between setup and when containers actually connect (e.g., cloud-managed DB with floating IPs), the extra_hosts mapping will be stale.
This is an inherent limitation of Docker's extra_hosts mechanism and likely acceptable for the target deployment model, but worth documenting for operators using dynamic DNS endpoints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-package-utils/clp_package_utils/controller.py` around lines
193 - 200, The code eagerly resolves the DB hostname via _get_ip_from_hostname
when building env vars (BundledService.DATABASE branch), which can produce stale
extra_hosts for dynamic DNS; update the controller to log or document this
limitation and avoid surprising operators by: add a brief warning message in the
same setup routine (e.g., set_up_env or the surrounding method) noting that
hostname is resolved at setup time and may become stale for
cloud-managed/floating IP DB endpoints, include the raw hostname in the log (not
just the IP) and recommend using stable static IPs or service discovery, and
add/update the controller class docstring or README to mention this behavior so
operators are aware.
| def transform_for_container(self, is_bundled: bool): | ||
| if is_bundled: | ||
| self.host = DB_COMPONENT_NAME | ||
| self.port = self.DEFAULT_PORT |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider making is_bundled a keyword-only argument to address FBT001.
Ruff flags FBT001 (boolean positional argument) on all four transform_for_container signatures (Lines 404, 482, 513, 550). Making it keyword-only would improve call-site readability and satisfy the lint rule.
♻️ Example fix (apply similarly to Redis, ResultsCache, Queue)
- def transform_for_container(self, is_bundled: bool):
+ def transform_for_container(self, *, is_bundled: bool):Callers would then use:
self.database.transform_for_container(is_bundled=BundledService.DATABASE in self.bundled)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def transform_for_container(self, is_bundled: bool): | |
| if is_bundled: | |
| self.host = DB_COMPONENT_NAME | |
| self.port = self.DEFAULT_PORT | |
| def transform_for_container(self, *, is_bundled: bool): | |
| if is_bundled: | |
| self.host = DB_COMPONENT_NAME | |
| self.port = self.DEFAULT_PORT |
🧰 Tools
🪛 Ruff (0.15.1)
[error] 404-404: Boolean-typed positional argument in function definition
(FBT001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/clp-py-utils/clp_py_utils/clp_config.py` around lines 404 - 407,
Change the boolean positional parameter is_bundled on each
transform_for_container method to be keyword-only to satisfy FBT001: update the
signatures (e.g., transform_for_container(self, *, is_bundled: bool)) for the
four implementations found in clp_config.py and keep the type hint and behavior
unchanged; update any internal docstring or comments if present and ensure call
sites use is_bundled=... (e.g., is_bundled=BundledService.DATABASE in
self.bundled) so calls remain explicit and linting passes.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/deployment/package/docker-compose-all.yaml`:
- Around line 235-236: The URI uses CLP_EXTRA_HOST_RESULTS_CACHE_ADDR
(defaulting to results_cache) directly in the MongoDB connection string instead
of the usual hardcoded hostname + extra_hosts pattern; add a short inline
comment above the "--uri" line explaining that this deliberate divergence is
required for MongoDB replica set member hostname resolution (so replica members
resolve their actual hostnames) and should not be changed to the other services'
hostname/extra_hosts pattern; reference the exact symbol
CLP_EXTRA_HOST_RESULTS_CACHE_ADDR and the URI fragment
"mongodb://${CLP_EXTRA_HOST_RESULTS_CACHE_ADDR:-results_cache}:.../${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}"
in the comment so future maintainers understand the intent.
| "--uri", "mongodb://${CLP_EXTRA_HOST_RESULTS_CACHE_ADDR:-results_cache}\ | ||
| :${CLP_RESULTS_CACHE_PORT:-27017}/${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}", |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Intentional divergence from other services' host resolution pattern.
This service uses CLP_EXTRA_HOST_RESULTS_CACHE_ADDR directly in the URI (defaulting to results_cache) instead of using the hardcoded hostname + extra_hosts redirection pattern used by all other services. The PR description explains this is for MongoDB replica set member hostname resolution.
Consider adding a brief inline comment above this line explaining why the address is used directly here rather than the hostname-based approach, so future maintainers don't "fix" it to match the others.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/deployment/package/docker-compose-all.yaml` around lines 235 - 236, The
URI uses CLP_EXTRA_HOST_RESULTS_CACHE_ADDR (defaulting to results_cache)
directly in the MongoDB connection string instead of the usual hardcoded
hostname + extra_hosts pattern; add a short inline comment above the "--uri"
line explaining that this deliberate divergence is required for MongoDB replica
set member hostname resolution (so replica members resolve their actual
hostnames) and should not be changed to the other services' hostname/extra_hosts
pattern; reference the exact symbol CLP_EXTRA_HOST_RESULTS_CACHE_ADDR and the
URI fragment
"mongodb://${CLP_EXTRA_HOST_RESULTS_CACHE_ADDR:-results_cache}:.../${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}"
in the comment so future maintainers understand the intent.
…lution for Docker compatibility
…s in Docker environments
The {{- if has "..." .Values.clpConfig.bundled }} guards from PR y-scope#1681
were lost when resolving merge conflicts with main. Restore them for:
- database-statefulset.yaml
- db-table-creator-job.yaml (initContainers)
- queue-statefulset.yaml
- redis-statefulset.yaml
- results-cache-statefulset.yaml
- results-cache-indices-creator-job.yaml (initContainers)
Without these guards, Helm renders bundled third-party resources even
when configured to use external services.
The merge incorrectly hardcoded the MongoDB URI using clp.fullname instead of the clp.resultsCacheHost/clp.resultsCachePort helpers introduced in PR y-scope#1681 for external service support.
…ces (resolves y-scope#1680). (y-scope#1681) Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Description
Enable users to deploy CLP with external third-party services (database, queue, Redis,
results cache) instead of the bundled ones, for both the CLP package (Docker Compose)
and Helm chart deployments. Users configure which services are bundled via the
bundledlist in
clp-config.yaml; services omitted from the list are treated as external andmust be provided by the user.
Changes
clp_config.py— Host/port transformation withis_bundledDatabase,Queue,Redis,ResultsCache:transform_for_container()now acceptsis_bundled: bool. The host is always set to the internal component name (e.g.,database,queue) so containers use a consistent hostname. The port is only resetto the default when bundled; when external, the user-configured port is preserved.
Database.get_container_url()usesself.hostandself.port(post-transform)instead of hardcoded values, so JDBC URLs reflect the actual (possibly external)
connection.
ClpConfig.transform_for_container()passes theis_bundledflag based onself.bundledmembership.controller.py— Conditional env var export and external host resolution_set_up_env_for_database(),_set_up_env_for_queue(),_set_up_env_for_redis(),_set_up_env_for_results_cache(): Only emitCLP_EXTRA_HOST_*_NAME,CLP_EXTRA_HOST_*_ADDR, andCLP_*_PORTwhen the service is not bundled. Whenbundled, Docker DNS resolves service names without
extra_hosts._resolve_external_host(): Returns Docker'shost-gatewaytoken forlocalhost/127.0.0.1(so containers can reach services running on the Docker hostvia Docker's built-in host resolution), and falls back to DNS resolution for other
hostnames.
docker-compose-all.yaml—extra_hostsand env-var-driven portsextra_hoststox-service-defaultswith env-var-driven entries. When bundled,defaults map no-op hostnames to
127.0.0.2(harmless, Docker DNS takes precedence).When external, the controller populates them to override DNS with the external address.
queue,redis,database) andenv vars for ports (previously hardcoded).
results-cache-indices-creatorURI uses${CLP_EXTRA_HOST_RESULTS_CACHE_ADDR:-results_cache}so that when external, the addr resolves directly (needed because
initialize-results-cache.pyuses the URI hostname to configure the MongoDB replica set member, and the external MongoDB
server must be able to resolve that hostname to itself).
Helm chart — Conditional bundling
clpConfig.bundledlist tovalues.yaml(default: all four services)._helpers.tpl:clp.isBundled,clp.databaseHost/Port,clp.queueHost/Port,clp.redisHost/Port,clp.resultsCacheHost/Port. Theseresolve to the bundled K8s service name or the user-configured external host/port.
configmap.yamluses the host/port helpers instead of hardcoded service names.{{- if has "service" .Values.clpConfig.bundled }}conditionals.results-cache-indices-creator-jobanddb-table-creator-jobuse the helpers andconditionally include init-wait containers.
0.1.4-dev.3.Config templates & docs
clp-config.template.*.yaml: Improvedbundledsection comments; explicit YAML listformat with guidance on removing services.
guides-external-database.md: Restructured to use the newbundledconfigurationworkflow (configure
bundledlist + host/port, then start CLP). Added warning aboutlocalhost MongoDB not being supported when external.
guides-docker-compose-deployment.md: Simplified multi-host guide; removed manualpost-setup config editing (now auto-generated).
quick-start/clp-json.md,clp-text.md: Cross-reference to external database guide.Checklist
breaking change.
Validation performed
Tested on the actual built package (
taskto build, thensbin/start-clp.sh).Scenario 1: Bundled mode (default config)
All services bundled (default
clp-config.yaml), no external services running.Task: Verify that the default bundled deployment starts and all services become healthy.
Command:
Output:
Explanation: All 14 containers started and became healthy (database, queue, redis,
results-cache, db-table-creator, results-cache-indices-creator, compression-scheduler,
query-scheduler, compression-worker, query-worker, reducer, webui, api-server,
garbage-collector).
.envcontains noCLP_EXTRA_HOST_*orCLP_*_PORTentries —bundled mode uses Docker DNS only, no
extra_hostsoverrides needed.Scenario 2: All four services external
All four services (
database,queue,redis,results_cache) unbundled and runningexternally on the Docker host.
Task: Start 4 external services using the same images as CLP, configure
bundled: [], and verify all CLP containers connect to the external services.Setup — start four external services on the Docker host:
(Credentials match
etc/credentials.yaml. Images match those used by CLP. MongoDB isstarted with
--bind_ip_alland initialized at the host's non-loopback IP sincelocalhost MongoDB is not supported when external — see docs warning.)
Config (
etc/clp-config-ext-all.yaml):Command:
Output:
Verification — container status:
No
database,queue,redis, orresults-cachecontainers — all are external.Verification —
.envexternal host entries:Database, queue, and redis configured with
localhostresolved to Docker'shost-gatewaytoken, which Docker maps to the host's IP at container start. Results cache uses the
non-loopback IP directly (
192.168.3.247).Verification — service logs:
results-cache-indices-creatorconnected to external MongoDB and exited cleanly (exitcode 0, no warnings — replica set was initialized at the same IP
192.168.3.247).db-table-creatorconnected to external MariaDB:compression-schedulerstarted successfully:query-schedulerconnected to the external database viaextra_hostsmapping:Task: Verify end-to-end data flow (compression) with all external services.
Command:
Output:
Explanation: Compression job was submitted through external RabbitMQ, stored
metadata in external MariaDB, used external Redis for task coordination, and the
results-cache-indices-creatorsuccessfully created indices on the external MongoDB.Full data pipeline works with all four services external.
Scenario 3: Bundled mode in Kubernetes (kind cluster)
All services bundled (default
values.yaml), deployed viaset-up-test.sh.Task: Verify that the default Helm chart deployment starts all pods successfully.
Command:
After fix (
helm upgrade):Explanation: All 14 pods running/completed. Workers connected to queues:
Scenario 4: All four services external in Kubernetes (kind cluster)
All four services unbundled, running as separate Docker containers on the host, accessible
from kind pods via the host's non-loopback IP (
192.168.3.247).Setup — start four external services on the host:
(Credentials match
values.yamldefaults. Non-default ports used to avoid conflictswith the bundled deployment that was running earlier. No manual
rs.initiate()needed —initialize-results-cache.pydetects the uninitialized replica set and initializes itusing the configured URI hostname.)
Override values (
ext-values.yaml):Command:
Pod status:
No database, queue, redis, or results-cache statefulsets — all external.
Verification — no bundled resources:
Verification — service logs (external connections):
db-table-creatorconnected to external MariaDB:results-cache-indices-creatorcompleted (exit 0, no output — replica set auto-initialized).compression-workerconnected to external queue and redis:query-schedulerconnected to external database:Verification — configmap (external hosts):
{ "SqlDbHost": "192.168.3.247", "SqlDbPort": 13306, "MongoDbHost": "192.168.3.247", "MongoDbPort": 17017 }Explanation: Helm chart correctly skips all bundled service resources (StatefulSets,
Services, PVs, Secrets for queue/redis) when
bundled: []. The configmap, Celeryconnection strings, and init-job URIs all correctly resolve to the user-configured
external host/port. All CLP pods connect to external services successfully.