Skip to content

feat(clp-package): Add log-ingestor config interface and Docker Compose orchestration.#1741

Merged
LinZhihao-723 merged 70 commits into
y-scope:mainfrom
LinZhihao-723:log-ingestor-package-integration
Dec 10, 2025
Merged

feat(clp-package): Add log-ingestor config interface and Docker Compose orchestration.#1741
LinZhihao-723 merged 70 commits into
y-scope:mainfrom
LinZhihao-723:log-ingestor-package-integration

Conversation

@LinZhihao-723

@LinZhihao-723 LinZhihao-723 commented Dec 8, 2025

Copy link
Copy Markdown
Member
  • Integrate log-ingestor into CLP package.
  • Integrate log-ingestor into Docker Compose orchestration.

Description

NOTE: This PR depends on #1740.

This PR integrates log-ingestor into CLP package. This means log-ingestor will be started as a CLP service when starting a CLP package (clp-json) by default inside the Docker Compose.

The changes in this PR are mainly mirrored from #1575 and #1709.

Notice that users need to directly submit requests to log-ingestor's host at this moment. In a future PR, we should integrate log-ingestor into the API server so that users do not need to know where the log ingestor is hosted explicitly.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • New Features

    • Introduced a log ingestor service with configurable host, port, logging level and buffer-flush behaviour.
    • Deployment and runtime now conditionally enable the log ingestor based on log storage configuration.
  • Documentation

    • Added configuration templates, deployment diagrams and multi-host guidance for the log ingestor.
    • Updated deployment compose files to include the log ingestor service.

✏️ Tip: You can customize this high-level summary in your review settings.

"CRITICAL",
]

LoggingLevelRust = Literal[

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not related to Python logging, can we move it to clp_config.py? or some constants / definitions file if clp_config.py is already too long

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will put it in clp_config.py. Figuring out where to put this constant seems out of scope for this PR. Prefer to do it in a refactoring PR.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

class LogIngestor(BaseModel):
host: DomainStr = "localhost"
port: Port = 3270
buffer_timeout: PositiveInt = 300

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_flush_timeout sounds more intuitive


class LogIngestor(BaseModel):
host: DomainStr = "localhost"
port: Port = 3270

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, we don't really enforce any rules in picking ports. though maybe we can just do the (API server port + 1) = 3002 so the number appears less random?

# initial_backoff_ms: 100
# max_backoff_ms: 5000

## log-ingestor config

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, maybe we can just remove this comment? since it doesn't add much info (like it's kinda apparent that the log_ingestor config object is for log-ingestor configuration

#log_ingestor:
# host: "localhost"
# port: 3270
# # The timeout (in seconds) after which the log buffer is flushed for compression if no new input

@junhaoliao junhaoliao Dec 9, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users might not care about Pydantic fields, they will only modify this config.

ah i don't mean users need to check the pydantic code. like if they have a YAML LSP capable plugin installed in their IDE (e.g. PyCharm and VS Code), the IDE would show the helper text because we do generate schemas for the config file now.

I agree we should keep the comments in the YAML file, until a reference doc is added for the whole config file. see #1233

| webui | Web server for the UI |
| mcp_server | MCP server for AI agent to access CLP functionalities |
| garbage_collector | Process to manage data retention |
| log_ingestor | A background service for continuous log ingestion |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

Server for log ingestion from S3/SQS sources.

The "background service" part isn't that descriptive cuz other services are also services and also run in the background

host: DomainStr = "localhost"
port: Port = 3270
buffer_timeout: PositiveInt = 300
buffer_size_threshold: PositiveInt = 52428800 # 50 MB

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai submit an issue for fixing the units in other doc-strings

# # arrives.
# buffer_timeout: 300
# # The log buffer size (in bytes) that triggers a flush for compression.
# buffer_size_threshold: 52428800

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure (note: we want to use MiB in the docs

| webui | Web server for the UI |
| mcp_server | MCP server for AI agent to access CLP functionalities |
| garbage_collector | Process to manage data retention |
| log_ingestor | A background service for continuous log ingestion |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server for orchestrating and running continuous log ingestion jobs.

sure

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the title is fine, though i think it can be as simple as:

feat(clp-package): Add log-ingestor config interface and Docker Compose orchestration.

(the purpose of the service will be included in the release notes so we don't have to be explicit about that in this commit message

Comment thread docs/src/user-docs/guides-multi-host.md Outdated
"""
component_name = LOG_INGESTOR_COMPONENT_NAME
if self._clp_config.log_ingestor is None:
logger.info(f"log-ingestor is not configured, skipping {component_name} creation...")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems there's a regression

Comment thread components/clp-package-utils/clp_package_utils/controller.py
Comment thread components/clp-package-utils/clp_package_utils/controller.py Outdated
@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package): Add log-ingestor service for continuous log ingestion from S3: feat(clp-package): Add log-ingestor config interface and Docker Compose orchestration. Dec 9, 2025
junhaoliao
junhaoliao previously approved these changes Dec 9, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09631af and a240ca5.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/controller.py (3 hunks)
  • components/package-template/src/etc/clp-config.template.json.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-01T07:59:11.208Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
📚 Learning: 2025-06-24T20:13:46.758Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 939
File: components/package-template/src/etc/clp-config.yml:64-64
Timestamp: 2025-06-24T20:13:46.758Z
Learning: When users ask CodeRabbit to create an issue after providing suggestions, they want a GitHub issue created with the high-level requirements and context, not specific code implementations.

Applied to files:

  • components/package-template/src/etc/clp-config.template.json.yaml
📚 Learning: 2025-08-08T21:15:10.905Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

  • components/package-template/src/etc/clp-config.template.json.yaml
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (1)
components/clp-package-utils/clp_package_utils/controller.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (161-163)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path_in_container (82-103)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
  • GitHub Check: rust-checks
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/controller.py (1)

35-52: Verify CLP_LOG_INGESTOR_ENABLED against docker-compose templates before merging

The new _set_up_env_for_log_ingestor follows existing patterns: it explicitly sets CLP_LOG_INGESTOR_ENABLED="0" when disabled, and relies on docker-compose defaults to enable the component otherwise. Confirm that docker-compose service definitions use ${CLP_LOG_INGESTOR_ENABLED:-1} or equivalent to ensure the implicit enabling behavior matches other *_ENABLED flags (e.g., CLP_QUERY_SCHEDULER_ENABLED).

Comment on lines +16 to +28
## log-ingestor config. Currently, the config is applicable only if `logs_input.type` is "s3".
#log_ingestor:
# host: "localhost"
# port: 3002
# # The timeout (in seconds) after which the log buffer is flushed for compression if no new input
# # arrives.
# buffer_flush_timeout: 300
# # The log buffer size (in bytes) that triggers a flush for compression.
# buffer_flush_threshold: 268435456 # 256 MiB
# # The capacity of the internal channel used for communication between an ingestion job and the
# # log buffer.
# channel_capacity: 10
# logging_level: "INFO"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Log ingestor template block looks consistent; consider clarifying all applicability constraints

The new log_ingestor block and its defaults look sensible and consistent with other sections in this template. To reduce user confusion, consider also mentioning in the comment that log‑ingestor is only supported when the package storage engine is clp-s (since the Python config enforces that), in addition to logs_input.type being "s3".

🤖 Prompt for AI Agents
components/package-template/src/etc/clp-config.template.json.yaml lines 16-28:
the comment for the log_ingestor block only says it applies when logs_input.type
is "s3" but the Python config also requires the package storage engine to be
"clp-s"; update the commented header to state both applicability constraints
(logs_input.type == "s3" AND package storage engine == "clp-s"), keeping the
existing defaults and formatting, and make the note concise so users know both
conditions must be met for log_ingestor to be used.

@LinZhihao-723 LinZhihao-723 merged commit 39398e6 into y-scope:main Dec 10, 2025
20 checks passed
davidlion pushed a commit to davidlion/clp that referenced this pull request Jan 17, 2026
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants