Centralized operational logging architecture for multi-host CLP deployments
Document structure: This document covers Docker Compose deployments first, followed by a Kubernetes-specific section that explains how the same approaches
apply to Kubernetes.
Request
Background
Currently, CLP's deployment architecture requires container-to-host volume mounts for all service
logs. Each service writes logs to files within /var/log/<component>/ which are mounted from ${CLP_LOGS_DIR_HOST:-./var/log} on the host. While this approach has been convenient for
single-host deployments (allowing users to easily provide logs by archiving host files), it creates
significant challenges for multi-host deployments:
Multi-host incompatibility: In Kubernetes or multi-node Docker Compose deployments, logs are
scattered across different hosts, making centralized log access difficult
Storage overhead: Each host requires dedicated storage for log retention
Operational complexity: Admins must access (e.g., SSH into) individual hosts or set up
additional log aggregation infrastructure
With planned support for multi-host deployments through Kubernetes and the addition of the log-ingestor
component, there's an opportunity to modernize the operational logging architecture to leverage:
CLP's own compression technology - operational logs can benefit from the same
high-compression-ratio storage that user logs enjoy
Container-native logging - Docker/Kubernetes native log drivers eliminate the need for host
mounts
Centralized access - all logs accessible from a single control node
WebUI integration - operational logs viewable alongside user logs in the existing CLP
interface
Requirements
The new operational logging solution must satisfy the following requirements:
R1: Multi-host Support
R1.1: Support both Kubernetes (multi-node) and Docker Compose (single or multi-host)
deployments
R1.2: All logs accessible from a central control node, regardless of where services are
running
R1.3: No per-host log file access required for normal operations
R2: Tiered Access
R2.1 (Hot): Real-time access to recent logs (0-X minutes) for debugging active/crashed
services
Maximum acceptable lag: < 30 seconds
Must support live tailing
R2.2 (Warm): Recent historical logs (X minutes - Y hours) available uncompressed for immediate
grep/analysis
Maximum acceptable lag: < 5 minutes
Must support live tailing
R2.3 (Cold): Older logs compressed using CLP for long-term storage and efficient retrieval
Maximum acceptable lag: < 24 hours
Must support full-text search
R3: Admin Access & Export
R3.1: Deployment admins can view all logs from all services
R3.2: Easy export mechanism for sending logs to support/developers
R3.3: Export should include both real-time and historical logs
R4: WebUI Integration
R4.1: Dedicated WebUI page for viewing real-time operational logs (files on disk)
R4.2: Operational logs searchable through existing Search page once archived
R4.3: Support filtering by service name, log level, time range
R4.4: (Future) Admin-only access control for operational logs
R5: Lightweight & Efficient
R5.2: Use CLP's compression capabilities for long-term storage
R5.3: No heavyweight dependencies
R6: Incremental Migration
R6.1: Support gradual service-by-service migration from file-based to centralized logging
R6.2: Maintain backward compatibility during transition
R6.3: Clear deprecation path for CLP_LOGS_DIR environment variables
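The R2 lag limits can be checked during migration with a canary. Emit a log line with a unique marker, then record when it first becomes visible in each tier. Below is a minimal Python sketch of the comparison step. It assumes such a probe exists; `check_canary`, the tier names, and the probe itself are hypothetical, and how visibility is detected is out of scope.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Maximum acceptable lag per tier, taken from R2.1-R2.3 (tier names are hypothetical).
MAX_LAG = {
    "hot": timedelta(seconds=30),
    "warm": timedelta(minutes=5),
    "cold": timedelta(hours=24),
}


def check_canary(
    emitted_at: datetime,
    visible_at: Dict[str, Optional[datetime]],
    now: datetime,
) -> Dict[str, str]:
    """Compare when a canary log line appeared in each tier against R2's limits.

    visible_at maps a tier to the time the canary was first seen there; a
    missing entry means it has not shown up yet. Returns "ok", "pending"
    (not visible yet but still within the limit) or "violated" per tier.
    """
    report = {}
    for tier, limit in MAX_LAG.items():
        seen = visible_at.get(tier)
        lag = (seen if seen is not None else now) - emitted_at
        if seen is None and lag < limit:
            report[tier] = "pending"
        elif lag < limit:
            report[tier] = "ok"
        else:
            report[tier] = "violated"
    return report
```

Measuring one marked line end to end avoids a false "fresh" signal. That signal can come from a reading of "now minus newest record" while a noisy service keeps emitting unrelated lines.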
Possible implementation
Architecture Overview
Current
flowchart TD
A["All services write to files<br/>(some also write to stdout)"]
B["Docker local logging driver"]
subgraph Host1 [Host 1]
C1["/var/log/<component>/*.log<br/>(via volume mount)"]
D1["docker logs container-name"]
end
subgraph Host2 [Host 2]
C2["/var/log/<component>/*.log<br/>(via volume mount)"]
D2["docker logs container-name"]
end
A --> B
B --> C1
B --> D1
B --> C2
B --> D2
E["Admin must access each host separately<br/>(SSH, copy files, etc.)"]
C1 -.-> E
C2 -.-> E
Characteristics:
Services write logs to files via volume mounts from ${CLP_LOGS_DIR_HOST}
Logs scattered across hosts - no centralized access
Admin must SSH to each host to view/export logs
After (CLP-managed Fluent Bit)
flowchart TD
A[All services write to stdout]
B["Docker fluentd logging driver"]
subgraph ControlNode [Control Node]
C["Fluent Bit<br/>(receives logs from all hosts)"]
subgraph Output1 [Output 1: File with rotation]
D["/var/log/<component>/*.log"]
E["WebUI reads directly<br/>(real-time access)"]
D --> E
end
subgraph Output2 [Output 2: S3 + CLP archives]
F["S3 (IRv2 compressed logs)"]
G["Log-ingestor (periodic ingestion)"]
H["CLP Archives (dataset='_clp')"]
I["WebUI Search page"]
F --> G --> H --> I
end
C --> D
C --> F
end
subgraph WorkerNode1 [Worker Node 1]
W1["Container logs"]
end
subgraph WorkerNode2 [Worker Node 2]
W2["Container logs"]
end
A --> B
B --> W1 -->|"fluentd-address"| C
B --> W2 -->|"fluentd-address"| C
J["docker logs (via dual logging cache)"]
B -.-> J
Characteristics:
All logs centralized on control node via Fluent Bit
Organized path structure (/var/log/<component>/)
Automatic S3 upload with IRv2 compression
Historical logs searchable in WebUI via _clp dataset
docker logs still works via dual logging (Docker 20.10+)
Three-Tier Data Lifecycle
Hot Tier (0-X minutes): Files on disk at /var/log/<component>/*.log
Access method: WebUI new endpoint /os/cat (similar to existing /os/ls)
Retention: Managed by Fluent Bit file rotation (time-based or size-based)
Purpose: Real-time debugging, live tail, recent log access
Warm Tier (X min - Y hours): IRv2 files on S3
Access method: (Future optimization) Query directly from IRv2 without full archive
ingestion
Retention: Until log-ingestor processes and archives them
Cold Tier (>Y hours): CLP Archives
Access method: Existing WebUI Search page with dataset=_clp filter
Retention: Configurable archive retention policy
Purpose: Long-term searchable storage with high compression
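The lifecycle above amounts to a routing rule: a record's age decides where a reader should look. Below is a minimal Python sketch of that rule. `hot_window` and `warm_window` stand in for X and Y, whose values are still open. `Tier`, `ACCESS` and `tier_for` are hypothetical names.

```python
from datetime import datetime, timedelta
from enum import Enum


class Tier(Enum):
    HOT = "hot"    # files on disk under /var/log/<component>/
    WARM = "warm"  # IRv2 files on S3, not yet archived
    COLD = "cold"  # CLP archives in dataset _clp


# Where a reader would look for each tier, per the lifecycle above.
ACCESS = {
    Tier.HOT: "/os/cat",
    Tier.WARM: None,  # direct IRv2 querying is a future optimization
    Tier.COLD: "/search?dataset=_clp",
}


def tier_for(
    record_time: datetime,
    now: datetime,
    hot_window: timedelta,
    warm_window: timedelta,
) -> Tier:
    """Return the tier a record of this age is expected to live in."""
    if warm_window <= hot_window:
        raise ValueError("warm_window (Y) must be longer than hot_window (X)")
    age = now - record_time
    if age < hot_window:
        return Tier.HOT
    if age < warm_window:
        return Tier.WARM
    return Tier.COLD
```

In practice the tiers overlap: hot files remain on disk after their contents are uploaded, and uploaded IRv2 stays on S3 until it is archived. A real reader may therefore query more than one tier near the boundaries.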
Component Changes
1. Fluent Bit deployment
Docker Compose
Kubernetes: DaemonSet or single Deployment on control node (to be determined based on performance
testing)
Log tailing
2. Fluent Bit Configuration for Docker Compose
fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
[INPUT]
Name forward
Listen 0.0.0.0
Port 24224
# Output 1: File for real-time access
[OUTPUT]
Name file
Match *
Path /var/log
Format json
# Rotation policy (matches CLP plugin flush)
# TBD: time-based or size-based to align with CLP plugin
# Output 2: CLP plugin for S3 + IRv2 compression
[OUTPUT]
Name clp_s3
Match *
s3_region ${CLP_S3_REGION}
s3_bucket ${CLP_S3_BUCKET}
s3_bucket_prefix ir/${FLUENT_BIT_TAG}/%Y/%m/%d/
upload_size_mb 16
use_disk_buffer true
# Uses Zstd compression for IRv2 format
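The s3_bucket_prefix above groups IRv2 objects by component and day. Below is a minimal Python sketch of how such keys expand, and of how a consumer such as the log-ingestor might enumerate the prefixes covering a time range. Several points are assumptions: that the Fluent Bit tag equals the component name, that days are bucketed in UTC, and that the plugin expands the %Y/%m/%d tokens at all (this should be checked against the plugin's documentation). `ir_prefix` and `daily_prefixes` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def ir_prefix(component: str, when: datetime) -> str:
    """Expand ir/<component>/%Y/%m/%d/ for a chunk written at `when`."""
    if not component or "/" in component:
        raise ValueError(f"invalid component name: {component!r}")
    # Bucket by UTC day so every host agrees on the date (assumption).
    return f"ir/{component}/{when.astimezone(timezone.utc):%Y/%m/%d}/"


def daily_prefixes(component: str, start: datetime, end: datetime) -> List[str]:
    """List the day prefixes a reader would scan to cover [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    prefixes = []
    while day <= last:
        prefixes.append(f"ir/{component}/{day:%Y/%m/%d}/")
        day += timedelta(days=1)
    return prefixes
```

With day-level prefixes, an ingestor can list only the prefixes for the window it still has to process, instead of scanning the whole bucket.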
Open Questions:
What should be the rotation policy for file output?
Time-based (e.g., rotate every 5 minutes)?
Size-based (e.g., rotate at 50MB)?
Should align with CLP plugin's flush policy to ensure synchronization
What is the exact flush/upload behavior of the CLP Fluent Bit plugin (irv2-beta)?
3. Service migration (Incremental)
Phase 1: Migrate CLP first-party services
Phase 2: Migrate third-party services (database, queue, redis, results-cache)
Backward Compatibility:
After validation, remove old mounts and deprecate CLP_LOGS_DIR env vars
4. S3 configuration (clp-config.yaml)
New bundled services:
bundled: ["database", "queue", "redis", "results_cache", "fluentbit", "minio"]
S3 path structure: ir/<component>/<date>/ (as set by s3_bucket_prefix in the Fluent Bit configuration above)
5. Log-ingestor configuration
Ingest into dataset "_clp"
6. WebUI enhancements
6.1 New /os/cat API Endpoint
Volume mount (add to webui service in docker-compose-all.yaml): the logs volume, mounted at /var/log
6.2 New "Operational Logs" Page
For querying real-time logs.
Location: /components/webui/client/src/pages/OperationalLogsPage/
Features:
For historical logs, redirect to the search page with the dataset filter set to _clp (see 6.3)
Open Questions:
Should we also add /os/tail to support live tailing with server-sent events?
Pro: True live tail without client polling
Con: More complex implementation
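Client-side polling of /os/cat works if each poll returns only what was appended since the previous one. Below is a minimal sketch of the server-side logic, written in Python purely for illustration; the WebUI server's actual language and framework are not assumed. It rejects paths outside the mounted log root, reads from a byte offset, holds back a trailing partial line, and restarts from zero when the file shrinks. `resolve_log_path` and `read_new_lines` are hypothetical names, and the offset-based request shape is an assumption rather than the endpoint's defined API.

```python
from pathlib import Path
from typing import Tuple


def resolve_log_path(root: Path, relative: str) -> Path:
    """Resolve a client-supplied path and refuse anything outside root."""
    root_resolved = root.resolve()
    candidate = (root_resolved / relative).resolve()
    if candidate != root_resolved and root_resolved not in candidate.parents:
        raise PermissionError(f"{relative!r} is outside {root}")
    return candidate


def read_new_lines(path: Path, offset: int, max_bytes: int = 64 * 1024) -> Tuple[bytes, int]:
    """Return complete lines written after `offset`, plus the offset to poll from next."""
    if path.stat().st_size < offset:
        offset = 0  # file shrank: assume it was rotated or truncated
    with path.open("rb") as f:
        f.seek(offset)
        chunk = f.read(max_bytes)
    cut = chunk.rfind(b"\n") + 1
    if cut:
        chunk = chunk[:cut]  # hold back a trailing partial line
    elif len(chunk) < max_bytes:
        return b"", offset  # no complete line yet
    # Otherwise a single line is longer than max_bytes; return it in pieces.
    return chunk, offset + len(chunk)
```

Detecting rotation by size alone misses the case where a new file has already grown past the old offset. Comparing inode numbers across polls would be more robust, and an /os/tail server-sent-events variant could reuse the same reader.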
6.3 Search Page Enhancement
Once the log-ingestor for CLP operational logs is ready, we can verify that the logs are searchable
in the search page.
Dataset filter:
Add URL parameter support: /search?dataset=_clp
Migration Timeline
Phase 1: Infrastructure Setup
Add Fluent Bit service to docker-compose-all.yaml
Create Fluent Bit configuration with dual outputs (file + CLP plugin)
Add fluentbit and minio (optional) to bundled services in config schema
Update clp-config.yaml templates with S3 path structure
Phase 2: WebUI Development
Implement /os/cat API endpoint
Create Operational Logs page UI
Add dataset URL parameter support to Search page
Mount /var/log volume to webui service
Phase 3: Service Migration for Third-Party Services
Migrate bundled services (no change in our code is required)
database (MariaDB)
queue (RabbitMQ)
redis
results-cache (MongoDB)
Challenges:
Each service has different log format
May require custom Fluent Bit parsers
Verify no log loss during migration
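Because each bundled service logs in its own shape, a per-service parser must map lines onto common fields without dropping anything it fails to parse. The Python sketch below shows that contract. The two line shapes (MariaDB-style and RabbitMQ-style) are assumptions for illustration and must be verified against the actual images before any real Fluent Bit parser is written. `PATTERNS`, `LEVELS` and `normalize` are hypothetical.

```python
import re
from typing import Dict, Optional

# Illustrative line shapes only (assumed, not verified against the bundled
# images); real Fluent Bit parsers must be derived from actual output.
PATTERNS = {
    # e.g. "2024-05-01 10:00:00 0 [Note] Starting MariaDB"
    "database": re.compile(
        r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ "
        r"\[(?P<level>\w+)\] (?P<message>.*)$"
    ),
    # e.g. "2024-05-01 10:00:00.123456+00:00 [info] <0.230.0> Starting RabbitMQ"
    "queue": re.compile(
        r"^(?P<timestamp>\S+ \S+) \[(?P<level>\w+)\] <[^>]+> (?P<message>.*)$"
    ),
}

# Map service-specific level names onto one vocabulary.
LEVELS = {"note": "INFO", "info": "INFO", "warning": "WARNING",
          "warn": "WARNING", "error": "ERROR", "debug": "DEBUG"}


def normalize(component: str, line: str) -> Dict[str, Optional[str]]:
    """Turn one raw line into {timestamp, level, message, component}.

    Lines that match no pattern are kept verbatim with level UNKNOWN, so a
    parser gap never turns into log loss.
    """
    text = line.rstrip("\n")
    pattern = PATTERNS.get(component)
    match = pattern.match(text) if pattern else None
    if match is None:
        return {"timestamp": None, "level": "UNKNOWN", "message": text, "component": component}
    level = match.group("level")
    return {
        "timestamp": match.group("timestamp"),
        "level": LEVELS.get(level.lower(), level.upper()),
        "message": match.group("message"),
        "component": component,
    }
```

Counting UNKNOWN records per service during migration gives a simple signal that a parser needs work.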
Phase 4: Service Migration for First-Party Services (First Wave)
Migrate Python-based services (easier log format standardization):
compression-scheduler
compression-worker
query-scheduler
query-worker
garbage-collector
reducer
Per-service checklist:
Update logging driver to fluentd
Keep CLP_LOGS_DIR env var but mark as deprecated
Test real-time log access via WebUI
Test log ingestion to archives
Validate search functionality
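For the Python services, the checklist boils down to logging to stdout while still honouring CLP_LOGS_DIR for as long as it is set (R6.2). Below is a minimal sketch using only the standard logging module. The function name `configure_logging`, the format string, and the `<component>.log` file name inside CLP_LOGS_DIR are assumptions, not the services' current code.

```python
import logging
import os
import sys
import warnings
from pathlib import Path
from typing import Mapping, Optional


def configure_logging(component: str, env: Optional[Mapping[str, str]] = None) -> logging.Logger:
    """Send logs to stdout; also write a file while CLP_LOGS_DIR is still set."""
    env = os.environ if env is None else env
    logger = logging.getLogger(component)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    # Drop handlers from earlier calls so reconfiguring is idempotent.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    fmt = logging.Formatter("%(asctime)s %(name)s [%(levelname)s] %(message)s")

    # Primary path: stdout, collected by the container's logging driver.
    stdout_handler = logging.StreamHandler(sys.stdout)
    stdout_handler.setFormatter(fmt)
    logger.addHandler(stdout_handler)

    # Transitional path: keep the old file output if the variable is set.
    logs_dir = env.get("CLP_LOGS_DIR")
    if logs_dir:
        warnings.warn(
            "CLP_LOGS_DIR is deprecated; logs are collected from stdout",
            DeprecationWarning,
            stacklevel=2,
        )
        log_dir = Path(logs_dir)
        log_dir.mkdir(parents=True, exist_ok=True)
        file_handler = logging.FileHandler(log_dir / f"{component}.log")  # assumed file name
        file_handler.setFormatter(fmt)
        logger.addHandler(file_handler)
    return logger
```

Once Phase 6 removes the variable, the file branch simply stops running and no code change is needed at that moment.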
Phase 5: Service Migration for First-Party Services (Second Wave)
Migrate remaining services:
webui
mcp-server
api-server
log-ingestor (tricky: logging about logging)
spider-scheduler
spider-compression-worker
Phase 6: Cleanup & Optimization
Remove CLP_LOGS_DIR environment variables
Remove *volume_clp_logs mounts from services (keep only in Fluent Bit and webui)
Remove ${CLP_LOGS_DIR_HOST} host mounts
Documentation updates
Performance tuning
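A small check over the parsed Compose file can confirm that Phase 6 is complete. Below is a minimal Python sketch that flags services still setting CLP_LOGS_DIR, or still mounting something at /var/log, other than the two services allowed to keep the logs volume. It makes three assumptions, all labelled in the code: that those services are named "fluentbit" and "webui", that the logs volume is mounted at /var/log, and that the YAML has already been loaded into a dict. `leftover_log_config` is hypothetical.

```python
from typing import Any, Dict, List, Set

KEEP_LOG_MOUNT = {"fluentbit", "webui"}  # assumed service names (Phase 6 exceptions)
LOG_MOUNT_TARGET = "/var/log"            # assumed container path of the logs volume


def _env_names(environment: Any) -> Set[str]:
    """Compose allows environment as a mapping or as a list of KEY=VALUE strings."""
    if isinstance(environment, dict):
        return set(environment)
    return {item.split("=", 1)[0] for item in environment or []}


def _mount_targets(volumes: Any) -> Set[str]:
    """Collect container-side targets from short ("src:dst[:mode]") or long syntax."""
    targets = set()
    for v in volumes or []:
        if isinstance(v, dict):
            targets.add(v.get("target", ""))
        else:
            parts = v.split(":")
            targets.add(parts[1] if len(parts) > 1 else parts[0])
    return targets


def leftover_log_config(compose: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map service name -> remaining file-logging leftovers."""
    problems = {}
    for name, svc in (compose.get("services") or {}).items():
        issues = []
        if "CLP_LOGS_DIR" in _env_names(svc.get("environment")):
            issues.append("CLP_LOGS_DIR env var")
        mounts_logs = any(
            t == LOG_MOUNT_TARGET or t.startswith(LOG_MOUNT_TARGET + "/")
            for t in _mount_targets(svc.get("volumes"))
        )
        if mounts_logs and name not in KEEP_LOG_MOUNT:
            issues.append("log volume mount")
        if issues:
            problems[name] = issues
    return problems
```

Matching on the mount target rather than the `*volume_clp_logs` anchor is deliberate. YAML anchors are already resolved by the time the file is parsed, so only the target path remains visible.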
Future Optimizations
Direct IRv2 querying (Warm tier optimization):
Query worker currently supports archives only
Extend to support IRv2 stream files on S3
Would enable searching logs before full archive ingestion
WebUI live tail (Server-sent events):
Current proposal: Client-side polling of /os/cat
Optimization: Server-sent events for true push-based tail
Authentication & Authorization:
Current: No access control on operational logs
Future: Admin-only access to _clp dataset
Requires: an authentication system in the CLP Package (TBA)
Structured logging standardization:
Ensure all CLP services output JSON logs
Consistent field names (timestamp, level, message, component, etc.)
Easier filtering and parsing in WebUI
Multi-cluster support:
Current design: Single S3 bucket per deployment
Future: Multiple clusters writing to the same bucket with cluster ID prefix
Use case: Multi-region deployments for legal compliance
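For the structured-logging item, the Python services could share one formatter that emits a fixed set of JSON fields per line. Below is a minimal sketch built on the standard logging module. The field set (timestamp, level, message, component) follows the list above. The class name `JsonFormatter` and the ISO 8601 UTC timestamp with milliseconds are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with consistent field names."""

    def __init__(self, component: str) -> None:
        super().__init__()
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
            "component": self.component,
        }
        if record.exc_info:
            # Keep tracebacks inside the same JSON line.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)
```

Usage: `handler.setFormatter(JsonFormatter("reducer"))` on the service's stdout handler. One JSON object per line keeps multi-line tracebacks in a single record for Fluent Bit.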
Alternative approaches: Native Docker logging drivers
This section evaluates lighter-weight alternatives to the CLP-managed Fluent Bit approach, using
Docker's native logging drivers. These alternatives may appeal to users who:
Prefer a simpler CLP Package without log aggregation infrastructure
Already have their own log aggregation systems (Fluentd, Vector, OpenTelemetry, Loki, etc.)
Deploy on single-host environments only
Candidate logging drivers
1. json-file driver
The default Docker logging driver. Writes JSON-formatted logs to local files.
flowchart TD
A[All services write to stdout]
B["Docker json-file logging driver<br/>(with rotation: max-size, max-file)"]
subgraph Host1 [Host 1]
C1["/var/lib/docker/containers/<id>/*.log"]
D1["docker logs container-name"]
end
subgraph Host2 [Host 2]
C2["/var/lib/docker/containers/<id>/*.log"]
D2["docker logs container-name"]
end
A --> B
B --> C1 --> D1
B --> C2 --> D2
E["External log aggregation (optional)<br/>Fluentd, Vector, OpenTelemetry, Loki, etc."]
C1 -.-> E
C2 -.-> E
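Key options (Docker Docs: JSON File logging driver):
max-size: maximum size of a log file before it is rolled, as a number plus a unit (e.g., 10m, 1g); default -1 (unlimited)
max-file: maximum number of log files to keep; only effective when max-size is also set; default 1
compress: whether rotated log files are compressed; default false
labels: comma-separated labels to include in log metadata
env: comma-separated env vars to include in log metadata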
2. syslog driver
Routes container logs to a syslog server (local or remote).
flowchart TD
A[All services write to stdout]
B["Docker syslog logging driver"]
subgraph ControlNode [Control Node]
C["rsyslog container<br/>(receives logs from all hosts)"]
D["/var/log/<component>/*.log"]
E["WebUI reads directly<br/>(real-time access)"]
C --> D --> E
end
subgraph WorkerNode1 [Worker Node 1]
F1["Container logs"]
end
subgraph WorkerNode2 [Worker Node 2]
F2["Container logs"]
end
A --> B
B --> F1 -->|"tcp://rsyslog:514"| C
B --> F2 -->|"tcp://rsyslog:514"| C
G["docker logs (via dual logging cache)"]
B -.-> G
Key options (Docker Docs: Syslog logging driver):
syslog-address: udp://host:port, tcp://host:port, tcp+tls://host:port, or unix:///path
syslog-facility: syslog facility to use (e.g., daemon, local0-local7)
syslog-format: rfc3164, rfc5424, or rfc5424micro
syslog-tls-*: CA, certificate, key, and verification settings for tcp+tls connections
tag: template identifying the container's messages (e.g., {{.Name}}, {{.ID}})
docker logs command availability
A critical consideration is whether docker logs remains functional with each driver:
json-file: yes
local: yes
journald: yes
syslog: only via dual logging (Docker Engine 20.10+)
fluentd: only via dual logging (Docker Engine 20.10+)
Dual logging
(Docker Docs: Dual logging): Starting with
Docker Engine 20.10, Docker automatically caches logs locally when using remote logging drivers
(like syslog or fluentd), enabling docker logs to work. No configuration is required to enable
this feature.
Cache configuration: The cache options below can be configured either:
Per-container via --log-opt flags (e.g., --log-opt cache-max-size=50m)
Globally in /etc/docker/daemon.json (applies to all new containers)
Cache options (with defaults):
cache-disabled: disable local caching (default false)
cache-max-size: maximum size of the cache before it is rotated (default 20m)
cache-max-file: maximum number of cache files to keep (default 5)
cache-compress: compress rotated cache files (default true)
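Note: The Docker documentation does not provide explicit docker-compose examples for the cache-*
options. While the docs state these "can be specified per container", only daemon.json examples
are shown. In docker-compose, per-container log options go under the service's logging.options
mapping (the Compose equivalent of --log-opt). The cache-* keys would therefore be set there,
alongside the driver's other options.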
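Per-service logging blocks are easy to get subtly wrong, for example with an option that belongs to a different driver. A cache-* key on json-file is one such case. Below is a minimal Python sketch that builds the logging mapping for one service and validates option names against the options discussed in this section. The allowlists cover only those options, while Docker accepts more. `logging_block` is a hypothetical helper.

```python
from typing import Dict, Mapping

# Options discussed in this document; Docker accepts more than these.
ALLOWED_OPTIONS = {
    "json-file": {"max-size", "max-file", "compress", "labels", "env"},
    "syslog": {"syslog-address", "syslog-facility", "syslog-format",
               "syslog-tls-ca-cert", "syslog-tls-cert", "syslog-tls-key",
               "syslog-tls-skip-verify", "tag"},
    "fluentd": {"fluentd-address", "fluentd-async", "tag"},
}
# Dual-logging cache options apply to remote drivers (Docker Engine 20.10+).
CACHE_OPTIONS = {"cache-disabled", "cache-max-size", "cache-max-file", "cache-compress"}
REMOTE_DRIVERS = {"syslog", "fluentd"}


def logging_block(driver: str, options: Mapping[str, object]) -> Dict[str, dict]:
    """Return {"logging": {...}} for one Compose service, rejecting unknown options."""
    allowed = ALLOWED_OPTIONS.get(driver)
    if allowed is None:
        raise ValueError(f"unsupported driver: {driver}")
    if driver in REMOTE_DRIVERS:
        allowed = allowed | CACHE_OPTIONS
    unknown = sorted(set(options) - allowed)
    if unknown:
        raise ValueError(f"unknown options for {driver}: {', '.join(unknown)}")
    # Log option values are strings in Compose files, so render them as such.
    rendered = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in options.items()
    }
    return {"logging": {"driver": driver, "options": rendered}}
```

The result can be merged into a service definition and written out with a YAML library (for example, PyYAML's `yaml.safe_dump`).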
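Component changes impact analysis
1. Fluent Bit deployment
Verdict:
json-file: Simplest, zero overhead
syslog: Lightweight if rsyslog is already deployed; can centralize logs on the control node
2. Fluent Bit configuration (dual output)
Log location: /var/log/<component>/ with CLP-managed Fluent Bit vs. /var/lib/docker/containers/<id>/ with json-file
Verdict:
json-file: Loses the organized path structure and the S3 pipeline
syslog: Can achieve organized paths via rsyslog templates; still loses the S3 pipeline
3. Service migration
Verdict: All approaches support the same service migration pattern (stdout-based logging, then removing CLP_LOGS_DIR env vars and *volume_clp_logs mounts).
4. S3 configuration
Verdict:
json-file/syslog: Lose automatic S3 upload (ir/<component>/<date>/). Users must implement their own log scanning/shipping if needed.
5. Log-ingestor configuration
Verdict:
json-file/syslog: No automated path to CLP archives (_clp dataset ingestion). Historical operational logs are not searchable in the WebUI.
6. WebUI enhancements
6.1 /os/cat API Endpoint
Log location: /var/log/<component>/ (CLP-managed Fluent Bit), /var/lib/docker/containers/<id>/ (json-file), /var/log/<component>/ (syslog via rsyslog)
Verdict:
json-file: WebUI would need to mount Docker's container directory or use the Docker API
syslog: Can achieve the same organized structure as CLP-managed Fluent Bit via rsyslog templates
6.3 Search Page Enhancement (?dataset=_clp)
Verdict:
json-file/syslog: Historical search feature not available.
Comparing approaches across deployment scenarios
To set up multi-host deployments with syslog:
Run an rsyslog container on the control node (GitHub: puzzle/kubernetes-rsyslog-logging)
Configure rsyslog to write to /var/log/<component>/ using templates
Point each service's syslog-address at the control node
Recommendation: Configurable modes
Consider offering multiple operational logging modes via configuration: simple, centralized, and full.
This allows users to choose based on their deployment complexity and existing infrastructure.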
Kubernetes considerations
In Kubernetes, there are no logging driver configurations at the container level like Docker Compose.
Instead, the container runtime (containerd, CRI-O) handles log collection differently.
How Kubernetes logging works
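Container runtime writes logs to files on the node:
/var/log/containers/<pod>_<namespace>_<container>-<id>.log (a symlink to the file below)
/var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log
Log format: With containerd and CRI-O, each line uses the CRI logging format rather than JSON: a timestamp, the stream, a partial/full flag, then the message. Only the Docker-based runtime (dockershim, removed in Kubernetes 1.24) wrote Docker's json-file format. Conceptually this still resembles Docker's json-file driver: one local file per container.
kubectl logs: Reads from these node files (always works, no driver dependency)
Log rotation: Configured via kubelet, not per-container
Log rotation configuration (Helm)
CLP plans to use Helm for Kubernetes deployments. Note, however, that log rotation is a kubelet (node-level) setting rather than something configured per container. It is controlled by the KubeletConfiguration fields containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5).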
This applies to all containers on the node. Unlike Docker Compose, you cannot configure rotation
per-service.
Default behavior (equivalent to json-file)
With no additional configuration, Kubernetes behaves like Docker's json-file driver:
Logs stored per-node in /var/log/containers/
No centralized access - must access each node separately
kubectl logs <pod> works natively
syslog equivalent
There is no direct syslog logging driver in Kubernetes. Achieving centralized logging requires a
DaemonSet-based log forwarder—which is essentially the Fluent Bit approach described below.
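CLP-managed Fluent Bit (recommended for Kubernetes)
This is the standard pattern for Kubernetes log aggregation: Fluent Bit runs on each node (typically
as a DaemonSet) and tails the container log files written by the runtime.
Key difference from Docker Compose:
Services need no per-service logging: block and no fluentd-async setting; they only write to stdout/stderr.
How it works:
Fluent Bit mounts /var/log/containers/ from the host (read-only), together with /var/log/pods/, which those entries link to
Fluent Bit tails the files and forwards the parsed records to its outputs
This approach achieves the same result as Docker's fluentd logging driver, but through file tailing
rather than network push.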
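With containerd and CRI-O, each line in the node log files follows the CRI logging format. The fields are an RFC 3339 timestamp, the stream (stdout or stderr), and a tag that is F for a full line or P for a partial chunk of a long line, followed by the message. Fluent Bit's tail input has a built-in cri multiline parser for this. The Python sketch below only illustrates what that parsing step does, including rejoining partial chunks per stream. `CriRecord` and `parse_cri` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class CriRecord:
    timestamp: str  # timestamp of the first chunk
    stream: str     # "stdout" or "stderr"
    message: str


def parse_cri(lines: Iterable[str]) -> Iterator[CriRecord]:
    """Parse CRI-format log lines, joining partial ("P") chunks per stream."""
    pending: Dict[str, Tuple[str, List[str]]] = {}
    for raw in lines:
        parts = raw.rstrip("\n").split(" ", 3)
        if len(parts) < 3:
            # This sketch skips malformed lines; a real collector should surface them.
            continue
        timestamp, stream, tags = parts[0], parts[1], parts[2]
        content = parts[3] if len(parts) == 4 else ""
        first_ts, chunks = pending.setdefault(stream, (timestamp, []))
        chunks.append(content)
        # The first tag is "F" (full) or "P" (partial); any further tags are colon-separated.
        if tags.split(":")[0] == "F":
            del pending[stream]
            yield CriRecord(first_ts, stream, "".join(chunks))
```

Partial chunks are tracked per stream because stdout and stderr lines from the same container can interleave in one file.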