Skip to content

feat(clp-rust-utils): Define all CLP package config required by log-ingestor; Define base config for log-ingestor.#1729

Merged
LinZhihao-723 merged 15 commits into
y-scope:mainfrom
LinZhihao-723:log-ingestor-config
Dec 6, 2025
Merged

feat(clp-rust-utils): Define all CLP package config required by log-ingestor; Define base config for log-ingestor.#1729
LinZhihao-723 merged 15 commits into
y-scope:mainfrom
LinZhihao-723:log-ingestor-config

Conversation

@LinZhihao-723

@LinZhihao-723 LinZhihao-723 commented Dec 4, 2025

Copy link
Copy Markdown
Member

Description

NOTE: This PR depends on #1704.

This PR:

  • Adds CLP package configs requried by log-ingestor, including:
    • archive output config
    • logs input config (S3 and Fs)
  • Adds log-ingestor config required by the ingestion server.

The use of these configs can be found in the dev branch LinZhihao-723#29.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensure all workflows pass.
  • Tested within the dev branch to ensure it can deserialize real CLP config properly.

Summary by CodeRabbit

  • New Features
    • Configurable log ingestion added with filesystem and S3 options (including AWS auth support).
    • New archive output settings for target sizes and compression level.
    • Compression job status tracking added with distinct operational states (pending, running, succeeded, failed, killed).

✏️ Tip: You can customize this high-level summary in your review settings.

@LinZhihao-723 LinZhihao-723 requested a review from a team as a code owner December 4, 2025 21:48
@coderabbitai

coderabbitai Bot commented Dec 4, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Adds log ingestion and archive output configuration types, extends Config with those fields, enables serde deserialization for AWS auth structs, and introduces a CompressionJobStatus enum plus job_config module re-exports.

Changes

Cohort / File(s) Summary
Configuration schema & defaults
components/clp-rust-utils/src/clp_config/package/config.rs
Added public types LogIngestor, ArchiveOutput, FsIngestion, S3Ingestion, and LogsInput (serde-tagged). Extended Config with log_ingestor, logs_input, and archive_output. Implemented Default for Config, LogIngestor, ArchiveOutput, and FsIngestion. Added tests for LogsInput deserialization.
S3 serde support
components/clp-rust-utils/src/clp_config/s3_config.rs
Added Deserialize derive and import for AwsAuthentication and AwsCredentials (previously only Serialize).
Job config modules & enum
components/clp-rust-utils/src/job_config.rs, components/clp-rust-utils/src/job_config/compression.rs
Added mod compression; and mod ingestion; plus pub use re-exports in job_config.rs. Introduced CompressionJobStatus enum (Pending=0Killed=4) with serde and numeric conversion derives (IntoPrimitive, TryFromPrimitive).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Review serde tagging and default values in package/config.rs and LogsInput tests.
  • Verify AwsAuthentication / AwsCredentials derive changes in s3_config.rs.
  • Validate numeric representation and conversion traits on CompressionJobStatus in compression.rs.
  • Check public re-exports in job_config.rs and ensure no name collisions.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly aligns with the PR's main objective: adding CLP package configurations required by log-ingestor (archive output, logs input) and defining the base log-ingestor configuration.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7acd69 and 1c53904.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • components/clp-rust-utils/src/clp_config/package/config.rs (4 hunks)
  • components/clp-rust-utils/src/clp_config/s3_config.rs (2 hunks)
  • components/clp-rust-utils/src/job_config.rs (1 hunks)
  • components/clp-rust-utils/src/job_config/compression.rs (1 hunks)
  • components/clp-rust-utils/src/job_config/ingestion.rs (1 hunks)
  • components/log-ingestor/Cargo.toml (1 hunks)
  • components/log-ingestor/src/ingestion_job.rs (1 hunks)
  • components/log-ingestor/src/ingestion_job/s3_scanner.rs (4 hunks)
  • components/log-ingestor/src/ingestion_job/sqs_listener.rs (4 hunks)
  • components/log-ingestor/src/ingestion_job_manager.rs (1 hunks)
  • components/log-ingestor/src/lib.rs (1 hunks)
  • components/log-ingestor/tests/test_ingestion_job.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-01T07:59:11.208Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.

Applied to files:

  • components/clp-rust-utils/src/clp_config/package/config.rs
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Applied to files:

  • components/log-ingestor/Cargo.toml
🧬 Code graph analysis (2)
components/log-ingestor/src/ingestion_job.rs (2)
components/log-ingestor/src/ingestion_job/s3_scanner.rs (2)
  • shutdown_and_join (184-187)
  • get_id (193-195)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (2)
  • shutdown_and_join (219-222)
  • get_id (228-230)
components/log-ingestor/src/ingestion_job_manager.rs (3)
components/log-ingestor/src/ingestion_job.rs (2)
  • from (47-49)
  • from (53-55)
components/log-ingestor/src/ingestion_job/s3_scanner.rs (1)
  • spawn (152-171)
components/log-ingestor/src/ingestion_job/sqs_listener.rs (1)
  • spawn (187-206)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: rust-checks
🔇 Additional comments (32)
components/log-ingestor/src/lib.rs (1)

4-4: LGTM!

The new module exposure properly expands the public API surface for ingestion job management.

components/clp-rust-utils/src/job_config.rs (1)

4-10: LGTM!

The new module organization and re-exports follow the established pattern and properly expand the job_config API surface for compression and ingestion types.

components/log-ingestor/src/ingestion_job.rs (1)

8-56: LGTM!

The IngestionJob enum provides a clean abstraction over S3Scanner and SqsListener with proper delegation, ergonomic From conversions, and thorough documentation. The implementation correctly forwards lifecycle methods to the underlying variants.

components/log-ingestor/src/ingestion_job/sqs_listener.rs (2)

14-22: LGTM!

Consolidating the S3-related configuration into S3IngestionBaseConfig reduces duplication between SqsListenerConfig and S3ScannerConfig while maintaining clear separation of concerns.


144-163: LGTM!

All references to bucket name and key prefix have been correctly updated to access the fields via base.bucket_name and base.key_prefix.

components/log-ingestor/src/ingestion_job/s3_scanner.rs (2)

12-18: LGTM!

Consistent with the refactoring in SqsListenerConfig, this consolidation reduces configuration duplication.


92-120: LGTM!

All references have been correctly updated to use base.bucket_name and base.key_prefix.

components/log-ingestor/Cargo.toml (1)

22-22: The thiserror = "2.0.17" version is valid and current.

Version 2.0.17 is the latest stable release as of December 2025. Verify that Cargo.lock remains in sync with Cargo.toml using a non-mutating check (e.g., cargo check --locked) to maintain deterministic builds, rather than allowing dependency updates during validation.

components/log-ingestor/tests/test_ingestion_job.rs (3)

7-7: LGTM!

Import correctly updated to include S3IngestionBaseConfig from the job_config module.


166-185: LGTM!

The SqsListenerConfig correctly uses the new consolidated S3IngestionBaseConfig structure. All required fields are properly initialized, and the optional fields are appropriately set to None or default values for testing purposes.


252-264: LGTM!

The S3ScannerConfig correctly uses the consolidated S3IngestionBaseConfig structure, consistent with the SqsListenerConfig test above.

components/clp-rust-utils/src/clp_config/s3_config.rs (6)

1-2: LGTM!

Good use of the secrecy crate for handling sensitive credentials.


13-19: LGTM!

Adding Deserialize to AwsAuthentication enables proper configuration deserialization while maintaining the existing serialization capability.


21-26: Good security practice using SecretString for credentials.

This ensures the secret access key is automatically redacted in debug output and requires explicit exposure when needed.


28-37: LGTM!

Custom PartialEq and Eq implementations are necessary since SecretString doesn't implement these traits by default. The implementation correctly exposes the secret only for comparison purposes.


39-49: Review the security implications of exposing secrets during serialization.

The custom Serialize implementation exposes the secret access key in the serialized output. While this may be intentional for configuration persistence, ensure that:

  1. Serialized output is never logged or stored in plaintext.
  2. This aligns with your security requirements for credential handling.

If serialization of credentials to external systems is not required, consider removing Serialize from AwsCredentials to prevent accidental secret exposure.


51-79: LGTM!

Test correctly validates the custom Serialize implementation for AwsCredentials.

components/clp-rust-utils/src/clp_config/package/config.rs (6)

3-3: LGTM!

Public re-export of AwsAuthentication makes it accessible through this module's API.


14-42: LGTM!

The Config struct is properly extended with the new ingestion-related fields, and the Default implementation correctly initializes them.


192-212: LGTM!

LogIngestor struct has sensible default values. The buffer size threshold of 50MB and channel capacity of 10 appear reasonable for ingestion workloads.


242-264: LGTM!

S3Ingestion and FsIngestion structs are well-defined. The default directory of "/" for FsIngestion aligns with the note about syncing with Python defaults.


266-281: LGTM!

The LogsInput enum correctly uses serde's tagged enum with #[serde(flatten)] for a cleaner JSON structure, consistent with StreamOutputStorage in this file.


283-342: LGTM!

Tests provide good coverage for both LogsInput variants, validating the JSON deserialization paths for S3 and filesystem configurations.

components/log-ingestor/src/ingestion_job_manager.rs (9)

19-30: LGTM!

Error enum is well-defined with clear, descriptive variants using thiserror for automatic error message formatting.


41-63: LGTM!

The method correctly extracts the base config, creates the appropriate client manager, and delegates to create_s3_ingestion_job with a closure for job creation.


65-91: LGTM!

Consistent with create_s3_scanner_job, this method properly handles SQS listener creation while reusing the common job registration logic.


106-119: Consider error handling for partial shutdown failures.

The job is removed from the table before shutdown_and_join is called. If shutdown_and_join fails on either the ingestion job or listener, the job entry is already removed, potentially leaving resources in an inconsistent state.

Depending on your requirements:

  1. If removal-before-shutdown is intentional (e.g., to prevent double-removal attempts), document this behavior.
  2. If transactional semantics are needed, consider removing only after successful shutdown.

147-205: LGTM!

The prefix conflict detection logic is correct:

  • Conflicts are detected when region, bucket, and dataset all match AND the prefixes overlap.
  • The UUID generation loop is safe in practice (UUID v4 collision is astronomically unlikely).
  • Lock management is appropriate since client creation happens before this method is called.

207-235: Hardcoded AWS endpoint format may limit flexibility.

The endpoint URLs are hardcoded to https://s3.{region}.amazonaws.com and https://sqs.{region}.amazonaws.com. This works for standard AWS commercial regions but won't support:

  • LocalStack or other mock services for testing
  • AWS GovCloud or China regions (different endpoint formats)
  • Custom VPC endpoints

Consider making the endpoint configurable or deriving it from the AWS SDK's region resolver.


237-243: LGTM!

The unused _ingestion_job_config parameter is intentional per the comment at line 187-188, which notes that the listener is designed to be shared among multiple ingestion jobs in the future.


246-254: LGTM!

IngestionJobTableEntry correctly captures all metadata needed for conflict detection and job lifecycle management.


266-272: LGTM!

The is_mutually_prefix_free function correctly handles all edge cases:

  • Empty strings (correctly returns false as empty string is a prefix of everything)
  • Equal strings (correctly returns false as they conflict)
  • Disjoint prefixes (correctly returns true)

Comment thread components/clp-rust-utils/src/clp_config/package/config.rs
Comment thread components/clp-rust-utils/src/job_config/compression.rs
Comment thread components/clp-rust-utils/src/job_config/ingestion.rs
Comment thread components/log-ingestor/src/ingestion_job_manager.rs
Comment thread components/log-ingestor/src/ingestion_job_manager.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-rust-utils/src/clp_config/package/config.rs (1)

214-240: Missing #[serde(default)] attribute on ArchiveOutput.

Unlike LogIngestor, Database, and other config structs in this file, ArchiveOutput doesn't have #[serde(default)]. This means deserialization will fail if any field is missing in the input configuration.

If partial configuration with defaults is intended, add the attribute:

 /// Mirror of `clp_py_utils.clp_config.ArchiveOutput`.
 ///
 /// # NOTE
 ///
 /// * This type is partially defined: unused fields are omitted and discarded through
 ///   deserialization.
 /// * The default values must be kept in sync with the Python definition.
 #[derive(Clone, Debug, Deserialize, Eq, PartialEq)]
+#[serde(default)]
 pub struct ArchiveOutput {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c53904 and 915af6f.

📒 Files selected for processing (2)
  • components/clp-rust-utils/src/clp_config/package/config.rs (4 hunks)
  • components/clp-rust-utils/src/clp_config/s3_config.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-01T07:59:11.208Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.

Applied to files:

  • components/clp-rust-utils/src/clp_config/package/config.rs
🧬 Code graph analysis (1)
components/clp-rust-utils/src/clp_config/package/config.rs (1)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • AwsAuthentication (547-577)
  • StreamOutput (709-717)
  • ArchiveOutput (693-706)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: build (macos-15)
  • GitHub Check: rust-checks
🔇 Additional comments (6)
components/clp-rust-utils/src/clp_config/s3_config.rs (1)

1-1: LGTM! Deserialize support added for AWS authentication types.

The addition of Deserialize to AwsAuthentication and AwsCredentials properly enables deserialization of these types for the new log ingestion configuration.

Also applies to: 13-13, 21-21

components/clp-rust-utils/src/clp_config/package/config.rs (5)

3-3: LGTM! Config structure properly extended with new fields.

The new log_ingestor, logs_input, and archive_output fields are correctly added to Config with appropriate default initializations.

Also applies to: 19-19, 22-23, 33-33, 36-39


192-212: LGTM! LogIngestor configuration properly defined.

The struct follows the established pattern with #[serde(default)] and a Default implementation. The default values are sensible for a log ingestion service.


266-281: LGTM! LogsInput enum properly defined with tagged deserialization.

The enum structure correctly uses #[serde(tag = "type")] for tagged deserialization and #[serde(flatten)] to inline configuration fields. This pattern is consistent with StreamOutputStorage in the same file.


283-338: LGTM! Comprehensive tests for LogsInput deserialization.

The tests adequately cover both S3 and Fs variants of LogsInput, verifying that the tagged deserialization with flattened configuration works correctly. The assertions properly validate the deserialized values.


242-246: Verify AWS authentication type compatibility with Python definition.

The Rust AwsAuthentication implementation should be confirmed to handle all authentication types that the Python clp_py_utils.clp_config.AwsAuthentication can serialize. If Python supports authentication types beyond credentials (such as profile, ec2, or env_vars), the Rust deserialization will fail if these types are sent from Python. Verify the enum variants in the Rust implementation and test the actual serialization/deserialization flow end-to-end.

Comment thread components/clp-rust-utils/src/clp_config/package/config.rs

@hoophalab hoophalab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. one minor comment

how about the following title

feat(clp-rust-utils): Define all CLP package config required by log-ingestor; Add config for log-ingestor.

Comment thread components/clp-rust-utils/src/clp_config/package/config.rs
@LinZhihao-723 LinZhihao-723 changed the title feat(clp-rust-utils): Define all CLP package config requried by log-ingestor; Define base config for log-ingestor. feat(clp-rust-utils): Define all CLP package config required by log-ingestor; Define base config for log-ingestor. Dec 6, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
components/clp-rust-utils/src/clp_config/package/config.rs (1)

249-264: Revisit / as the default FsIngestion directory; it is overly broad and risky.

Using the filesystem root ("/") as the default FsIngestion::directory effectively makes the default logs_input in Config::default() point at the entire filesystem, which is a substantial risk surface (unexpected data ingestion, permission issues, and potential leakage). A more constrained default (for example a log‑specific directory) or forcing explicit configuration via Option<String> would be safer and more predictable.

Since this was already raised earlier, consider either:

  • Changing the default to a log‑scoped path consistent with the rest of the config, or
  • Making directory mandatory (no Default for FsIngestion) so callers must opt into a path consciously.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 915af6f and d3b176c.

📒 Files selected for processing (1)
  • components/clp-rust-utils/src/clp_config/package/config.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 549
File: components/core/tests/test-ir_encoding_methods.cpp:1180-1186
Timestamp: 2024-10-01T07:59:11.208Z
Learning: In the context of loop constructs, LinZhihao-723 prefers using `while (true)` loops and does not consider alternative loop constructs necessarily more readable.
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.

Applied to files:

  • components/clp-rust-utils/src/clp_config/package/config.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks
  • GitHub Check: build (macos-15)
  • GitHub Check: build (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-rust-utils/src/clp_config/package/config.rs (4)

3-4: AwsAuthentication import is consistent with new S3Ingestion usage.

Pulling AwsAuthentication from crate::clp_config matches the JSON shape used in the tests and keeps the S3-specific config centralised; no issues here.


19-24: Confirm new Config defaults match Python ClpConfig semantics.

The additions of log_ingestor, logs_input (defaulting to FsIngestion::default()), and archive_output into Config::default() look structurally sound and will make these blocks optional when deserialising Config. Please double‑check that:

  • The default host/port for log_ingestor.
  • The choice of LogsInput::Fs as the implicit default.
  • The archive sizing and compression defaults

all match the corresponding Python clp_py_utils.clp_config.ClpConfig defaults, so cross‑language config behaviour stays in sync.

Also applies to: 33-40


214-241: ArchiveOutput default and serde configuration now look correct.

Adding #[serde(default)] plus the explicit Default implementation brings ArchiveOutput in line with other nested config structs and allows partially specified archive configs to deserialize cleanly while filling in sensible defaults. No further issues from my side.


243-247: S3Ingestion keeps S3 concerns focused on authentication.

Having S3Ingestion only carry aws_authentication and leaving other S3 details to higher‑level config mirrors a clean separation of concerns, especially with AwsAuthentication reused from crate::clp_config. The struct is minimal but coherent given the test coverage for the credentials path.

Comment on lines +192 to +212
#[derive(Clone, Debug, Deserialize, Eq, PartialEq)]
#[serde(default)]
pub struct LogIngestor {
pub host: String,
pub port: u16,
pub buffer_timeout_sec: u64,
pub buffer_size_threshold: u64,
pub channel_capacity: usize,
}

impl Default for LogIngestor {
fn default() -> Self {
Self {
host: "localhost".to_owned(),
port: 3333,
buffer_timeout_sec: 300,
buffer_size_threshold: 50 * 1024 * 1024,
channel_capacity: 10,
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Clarify whether LogIngestor mirrors a Python config or is Rust‑only.

Other structs in this file explicitly state when they mirror Python config types, while ApiServer is documented as Rust‑native. For LogIngestor, it would help future maintainers if you add a short doc comment making clear whether it mirrors a Python definition (and should stay in sync) or is a Rust‑only config surface, similar to ApiServer. Based on learnings, this distinction has mattered before.

🤖 Prompt for AI Agents
In components/clp-rust-utils/src/clp_config/package/config.rs around lines 192
to 212, add a short doc comment above the LogIngestor struct that explicitly
states whether this config mirrors a Python counterpart (and must be kept in
sync) or is Rust-native only (like ApiServer); follow the same wording and style
used for other structs in this file (e.g., mention "Mirrors Python config:
<path>" if it mirrors, or "Rust-native config, not present in Python" if not),
and place the comment immediately above the #[derive(...)] so future maintainers
can quickly see the intent.

Comment on lines +267 to +282
/// Mirror of `clp_py_utils.clp_config.ClpConfig.logs_input`.
#[derive(Debug, Clone, PartialEq, Eq, Deserialize)]
#[serde(tag = "type")]
pub enum LogsInput {
#[serde(rename = "fs")]
Fs {
#[serde(flatten)]
config: FsIngestion,
},

#[serde(rename = "s3")]
S3 {
#[serde(flatten)]
config: S3Ingestion,
},
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Serde setup for LogsInput is sound; consider a Default impl as a convenience.

The #[serde(tag = "type")] plus #[serde(flatten)] pattern cleanly matches the JSON in the tests for both fs and s3 variants. Given that Config::default() already chooses LogsInput::Fs { config: FsIngestion::default() }, you may want to add:

impl Default for LogsInput {
    fn default() -> Self {
        LogsInput::Fs {
            config: FsIngestion::default(),
        }
    }
}

This would make LogsInput easier to use standalone and documents the intended default more explicitly, but it’s not strictly required.

🤖 Prompt for AI Agents
In components/clp-rust-utils/src/clp_config/package/config.rs around lines 267
to 282, add a Default impl for LogsInput so it returns LogsInput::Fs { config:
FsIngestion::default() }; implement this by adding an impl Default for LogsInput
with fn default() -> Self { LogsInput::Fs { config: FsIngestion::default() } }
to make LogsInput usable standalone and align with Config::default(); ensure
FsIngestion implements Default (or adjust accordingly).

Comment on lines +289 to +338
#[test]
fn deserialize_logs_input_s3_config() {
const ACCESS_KEY_ID: &str = "YSCOPE";
const SECRET_ACCESS_KEY: &str = "IamSecret";
let logs_input_config_json = serde_json::json!({
"type": "s3",
"aws_authentication": {
"type": "credentials",
"credentials": {
"access_key_id": ACCESS_KEY_ID,
"secret_access_key": SECRET_ACCESS_KEY,
}
}
});

let deserialized =
serde_json::from_str::<LogsInput>(logs_input_config_json.to_string().as_str())
.expect("failed to deserialize `LogsInput` from JSON");

match deserialized {
LogsInput::S3 { config } => match config.aws_authentication {
crate::clp_config::AwsAuthentication::Credentials { credentials } => {
assert_eq!(credentials.access_key_id, ACCESS_KEY_ID);
assert_eq!(credentials.secret_access_key, SECRET_ACCESS_KEY);
}
},
LogsInput::Fs { .. } => panic!("Expected S3"),
}
}

#[test]
fn deserialize_logs_input_fs_config() {
const DIRECTORY: &str = "/var/logs";

let logs_input_config_json = serde_json::json!({
"type": "fs",
"directory": DIRECTORY,
});

let deserialized =
serde_json::from_str::<LogsInput>(logs_input_config_json.to_string().as_str())
.expect("failed to deserialize `LogsInput` from JSON");

match deserialized {
LogsInput::Fs { config } => {
assert_eq!(config.directory, DIRECTORY);
}
LogsInput::S3 { .. } => panic!("Expected Fs"),
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Tests exercise both fs and s3 paths; minor ergonomic improvements possible.

The tests correctly validate deserialisation of LogsInput for both S3 credentials and fs directory cases, which gives good coverage for the new enum wiring. To simplify them slightly, you could avoid the stringify/parse round‑trip by using serde_json::from_value(logs_input_config_json) directly instead of to_string().as_str(), but that’s purely a style/ergonomics tweak.

🤖 Prompt for AI Agents
In components/clp-rust-utils/src/clp_config/package/config.rs around lines 289
to 338, the tests build JSON values then convert to string and parse them back;
simplify by calling serde_json::from_value(logs_input_config_json) directly
instead of serde_json::from_str(logs_input_config_json.to_string().as_str()),
i.e., pass the existing serde_json::Value into from_value to avoid the needless
stringify/parse round-trip while keeping behavior identical.

@LinZhihao-723 LinZhihao-723 merged commit bfc474f into y-scope:main Dec 6, 2025
21 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants