fix(compression_scheduler): Only fetch datasets when using the `clp-s` storage engine (fixes #1214). by kirkrodrigues · Pull Request #1215 · y-scope/clp

kirkrodrigues · 2025-08-18T04:23:07Z

Description

As the title says. The PR implements the solution proposed in #1214.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

With the clp-text package:

sbin/start-clp.sh
sbin/compress.sh ~/logs/hive-24hrs
sbin/search.sh "DESERIALIZE_ERRORS" --count
sbin/stop-clp.sh

With the clp-json package:

sbin/start-clp.sh
sbin/compress.sh --timestamp-key 't.$date' ~/logs/mongod.log
sbin/search.sh 'c: REPL' --count
sbin/stop-clp.sh

Summary by CodeRabbit

New Features
- None; no user-facing changes.
Refactor
- Streamlined compression scheduling by centralizing configuration, reducing parameter complexity.
- Improved handling of storage engines to skip unnecessary dataset lookups when not applicable.
Documentation
- Updated user-facing configuration guidance to reflect the new consolidated setup.
Chores
- Aligned internal calls with the updated configuration structure for consistency and maintainability.

…storage engine (fixes y-scope#1214).

coderabbitai · 2025-08-18T04:23:16Z

Walkthrough

Refactors search_and_schedule_new_tasks to accept CLPConfig as the first parameter. Replaces clp_archive_output usage with clp_config.archive_output. Adds conditional loading of existing_datasets based on storage engine. Updates internal call sites (e.g., main) to pass clp_config. Docstring revised accordingly.

Changes

Cohort / File(s)	Summary
Scheduler compression API update `components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py`	- Function signature changed to use CLPConfig as first param - Replaced clp_archive_output with clp_config.archive_output - Conditionally populate existing_datasets only when storage engine is CLP_S - Updated internal call sites to pass clp_config - Updated docstring

Sequence Diagram(s)

sequenceDiagram
    participant Main
    participant Scheduler as compression_scheduler.search_and_schedule_new_tasks
    participant Storage

    Main->>Scheduler: search_and_schedule_new_tasks(clp_config, db_conn, db_cursor, metadata_db_cfg)
    alt clp_config.package.storage_engine == CLP_S
        Scheduler->>Storage: fetch existing_datasets
        Storage-->>Scheduler: existing_datasets
    else
        Scheduler->>Scheduler: existing_datasets = []
    end
    Scheduler->>Scheduler: add_dataset(..., archive_output=clp_config.archive_output)
    Scheduler-->>Main: return

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (2)

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
171-200: Also guard dataset creation under non-CLP_S engines to avoid unintended metadata operations

Currently, add_dataset may still run when the storage engine is not CLP_S, because existing_datasets is empty and the condition checks only set membership. If non-CLP_S engines don’t support datasets in the same metadata DB, this can fail or create unintended rows. To fully align with the PR goal, gate dataset creation on CLP_S as well.

Apply this diff:
-        if dataset is not None and dataset not in existing_datasets:
-            add_dataset(
-                db_conn,
-                db_cursor,
-                table_prefix,
-                dataset,
-                clp_config.archive_output,
-            )
-
-            # NOTE: This assumes we never delete a dataset when compression jobs are being scheduled
-            existing_datasets.add(dataset)
+        # Only manage datasets when using the CLP_S storage engine
+        if StorageEngine.CLP_S == clp_config.package.storage_engine:
+            if dataset is not None and dataset not in existing_datasets:
+                add_dataset(
+                    db_conn,
+                    db_cursor,
+                    table_prefix,
+                    dataset,
+                    clp_config.archive_output,
+                )
+                # NOTE: This assumes we never delete a dataset when compression jobs are being scheduled
+                existing_datasets.add(dataset)
404-406: Non-blocking: consider using clp_config.compression_scheduler.logging_level instead of env

This keeps the scheduler’s logging level sourced from its own config and avoids cross-scheduler leakage noted in prior discussions.

Apply this minimal change:
-    set_logging_level(logger, os.getenv("CLP_LOGGING_LEVEL"))
+    set_logging_level(logger, clp_config.compression_scheduler.logging_level)
Context: There’s a known issue where the start script sets CLP_LOGGING_LEVEL from the query scheduler. Sourcing directly from clp_config avoids that coupling.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d2e97ce and d06100b.

📒 Files selected for processing (1)

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

📚 Learning: 2025-08-08T06:59:42.436Z

Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

🔇 Additional comments (5)

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5)

171-176: Conditional dataset fetch aligns with PR objective

Fetching existing datasets only when StorageEngine.CLP_S is in use is correct and makes the scheduler agnostic to other storage engines’ metadata layout.

198-199: Using clp_config.archive_output is the right consolidation

This removes the need to thread ArchiveOutput separately and keeps the function self-contained.

432-437: Call site updated to new signature (good)

main correctly passes clp_config as the first argument to search_and_schedule_new_tasks.

156-161: Refactor Confirmed: all call sites updated
The search_and_schedule_new_tasks function now consistently uses the new signature (clp_config, db_conn, db_cursor, clp_metadata_db_connection_config).
No occurrences of clp_archive_output or old call patterns remain—ready to approve.

273-287: Incorrect storage‐engine assumption for tags table
The tags table is created by create_metadata_db_tables only when storage_engine != StorageEngine.CLP_S (i.e. for the default CLP engine), not under CLP_S. The proposed guard (StorageEngine.CLP_S == …) is inverted. If you need to skip tag operations when the table doesn’t exist, guard on StorageEngine.CLP (or equivalently != CLP_S) instead.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-08-18T04:31:42Z

    """
    For all jobs with PENDING status, splits the job into tasks and schedules them.
+    :param clp_config:
    :param db_conn:
    :param db_cursor:
    :param clp_metadata_db_connection_config:
-    :param clp_archive_output:
    """


🧹 Nitpick (assertive)

Docstring is minimal; clarify clp_config usage and storage-engine–specific dataset handling

Please expand the docstring to reflect that datasets are fetched/managed only for the CLP_S storage engine and that archive_output is sourced from clp_config.

Apply this diff to improve the docstring:

def search_and_schedule_new_tasks( clp_config: CLPConfig, db_conn, db_cursor, clp_metadata_db_connection_config: Dict[str, Any], ): """ For all jobs with PENDING status, splits the job into tasks and schedules them. - :param clp_config: + :param clp_config: Full CLP configuration. Note: dataset discovery/metadata updates are + performed only when clp_config.package.storage_engine == StorageEngine.CLP_S. + archive_output is taken from clp_config.archive_output. :param db_conn: :param db_cursor: :param clp_metadata_db_connection_config: """

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"""

For all jobs with PENDING status, splits the job into tasks and schedules them.

:param clp_config:

:param db_conn:

:param db_cursor:

:param clp_metadata_db_connection_config:

:param clp_archive_output:

"""

def search_and_schedule_new_tasks(

clp_config: CLPConfig,

db_conn,

db_cursor,

clp_metadata_db_connection_config: Dict[str, Any],

):

"""

For all jobs with PENDING status, splits the job into tasks and schedules them.

:param clp_config: Full CLP configuration. Note: dataset discovery/metadata updates are

performed only when clp_config.package.storage_engine == StorageEngine.CLP_S.

archive_output is taken from clp_config.archive_output.

:param db_conn:

:param db_cursor:

:param clp_metadata_db_connection_config:

"""

🤖 Prompt for AI Agents

In components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py around lines 162 to 168, the function docstring is too minimal; update it to state that clp_config supplies archive_output and other relevant settings, and that dataset fetching/management only occurs when the storage engine is CLP_S (handle other storage engines differently or skip dataset operations). Modify the docstring to list parameters (clp_config, db_conn, db_cursor, clp_metadata_db_connection_config), describe how clp_config.archive_output is used, and explicitly note storage-engine–specific behavior for dataset handling so callers understand when datasets are fetched and when they are not.

…` storage engine (fixes y-scope#1214). (y-scope#1215) Co-authored-by: Bingran Hu <bingran.hu@yscope.com>

fix(compression_scheduler): Only fetch datasets when using the clp-s …

d06100b

…storage engine (fixes y-scope#1214).

kirkrodrigues requested a review from a team as a code owner August 18, 2025 04:23

kirkrodrigues requested a review from junhaoliao August 18, 2025 04:23

kirkrodrigues changed the title ~~fix(compression_scheduler): Only fetch datasets when using the clp-s storage engine (fixes #1214).~~ fix(compression_scheduler): Only fetch datasets when using the clp-s storage engine (fixes #1214). Aug 18, 2025

coderabbitai Bot reviewed Aug 18, 2025

View reviewed changes

Merge branch 'main' into fix-1214

92a4387

Bill-hbrhbr approved these changes Aug 18, 2025

View reviewed changes

kirkrodrigues merged commit 8618486 into y-scope:main Aug 18, 2025
7 checks passed

kirkrodrigues deleted the fix-1214 branch August 18, 2025 06:30

junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026

fix(compression_scheduler): Only fetch datasets when using the `clp-s…

6f89cb8

…` storage engine (fixes y-scope#1214). (y-scope#1215) Co-authored-by: Bingran Hu <bingran.hu@yscope.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compression_scheduler): Only fetch datasets when using the `clp-s` storage engine (fixes #1214).#1215

fix(compression_scheduler): Only fetch datasets when using the `clp-s` storage engine (fixes #1214).#1215
kirkrodrigues merged 2 commits into
y-scope:mainfrom
kirkrodrigues:fix-1214

kirkrodrigues commented Aug 18, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot commented Aug 18, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Aug 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kirkrodrigues commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kirkrodrigues commented Aug 18, 2025 •

edited

Loading

coderabbitai Bot commented Aug 18, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)