Skip to content

feat(job-orchestration): Add Spider compression task; Move Celery compression task into a dedicated submodule.#1340

Merged
LinZhihao-723 merged 29 commits into
y-scope:mainfrom
sitaowang1998:spider-job-executor
Oct 1, 2025
Merged

feat(job-orchestration): Add Spider compression task; Move Celery compression task into a dedicated submodule.#1340
LinZhihao-723 merged 29 commits into
y-scope:mainfrom
sitaowang1998:spider-job-executor

Conversation

@sitaowang1998

@sitaowang1998 sitaowang1998 commented Sep 26, 2025

Copy link
Copy Markdown
Contributor

Description

This PR

  • Separates celery compression task into separate file.
  • Adds spider compression task.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Celery compression works end-to-end.
  • GitHub workflows pass.

Summary by CodeRabbit

  • New Features

    • Compression exposed as a Celery worker task with graceful shutdown logging.
    • Spider-compatible compression endpoint and payload encoding/decoding helpers for interoperability.
  • Refactor

    • Core compression logic decoupled from worker bindings; logger is now passed through execution paths.
  • Chores

    • Task routing and scheduler wiring updated to use the new worker-oriented compression endpoints.

@sitaowang1998 sitaowang1998 requested a review from a team as a code owner September 26, 2025 03:14
@coderabbitai

coderabbitai Bot commented Sep 26, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Adds a Celery-based compression task and worker-shutdown handler, introduces a Spider wrapper and byte-string utilities for compression, refactors the original compression task into a core plain function compression_entry_point that accepts an explicit logger, updates Celery routing, and switches the scheduler import to the Celery wrapper. (34 words)

Changes

Cohort / File(s) Summary of changes
Celery compression wrapper
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py
Added Celery task compress that delegates to compression_entry_point with a Celery task logger; registered a worker_shutdown signal handler; added Celery-related imports and module-level task logger; task returns the entry point result.
Celery routing config update
components/job-orchestration/job_orchestration/executor/compress/celeryconfig.py
Updated import target and queue route key to point to ...compress.celery_compress.compress instead of the previous ...compress.compression_task.compress.
Core compression refactor
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Removed Celery bindings and shutdown hook; introduced compression_entry_point as a plain function; added an explicit logger parameter and propagated it to run_clp and downstream calls; removed module-level Celery logger.
Scheduler import switch
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Switched import of compress from ...compress.compression_task to ...compress.celery_compress. No behavioral changes beyond import path.
Spider compression wrapper
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
New module exposing compress wrapper for Spider tasks: initializes a logger, converts JSON inputs from list[Int8] to UTF-8 strings, calls compression_entry_point with the logger, serializes result to JSON, and returns it encoded as list[Int8].
Spider byte utilities
components/job-orchestration/job_orchestration/utils/spider_utils.py
New utility functions int8_list_to_utf8_str and utf8_str_to_int8_list for converting between spider_py.Int8 lists and UTF-8 strings; added spider_py import.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Scheduler
  participant Broker as Celery Broker
  participant Worker as Celery Worker
  participant Task as celery_compress.compress
  participant Core as compression_entry_point

  Scheduler->>Broker: Enqueue compress(job_id, task_id, tag_ids, configs)
  Broker->>Worker: Deliver task
  Worker->>Task: Execute task
  note right of Task: Uses Celery task logger\nand delegates to core entry point
  Task->>Core: compression_entry_point(params, logger)
  Core-->>Task: result
  Task-->>Scheduler: result
  rect rgb(245,250,255)
    note over Worker,Task: worker_shutdown signal logs shutdown
  end
Loading
sequenceDiagram
  autonumber
  actor Spider as Spider TaskContext
  participant SpiderWrap as spider_compress.compress
  participant Utils as spider_utils
  participant Core as compression_entry_point

  Spider->>SpiderWrap: Invoke compress(ctx, job_id, task_id, tag_ids, configs_bytes)
  note right of SpiderWrap: Initializes logger "spider_compression"
  SpiderWrap->>Utils: int8_list_to_utf8_str(...) for each JSON input
  SpiderWrap->>Core: compression_entry_point(params, logger)
  Core-->>SpiderWrap: result
  SpiderWrap->>Utils: utf8_str_to_int8_list(json.dumps(result))
  SpiderWrap-->>Spider: result (encoded bytes/list[Int8])
Loading
sequenceDiagram
  autonumber
  participant Old as compression_task (previous)
  participant New as compression_task (refactored)
  participant Core as compression_entry_point

  note over Old: Previously Celery-bound with module logger and signals
  note over New: Now plain function(s) with explicit logger parameter
  New->>Core: compression_entry_point(params, logger)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title accurately and succinctly describes the two primary objectives of the pull request—adding a Spider compression task and relocating the Celery compression task—using conventional commit syntax, making it clear what feature changes are introduced in the job-orchestration scope.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

521-571: Add missing logger argument to downstream wrappers

  • In components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (line 18) and celery_compress.py (line 26), calls to general_compress omit the required logger parameter. Update each wrapper’s signature to accept a logger and pass it as the last argument to general_compress.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa811fe and b1ac863.

📒 Files selected for processing (5)
  • components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/celeryconfig.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
🧬 Code graph analysis (4)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (2)
components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (1)
  • compress (9-26)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • general_compress (521-602)
components/job-orchestration/job_orchestration/executor/compress/celeryconfig.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • SchedulerType (9-11)
components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (3)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • general_compress (521-602)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (3)
components/job-orchestration/job_orchestration/executor/compress/celeryconfig.py (1)

9-16: Queue route updated to match relocated task

The module path and routing key now align with celery_compress.compress, so the worker will continue to discover the task on the intended queue.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

34-34: Scheduler import follows task relocation

Import now targets the Celery-facing wrapper, keeping the existing compress.s(...) scheduling flow intact.

components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)

11-34: Celery wrapper cleanly delegates to shared core

get_task_logger keeps Celery-managed logging while centralising the heavy lifting in general_compress, which keeps behaviour consistent across execution contexts.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ba30e6 and 4330fc0.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py (3)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • general_compress (521-602)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)

Comment on lines +1 to +17
from clp_py_utils.clp_logging import get_logger
from job_orchestration.executor.compress.compression_task import general_compress
from spider_py import TaskContext

# Setup logging
logger = get_logger("spider_compression_scheduler")


def compress(
_: TaskContext,
job_id: int,
task_id: int,
tag_ids: list[int] | None,
clp_io_config_json: str,
paths_to_compress_json: str,
clp_metadata_db_connection_config: dict[str, bool | int | str | None],
) -> dict[str, int | float | str | None]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Widen the metadata config annotation.

Database.parse_obj consumes the raw JSON payload, which can contain nested dict structures or scalar types outside bool | int | str | None (e.g., floats, sub-mappings for SSL/client options). The new hint is therefore misleading—it flags legitimate configs as incompatible even though runtime accepts them. Please keep this argument typed as a generic mapping.

-from clp_py_utils.clp_logging import get_logger
-from job_orchestration.executor.compress.compression_task import general_compress
-from spider_py import TaskContext
+from typing import Any, Mapping
+
+from clp_py_utils.clp_logging import get_logger
+from job_orchestration.executor.compress.compression_task import general_compress
+from spider_py import TaskContext
@@
-    clp_metadata_db_connection_config: dict[str, bool | int | str | None],
+    clp_metadata_db_connection_config: Mapping[str, Any],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from clp_py_utils.clp_logging import get_logger
from job_orchestration.executor.compress.compression_task import general_compress
from spider_py import TaskContext
# Setup logging
logger = get_logger("spider_compression_scheduler")
def compress(
_: TaskContext,
job_id: int,
task_id: int,
tag_ids: list[int] | None,
clp_io_config_json: str,
paths_to_compress_json: str,
clp_metadata_db_connection_config: dict[str, bool | int | str | None],
) -> dict[str, int | float | str | None]:
from typing import Any, Mapping
from clp_py_utils.clp_logging import get_logger
from job_orchestration.executor.compress.compression_task import general_compress
from spider_py import TaskContext
# Setup logging
logger = get_logger("spider_compression_scheduler")
def compress(
_: TaskContext,
job_id: int,
task_id: int,
tag_ids: list[int] | None,
clp_io_config_json: str,
paths_to_compress_json: str,
clp_metadata_db_connection_config: Mapping[str, Any],
) -> dict[str, int | float | str | None]:
...
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/executor/compress/spider_comrpess.py
around lines 1 to 17, the clp_metadata_db_connection_config parameter is
currently annotated as dict[str, bool | int | str | None] which is too narrow
and rejects valid nested or scalar values; change its type to a generic mapping
such as dict[str, Any] (or Mapping[str, Any]) and add the necessary import for
Any from typing, so the function accepts arbitrary JSON-like nested structures
without altering runtime behavior.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I post other comments: passing the logger by adding a param looks like a hack tbh...
We should probably create a function to setup/get the global logger instead.

_: TaskContext,
job_id: int,
task_id: int,
tag_ids: list[int] | None,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the package is running on Python3.9, I think we should use Union or Optional instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about from __future__ import annotations?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Just make sure it's compatible with Python 3.9.



def compress(
_: TaskContext,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to use this variable in the future PRs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Depend on how we plan the failure handling.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can u post a draft PR for the actual integration? Without seeing how things get integrated, it's a little hard to determine whether this PR is the best practice.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03093e7 and 7cdc53c.

📒 Files selected for processing (2)
  • components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.108Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
📚 Learning: 2025-09-28T15:00:22.108Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.108Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/celery_compress.py
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (2)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (3)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1)
  • compress (11-43)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
components/clp-py-utils/clp_py_utils/sql_adapter.py (1)
  • connect (31-51)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (3)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cdc53c and 8e07c38.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (168-269)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e07c38 and 14a8efc.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (168-269)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (2)

10-10: Align logger name with module name.

The logger is named "spider_compression_scheduler" but the module is spider_compress.py. This inconsistency makes logs harder to trace.

Apply this diff:

-logger = get_logger("spider_compression_scheduler")
+logger = get_logger("spider_compress")

1-7: Add from __future__ import annotations for forward-compatible type hints.

The file uses modern type syntax like list[Int8] (lines 17-21) which requires Python 3.9+. Adding the __future__ import enables this syntax to work reliably across Python 3.9 environments and aligns with the codebase convention.

Based on learnings.

Apply this diff:

+from __future__ import annotations
+
 import json

 from clp_py_utils.clp_config import Database
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de5eca4 and a40e8a0.

📒 Files selected for processing (2)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/utils/spider_utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (177-287)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
components/job-orchestration/job_orchestration/utils/spider_utils.py (2)
  • int8_list_to_utf8_str (4-11)
  • utf8_str_to_int8_list (14-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (3)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1)

38-52: LGTM: JSON serialization flow is correct.

The function properly:

  1. Converts Spider's Int8 lists to UTF-8 strings for JSON inputs
  2. Parses the database config using Pydantic
  3. Calls compression_entry_point which returns a dict (via .model_dump())
  4. Serializes the result dict with json.dumps
  5. Converts back to Int8 list for Spider

This matches the expected Spider task interface.

components/job-orchestration/job_orchestration/utils/spider_utils.py (2)

4-11: LGTM: Conversion logic is sound.

The function correctly constructs bytes from the Int8 list by casting to int, then decodes to UTF-8. This is the proper way to deserialize Spider's byte representation.


14-20: LGTM: Inverse conversion is correctly structured.

The function properly encodes the string to UTF-8 bytes and wraps each byte as a spider_py.Int8. Once the variable name bug on line 21 is fixed, this will work as intended.

Comment thread components/job-orchestration/job_orchestration/utils/spider_utils.py Outdated
@LinZhihao-723 LinZhihao-723 changed the title feat(orchestration): Add spider compression task. feat(clp-package): Add Spider compression task; Split Celery compression task into a dedicated namespace. Oct 1, 2025
@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package): Add Spider compression task; Split Celery compression task into a dedicated namespace. feat(clp-package): Add Spider compression task; Move Celery compression task to a dedicated namespace. Oct 1, 2025
@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package): Add Spider compression task; Move Celery compression task to a dedicated namespace. feat(clp-package): Add Spider compression task; Move Celery compression task to a dedicated submodule. Oct 1, 2025
@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package): Add Spider compression task; Move Celery compression task to a dedicated submodule. feat(clp-package): Add Spider compression task; Move Celery compression task into a dedicated submodule. Oct 1, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1)

9-10: Logger name doesn't match module name.

The logger is named "spider_compression_scheduler" but the module is spider_compress.py. Consider renaming to "spider_compress" for consistency.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a40e8a0 and 46ba68d.

📒 Files selected for processing (2)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/utils/spider_utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (5)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (177-287)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py (1)
  • compress (17-34)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
components/job-orchestration/job_orchestration/utils/spider_utils.py (2)
  • int8_list_to_utf8_str (6-13)
  • utf8_str_to_int8_list (16-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
🔇 Additional comments (3)
components/job-orchestration/job_orchestration/utils/spider_utils.py (2)

6-13: LGTM!

The function correctly converts spider_py.Int8 values to Python integers, constructs a bytes object, and decodes it as UTF-8.


16-23: LGTM!

The function correctly encodes the UTF-8 string and wraps each byte in spider_py.Int8. The previous issue with the undefined variable has been resolved.

components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1)

38-52: Result serialization is correct.

The function properly serializes the dict returned by compression_entry_point to JSON, then converts it to list[Int8] for Spider's return type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Oct 1, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Oct 1, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package): Add Spider compression task; Move Celery compression task into a dedicated submodule. feat(job-orchestration): Add Spider compression task; Move Celery compression task into a dedicated submodule. Oct 1, 2025
@LinZhihao-723

Copy link
Copy Markdown
Member

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Oct 1, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c231fc6 and 9574b1e.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-03T13:33:27.460Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-03T13:33:27.460Z
Learning: In the CLP codebase, `from __future__ import annotations` imports are used to enable Python pipe notation for union types (e.g., `int | str`) in Python versions before 3.10, and should not be considered unused imports.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • Database (177-287)
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
  • get_logger (17-25)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
  • compression_entry_point (521-602)
components/job-orchestration/job_orchestration/utils/spider_utils.py (2)
  • int8_list_to_utf8_str (6-13)
  • utf8_str_to_int8_list (16-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/executor/compress/spider_compress.py (1)

1-10: No changes needed for import annotations

PEP 585 (introduced in Python 3.9) enables built-in collection generics (e.g., list[int]) without requiring from __future__ import annotations, so the imports and type annotations can remain as-is.

Likely an incorrect or invalid review comment.

@LinZhihao-723

Copy link
Copy Markdown
Member

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Oct 1, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Database.model_validate(
json.loads(int8_list_to_utf8_str(clp_metadata_db_connection_config_json))
),
json.loads(int8_list_to_utf8_str(clp_metadata_db_connection_config_json)),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sitaowang1998 compression_entry_point requires the database config to be a dictionary. I believe this was a mistake.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Tested that the celeray compression worked.
  • Can't test Spider compression since the scheduler side is not ready.

@LinZhihao-723 LinZhihao-723 merged commit 9e10b99 into y-scope:main Oct 1, 2025
23 checks passed
@sitaowang1998 sitaowang1998 deleted the spider-job-executor branch October 2, 2025 00:53
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…pression task into a dedicated submodule. (y-scope#1340)

Co-authored-by: LinZhihao-723 <zh.lin@mail.utoronto.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants