Skip to content

fix(clp-package): Write compression task failure errors to a log, and store the log path (instead of the errors) in status_msg (fixes #716).#1425

Merged
quinntaylormitchell merged 14 commits into
y-scope:mainfrom
quinntaylormitchell:fix-716
Oct 28, 2025

Conversation

@quinntaylormitchell

@quinntaylormitchell quinntaylormitchell commented Oct 16, 2025

Copy link
Copy Markdown
Collaborator

Description

This PR changes the way errors are reported in the case that a compression job fails due to one or more of its compression tasks failing. In this case, the status_msg which holds information about the failed task(s) is written to a user log file at /clp-package/var/log/user, and the path to this log file is reported to the user through status_msg.

If the full path to this user log were to be stored in status_msg, then the length of status_msg would possibly exceed that field's 512-char limit. We instead store the path relative to the clp_config.logs_directory path. The absolute path is then constructed when it is reported to the user via the command line.

Note: this PR applies the same "path-length guard" fix to the already-existing "invalid input path" compression job failure sequence, as there was no guard against exceeding 512 characters in that sequence.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Revoked read permissions on all files in mongodb, a 65GB dataset and then attempted to compress the dataset. The compression job launched 260 tasks, all of which failed because of the lack of read permissions. compression_scheduler does not crash, the user log is written accurately, and the path to this user log is reported accurately via the command line.

In addition, tested and confirmed that the user log path is reported accurately in the case of a compression job failing due to an invalid input path.

Summary by CodeRabbit

  • New Features

    • Scheduler now creates user-facing logs at relative, user-accessible paths for invalid inputs and compression task failures; status messages reference those relative log paths.
  • Bug Fixes

    • Log creation/write failures are detected and surfaced (write failures reported and may abort processing).
    • Error collection improved to capture multiple task errors, producing clearer user-facing reports.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner October 16, 2025 01:29
@coderabbitai

coderabbitai Bot commented Oct 16, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Replaced _write_failed_path_log with _write_user_failure_log(title, content, logs_directory, job_id, filename_suffix) which writes a user-facing log under logs_directory/user/ and returns a relative Path or None. poll_running_jobs now accepts logs_directory: Path; scheduling and polling use the new logger and surface relative log paths in messages and errors.

Changes

Cohort / File(s) Summary
Compression scheduler updates
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Replaced _write_failed_path_log with _write_user_failure_log(title: str, content: List[str], logs_directory: Path, job_id: Any, filename_suffix: str) -> Optional[Path]. Writes logs_directory/user/job_{job_id}_{filename_suffix}.txt (title + content), creates directories, returns a relative Path or None on error. search_and_schedule_new_tasks now calls it with title="Failed input paths log.", filename_suffix="failed_paths" and raises if write fails. poll_running_jobs signature changed to poll_running_jobs(logs_directory: Path, db_conn, db_cursor), collects task errors as List[str], calls _write_user_failure_log with title="Compression task errors.", filename_suffix="task_errors", and updates status messages to reference returned relative user log paths. Call sites updated to pass logs_directory.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Scheduler
  participant Search as "search_and_schedule_new_tasks"
  participant Poll as "poll_running_jobs"
  participant Logger as "_write_user_failure_log"

  rect rgb(235,245,255)
    Scheduler->>Search: validate inputs
    alt invalid inputs
      Search->>Logger: _write_user_failure_log("Failed input paths log.", content, logs_directory, job_id, "failed_paths")
      Logger-->>Search: returns relative path or None
      alt returned path
        Search-->>Scheduler: raise/error referencing relative log path
      else None
        Search-->>Scheduler: raise/error indicating log write failure
      end
    end
  end

  rect rgb(245,255,235)
    Scheduler->>Poll: poll_running_jobs(logs_directory, db_conn, db_cursor)
    Poll->>Poll: detect task failures -> collect List[str]
    alt task failures
      Poll->>Logger: _write_user_failure_log("Compression task errors.", content, logs_directory, job_id, "task_errors")
      Logger-->>Poll: returns relative path or None
      alt returned path
        Poll-->>Scheduler: status/error referencing relative log path
      else None
        Poll-->>Scheduler: status/error indicating log write failure
      end
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify poll_running_jobs call sites now receive logs_directory.
  • Confirm _write_user_failure_log creates directories, writes UTF-8, returns a relative Path or None on error, and logs failures.
  • Check search_and_schedule_new_tasks and poll_running_jobs use the specified title and filename_suffix values and update status/error messages to reference relative paths.

Possibly related issues

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix(clp-package): Write compression task failure errors to a log, and store the log path (instead of the errors) in status_msg (fixes #716)" accurately describes the primary changes in the changeset. The title directly maps to the main objectives: the PR replaces direct error storage with a new user-facing log file approach, updates _write_user_failure_log() to handle error logging, modifies poll_running_jobs() to support this new workflow, and ensures status_msg references relative log paths rather than embedding the errors themselves. The title is specific, concise, uses clear terminology without vague phrases, and would enable a teammate scanning the repository history to quickly understand that this fix addresses how compression task failures are reported and stored.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10efcd7 and dabf01f.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (7 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

187-187: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


217-217: Do not catch blind exception: Exception

(BLE001)


218-218: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

551-552: Call site updated correctly.

Passing clp_config.logs_directory into poll_running_jobs aligns with the new signature. LGTM.


400-404: Signature migration verified—all call sites updated.

The search found only one call site at line 551, which is already using the new 3-argument form poll_running_jobs(clp_config.logs_directory, db_conn, db_cursor). No stale 2-argument calls remain anywhere in the codebase.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

213-242: Log exceptions before returning None.

Similar to _write_failed_compression_log, the exception handlers should log errors before returning None.

Apply this diff:

     user_logs_dir = Path(logs_directory) / "user"
     try:
         user_logs_dir.mkdir(parents=True, exist_ok=True)
-    except Exception:
+    except Exception as e:
+        logger.error(f"Failed to create user logs directory: {e}")
         return None
 
     log_path = user_logs_dir / f"failed_paths_{job_id}.txt"
     try:
         with log_path.open("w", encoding="utf-8") as f:
             timestamp = datetime.datetime.now().isoformat(timespec="seconds")
             f.write(f"Failed input paths log.\nGenerated at {timestamp}.\n\n")
             for msg in invalid_path_messages:
                 f.write(f"{msg.rstrip()}\n")
-    except Exception:
+    except Exception as e:
+        logger.error(f"Failed to write invalid paths log: {e}")
         return None

235-235: Consider adding timezone to datetime.now().

For consistency with the suggestion in _write_failed_compression_log, consider using datetime.datetime.now(datetime.timezone.utc).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06297c0 and bc3f8cc.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/general.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (6 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • CLPConfig (576-769)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.0)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

183-183: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


198-198: Do not catch blind exception: Exception

(BLE001)


204-204: datetime.datetime.now() called without a tz argument

(DTZ005)


207-207: Do not catch blind exception: Exception

(BLE001)


214-214: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py

94-94: Logging statement uses f-string

(G004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (6)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

93-94: LGTM! Environment variable expansion enables macro paths.

The expansion of environment variables in status_msg correctly enables the use of macro paths like ${CLP_LOGS_DIR}/... introduced by this PR. Since CLP_LOGS_DIR is set at line 247 before job handling, it will be available for expansion here.

Note: The static analysis hint about f-strings in logging (G004) is a minor style preference and doesn't affect functionality.


247-247: LGTM! Setting CLP_LOGS_DIR enables macro path expansion.

Setting CLP_LOGS_DIR after loading the configuration ensures the environment variable is available for macro expansion when the job fails and the status message is logged (line 93).

components/clp-package-utils/clp_package_utils/general.py (1)

436-439: LGTM! Defensive check prevents double-appending hostname.

The conditional check ensures that the hostname is only appended to data_directory and logs_directory if it's not already the final path component. This prevents issues where the configuration might already include the hostname or where the function is called multiple times.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)

307-318: LGTM! Macro path construction handles 512-char limit.

The use of macro paths (${CLP_LOGS_DIR}/...) with relative log paths is a clean solution to avoid exceeding the 512-character status_msg limit. The fallback to the scheduler log when user log writing fails is appropriate.


417-417: LGTM! Function signature updated to accept clp_config.

The addition of the clp_config parameter is necessary for accessing logs_directory when writing failure logs.


560-560: LGTM! Updated function call to pass clp_config.

The call to poll_running_jobs now correctly passes clp_config as required by the updated function signature.

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path resolution and handling may be simplified once the Docker Compose changes in #1178 are merged, so it might be worth holding off on these changes

Comment thread components/clp-package-utils/clp_package_utils/general.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)

205-205: Add timezone to datetime.now().

For consistency and to avoid timezone-related issues, use datetime.datetime.now(datetime.timezone.utc) instead of datetime.datetime.now().

Apply this diff:

         with log_path.open("w", encoding="utf-8") as f:
-            timestamp = datetime.datetime.now().isoformat(timespec="seconds")
+            timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
             f.write(f"Failed compression job log.\nGenerated at {timestamp}.\n\n")

208-210: Consider using logger.error instead of logger.warning for critical failures.

When the compression failure log cannot be written, this prevents users from seeing detailed failure information. Using logger.error would better reflect the severity and make debugging easier.

Apply this diff:

     try:
         with log_path.open("w", encoding="utf-8") as f:
             timestamp = datetime.datetime.now().isoformat(timespec="seconds")
             f.write(f"Failed compression job log.\nGenerated at {timestamp}.\n\n")
             f.write(f"{status_msg.rstrip()}\n")
     except Exception as e:
-        logger.warning(f"Failed to write compression failure log: {e}")
+        logger.error(f"Failed to write compression failure log: {e}")
         return None

198-200: Consider using logger.error instead of logger.warning for critical failures.

When the user logs directory cannot be created, this prevents users from seeing detailed failure information. Using logger.error would better reflect the severity and make debugging easier.

Apply this diff:

     try:
         user_logs_dir.mkdir(parents=True, exist_ok=True)
     except Exception as e:
-        logger.warning(f"Failed to create user logs directory: {e}")
+        logger.error(f"Failed to create user logs directory: {e}")
         return None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc3f8cc and 039e547.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • CLPConfig (576-769)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.0)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

183-183: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


198-198: Do not catch blind exception: Exception

(BLE001)


199-199: Logging statement uses f-string

(G004)


205-205: datetime.datetime.now() called without a tz argument

(DTZ005)


208-208: Do not catch blind exception: Exception

(BLE001)


209-209: Logging statement uses f-string

(G004)


216-216: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


231-231: Do not catch blind exception: Exception

(BLE001)


232-232: Logging statement uses f-string

(G004)


242-242: Do not catch blind exception: Exception

(BLE001)


243-243: Logging statement uses f-string

(G004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

311-322: LGTM! Clean error handling with appropriate fallback.

The macro path construction and fallback to the scheduler log when user log creation fails is well-implemented. The error message clearly directs users to the relevant log file.


421-421: LGTM! Function signature correctly updated.

Adding clp_config parameter enables access to logs_directory for writing user logs, which aligns with the PR objectives.

@quinntaylormitchell quinntaylormitchell changed the title fix: Write failed compression log and store path in status_msg (fixes #716). fix: Write failed compression log to /var/log/user and store log path in status_msg (fixes #716). Oct 23, 2025
clp_config.validate_logs_input_config()
clp_config.validate_logs_dir()
clp_config.database.load_credentials_from_env()
os.environ["CLP_LOGS_DIR"] = str(clp_config.logs_directory)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the CLP_LOGS_DIR environment variable has already been set in the compression worker container in

CLP_LOGS_DIR: "/var/log/compression_worker"
.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why, but when I remove this line, the "CLP_LOGS_DIR" in the path stored in status_msg can't be expanded with os.path.expandvars(), and we get

Compression failed. One or more compression tasks failed. See ${CLP_LOGS_DIR}/user/failed_compression_log_3.txt for more details.

as our ERROR output. any thoughts as to why that might be happening?

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1936756 and 5a99b0d.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

10-10: typing.Dict is deprecated, use dict instead

(UP035)


10-10: typing.List is deprecated, use list instead

(UP035)


10-10: typing.Set is deprecated, use set instead

(UP035)


184-184: Use X | Y for type annotations

Convert to X | Y

(UP007)


186-186: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


220-220: Do not catch blind exception: Exception

(BLE001)


221-221: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5)

10-10: LGTM! Import addition is necessary.

The Union import is required for the new _write_user_failure_log function signature.


289-304: LGTM! Error handling is appropriate.

The invalid path handling correctly writes the user log, validates the result, and provides a clear error message referencing the relative log path within the configured logs directory. The RuntimeError propagation is appropriate since _write_user_failure_log already logs the failure details.


403-403: LGTM! Signature update aligns with past feedback.

The addition of logs_directory: Path parameter is necessary for writing user failure logs and aligns with the agreed approach to pass only the required parameter instead of the entire config object.

Based on learnings.


454-470: LGTM! Consistent error handling pattern.

The failed compression job handling mirrors the approach used for invalid paths (lines 289-304), maintaining consistency across failure scenarios. The error message clearly directs users to the log file location within their configured logs directory.


553-553: LGTM! Call site properly updated.

The call to poll_running_jobs correctly passes clp_config.logs_directory to match the updated function signature.

Comment on lines +183 to +224
def _write_user_failure_log(
content: Union[str, List[str]],
logs_directory: Path,
job_id: Any,
filename_prefix: str,
header_title: str,
) -> Optional[Path]:
"""
Writes the error messages in `invalid_path_messages` to a log file,
`{logs_directory}/user/failed_paths_{job_id}.txt`. The directory will be created if it doesn't
already exist.
:param invalid_path_messages:
Writes a user-visible failure log to `{logs_directory}/user/{filename_prefix}_{job_id}.txt`. The
/user directory will be created if it doesn't already exist. `content` may be either a single
string or a List of strings.

:param content:
:param logs_directory:
:param job_id:
:return: Path to the written log file or `None` if error is encountered.
:param filename_prefix:
:param header_title:
:return: Path to the written log file relative to `logs_directory`, or None on error.
"""

user_logs_dir = Path(logs_directory) / "user"
relative_log_path = Path("user") / f"{filename_prefix}_{job_id}.txt"
user_logs_dir = Path(logs_directory) / relative_log_path.parent
try:
user_logs_dir.mkdir(parents=True, exist_ok=True)
except Exception:
except Exception as e:
logger.error("Failed to create user logs directory: '%s' - %s", user_logs_dir, e)
return None

log_path = user_logs_dir / f"failed_paths_{job_id}.txt"
log_path = Path(logs_directory) / relative_log_path
try:
with log_path.open("w", encoding="utf-8") as f:
timestamp = datetime.datetime.now().isoformat(timespec="seconds")
f.write(f"Failed input paths log.\nGenerated at {timestamp}.\n\n")
for msg in invalid_path_messages:
f.write(f"{msg.rstrip()}\n")
except Exception:
f.write(f"{header_title}\nGenerated at {timestamp}.\n\n")
if isinstance(content, str):
f.write(f"{content.rstrip()}\n")
else:
for item in content:
f.write(f"{str(item).rstrip()}\n")
except Exception as e:
logger.error("Failed to write compression failure user log: '%s' - %s", log_path, e)
return None

return log_path
return relative_log_path

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Use logger.exception for exception handlers.

The exception handlers currently use logger.error, but logger.exception is preferred as it automatically includes the stack trace, which is valuable for debugging log writing failures.

Apply this diff:

     try:
         user_logs_dir.mkdir(parents=True, exist_ok=True)
     except Exception as e:
-        logger.error("Failed to create user logs directory: '%s' - %s", user_logs_dir, e)
+        logger.exception("Failed to create user logs directory: '%s'", user_logs_dir)
         return None
 
     log_path = Path(logs_directory) / relative_log_path
     try:
         with log_path.open("w", encoding="utf-8") as f:
             timestamp = datetime.datetime.now().isoformat(timespec="seconds")
             f.write(f"{header_title}\nGenerated at {timestamp}.\n\n")
             if isinstance(content, str):
                 f.write(f"{content.rstrip()}\n")
             else:
                 for item in content:
                     f.write(f"{str(item).rstrip()}\n")
     except Exception as e:
-        logger.error("Failed to write compression failure user log: '%s' - %s", log_path, e)
+        logger.exception("Failed to write compression failure user log: '%s'", log_path)
         return None

Note: logger.exception automatically includes the exception details, so the e parameter can be removed from the format string.

🧰 Tools
🪛 Ruff (0.14.1)

184-184: Use X | Y for type annotations

Convert to X | Y

(UP007)


186-186: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


220-220: Do not catch blind exception: Exception

(BLE001)


221-221: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
lines 183-224: the except handlers use logger.error and pass the exception
object into the format string; change both to logger.exception and remove the
exception argument from the format string so the stack trace is automatically
logged (i.e., call logger.exception("Failed to create user logs directory:
'%s'", user_logs_dir) and logger.exception("Failed to write compression failure
user log: '%s'", log_path)).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

206-207: Use logger.exception for automatic stack trace inclusion.

The exception handlers currently use logger.error, but logger.exception automatically includes the stack trace, which is valuable for debugging log writing failures.

Apply this diff:

     try:
         user_logs_dir.mkdir(parents=True, exist_ok=True)
-    except Exception as e:
-        logger.error("Failed to create user logs directory: '%s' - %s", user_logs_dir, e)
+    except Exception:
+        logger.exception("Failed to create user logs directory: '%s'", user_logs_dir)
         return None
 
     log_path = logs_directory / relative_log_path
     try:
         with log_path.open("w", encoding="utf-8") as f:
             timestamp = datetime.datetime.now().isoformat(timespec="seconds")
             f.write(f"{title}\nGenerated at {timestamp}.\n\n")
             for item in content:
                 f.write(f"{item.rstrip()}\n")
-    except Exception as e:
-        logger.error("Failed to write compression failure user log: '%s' - %s", log_path, e)
+    except Exception:
+        logger.exception("Failed to write compression failure user log: '%s'", log_path)
         return None

Based on learnings

Also applies to: 217-218


213-213: Add timezone to datetime.now() for consistency.

For consistency and to avoid timezone-related issues, use datetime.datetime.now(datetime.timezone.utc) instead of datetime.datetime.now().

Apply this diff:

-            timestamp = datetime.datetime.now().isoformat(timespec="seconds")
+            timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a99b0d and 45bc548.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (8 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

10-10: typing.Dict is deprecated, use dict instead

(UP035)


10-10: typing.List is deprecated, use list instead

(UP035)


10-10: typing.Set is deprecated, use set instead

(UP035)


186-186: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


217-217: Do not catch blind exception: Exception

(BLE001)


218-218: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5)

286-301: LGTM! Consistent error handling for invalid input paths.

The error handling correctly writes the failure log and provides clear guidance to users about where to find detailed error information. The relative path approach ensures the status message stays within the 512-character limit.


400-400: LGTM! Parameter addition enables user log writing.

Adding the logs_directory parameter enables the function to write user-facing failure logs, aligning with the PR's objectives.


411-411: LGTM! Improved error collection structure.

Changing error_message from a string to List[str] enables cleaner collection of individual task errors and integrates well with the new _write_user_failure_log function.

Also applies to: 428-428


451-466: LGTM! Consistent error handling for task failures.

The error handling pattern mirrors the invalid input path handling (lines 286-301), writing detailed errors to a user log and providing clear guidance about the log location. The approach successfully avoids exceeding the 512-character limit for status_msg.


549-549: LGTM! Correct parameter passing.

The call site correctly passes clp_config.logs_directory to match the updated function signature.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

411-429: Name reflects plurality.

error_message is a list. Rename to error_messages for clarity; update uses.

-        error_message: List[str] = []
+        error_messages: List[str] = []
@@
-                    job_success = False
-                    error_message.append(f"task {task_result.task_id}: {task_result.error_message}")
+                    job_success = False
+                    error_messages.append(
+                        f"task {task_result.task_id}: {task_result.error_message}"
+                    )

434-436: Log stack trace when result retrieval fails.

Use logger.exception to aid debugging.

-        except Exception as e:
-            logger.error(f"Error while getting results for job {job_id}: {e}")
+        except Exception:
+            logger.exception("Error while getting results for job %s", job_id)
             job_success = False
♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

202-221: Harden logging and time handling in user-log writer.

  • Catch OSError instead of blind Exception (BLE001).
  • Use logger.exception to capture stack traces (TRY400).
  • Use UTC, timezone-aware timestamp (DTZ005).
     user_logs_dir = logs_directory / relative_log_path.parent
     try:
         user_logs_dir.mkdir(parents=True, exist_ok=True)
-    except Exception as e:
-        logger.error("Failed to create user logs directory: '%s' - %s", user_logs_dir, e)
+    except OSError:
+        logger.exception("Failed to create user logs directory: '%s'", user_logs_dir)
         return None

     log_path = logs_directory / relative_log_path
     try:
         with log_path.open("w", encoding="utf-8") as f:
-            timestamp = datetime.datetime.now().isoformat(timespec="seconds")
+            timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat(
+                timespec="seconds"
+            )
             f.write(f"{title}\nGenerated at {timestamp}.\n\n")
             for item in content:
                 f.write(f"{item.rstrip()}\n")
-    except Exception as e:
-        logger.error("Failed to write compression failure user log: '%s' - %s", log_path, e)
+    except OSError:
+        logger.exception("Failed to write compression failure user log: '%s'", log_path)
         return None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45bc548 and 1f579f2.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (7 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

186-186: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


217-217: Do not catch blind exception: Exception

(BLE001)


218-218: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

Comment on lines +183 to 201
def _write_user_failure_log(
content: List[str],
logs_directory: Path,
job_id: Any,
filename_suffix: str,
title: str,
) -> Optional[Path]:
"""
Writes the error messages in `invalid_path_messages` to a log file,
`{logs_directory}/user/failed_paths_{job_id}.txt`. The directory will be created if it doesn't
already exist.
:param invalid_path_messages:
Writes a user-oriented failure log to
`{logs_directory}/user/job_{job_id}_{filename_suffix}.txt`. The `{logs_directory}/user`
directory will be created if it does not already exist.

:param content:
:param logs_directory:
:param job_id:
:return: Path to the written log file or `None` if error is encountered.
:param filename_suffix:
:param title:
:return: Path to the written log file relative to `logs_directory`, or `None` on error.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Narrow job_id type; align with lint and intent.

job_id: Any trips ANN401 and weakens the API. Use an integer.

 def _write_user_failure_log(
-    content: List[str],
-    logs_directory: Path,
-    job_id: Any,
+    content: List[str],
+    logs_directory: Path,
+    job_id: int,
     filename_suffix: str,
     title: str,
 ) -> Optional[Path]:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _write_user_failure_log(
content: List[str],
logs_directory: Path,
job_id: Any,
filename_suffix: str,
title: str,
) -> Optional[Path]:
"""
Writes the error messages in `invalid_path_messages` to a log file,
`{logs_directory}/user/failed_paths_{job_id}.txt`. The directory will be created if it doesn't
already exist.
:param invalid_path_messages:
Writes a user-oriented failure log to
`{logs_directory}/user/job_{job_id}_{filename_suffix}.txt`. The `{logs_directory}/user`
directory will be created if it does not already exist.
:param content:
:param logs_directory:
:param job_id:
:return: Path to the written log file or `None` if error is encountered.
:param filename_suffix:
:param title:
:return: Path to the written log file relative to `logs_directory`, or `None` on error.
"""
def _write_user_failure_log(
content: List[str],
logs_directory: Path,
job_id: int,
filename_suffix: str,
title: str,
) -> Optional[Path]:
"""
Writes a user-oriented failure log to
`{logs_directory}/user/job_{job_id}_{filename_suffix}.txt`. The `{logs_directory}/user`
directory will be created if it does not already exist.
:param content:
:param logs_directory:
:param job_id:
:param filename_suffix:
:param title:
:return: Path to the written log file relative to `logs_directory`, or `None` on error.
"""
🧰 Tools
🪛 Ruff (0.14.2)

186-186: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)

🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 183 to 201, change the job_id parameter's type from Any to int to
satisfy ANN401 and reflect intent; update the function signature to job_id: int,
adjust any internal code that formats or concatenates job_id (use str(job_id) or
f-strings) if necessary, and update the docstring/type hints accordingly so the
API and typing are consistent.

Comment on lines +286 to 301
user_log_relative_path = _write_user_failure_log(
invalid_path_messages,
clp_config.logs_directory,
job_id,
filename_suffix="failed_paths",
title="Failed input paths log.",
)
if user_log_relative_path is None:
err_msg = "Failed to write user log for invalid input paths."
raise RuntimeError(err_msg)

error_msg = (
"At least one of your input paths could not be processed."
f" See the error log at '{user_log_relative_path}' inside your configured logs"
" directory (`logs_directory`) for more details."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t raise on log-write failure; fail job gracefully and continue.

Raising here aborts the scheduler loop (main returns -1), stalling all jobs. Update job status with a short fallback message instead, commit, and continue.

-                if user_log_relative_path is None:
-                    err_msg = "Failed to write user log for invalid input paths."
-                    raise RuntimeError(err_msg)
+                if user_log_relative_path is None:
+                    fallback_msg = (
+                        "At least one of your input paths could not be processed."
+                        " Additionally, writing a user error log failed. See the scheduler logs"
+                        " for details."
+                    )
+                    update_compression_job_metadata(
+                        db_cursor,
+                        job_id,
+                        {
+                            "status": CompressionJobStatus.FAILED,
+                            "status_msg": fallback_msg,
+                        },
+                    )
+                    db_conn.commit()
+                    continue

Based on learnings.

🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 286 to 301, the current code raises a RuntimeError when
_write_user_failure_log returns None which aborts the scheduler loop; instead,
replace the raise with logic that sets the job status to a graceful failure
using a short fallback message (e.g., "Could not write user failure log; see
scheduler logs for details"), persist/commit that status update to the
datastore, and then continue execution without throwing so the scheduler loop is
not interrupted.

Comment on lines +451 to +467
error_log_relative_path = _write_user_failure_log(
error_message,
logs_directory,
job_id,
filename_suffix="task_errors",
title="Compression task errors.",
)
if error_log_relative_path is None:
err_msg = "Failed to write user log for failed compression job."
raise RuntimeError(err_msg)

error_msg = (
"One or more compression tasks failed."
f" See the error log at '{error_log_relative_path}' inside your configured logs"
" directory (`logs_directory`) for more details."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid crashing the scheduler on task-error log-write failure.

Same concern as the invalid-path branch: don’t raise. Mark job FAILED with a compact message, commit, add to jobs_to_delete, and continue.

-            error_log_relative_path = _write_user_failure_log(
-                error_message,
+            error_log_relative_path = _write_user_failure_log(
+                error_messages,
                 logs_directory,
                 job_id,
                 filename_suffix="task_errors",
                 title="Compression task errors.",
             )
-            if error_log_relative_path is None:
-                err_msg = "Failed to write user log for failed compression job."
-                raise RuntimeError(err_msg)
+            if error_log_relative_path is None:
+                fallback_msg = (
+                    "One or more compression tasks failed."
+                    " Additionally, writing the user error log failed. See worker logs for details."
+                )
+                update_compression_job_metadata(
+                    db_cursor,
+                    job_id,
+                    dict(
+                        status=CompressionJobStatus.FAILED,
+                        status_msg=fallback_msg,
+                    ),
+                )
+                db_conn.commit()
+                jobs_to_delete.append(job_id)
+                continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
error_log_relative_path = _write_user_failure_log(
error_message,
logs_directory,
job_id,
filename_suffix="task_errors",
title="Compression task errors.",
)
if error_log_relative_path is None:
err_msg = "Failed to write user log for failed compression job."
raise RuntimeError(err_msg)
error_msg = (
"One or more compression tasks failed."
f" See the error log at '{error_log_relative_path}' inside your configured logs"
" directory (`logs_directory`) for more details."
)
error_log_relative_path = _write_user_failure_log(
error_messages,
logs_directory,
job_id,
filename_suffix="task_errors",
title="Compression task errors.",
)
if error_log_relative_path is None:
fallback_msg = (
"One or more compression tasks failed."
" Additionally, writing the user error log failed. See worker logs for details."
)
update_compression_job_metadata(
db_cursor,
job_id,
dict(
status=CompressionJobStatus.FAILED,
status_msg=fallback_msg,
),
)
db_conn.commit()
jobs_to_delete.append(job_id)
continue
error_msg = (
"One or more compression tasks failed."
f" See the error log at '{error_log_relative_path}' inside your configured logs"
" directory (`logs_directory`) for more details."
)
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 451 to 467, the code raises a RuntimeError when writing the
user-facing task-error log fails; instead follow the same non-crashing behavior
used for the invalid-path branch: do not raise, set the job state to FAILED with
a compact message noting the inability to write the user error log, commit the
job state change, append the job_id to jobs_to_delete, and continue execution;
ensure the compact message is concise, persist the update via the same commit
method used elsewhere, and keep control flow consistent with the surrounding
error-handling pattern.

kirkrodrigues
kirkrodrigues previously approved these changes Oct 28, 2025

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

fix(clp-package): Write compression task failure errors to a log, and store the log path (instead of the errors) in `status_msg` (fixes #716).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

435-437: Log exceptions with stack traces.

Use logger.exception to retain context.

-        except Exception as e:
-            logger.error(f"Error while getting results for job {job_id}: {e}")
+        except Exception:
+            logger.exception("Error while getting results for job %s", job_id)
             job_success = False
♻️ Duplicate comments (4)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)

183-189: Narrow job_id typing; avoid Any.

job_id is a DB identifier; type it as int to satisfy ANN401 and tighten the API.

 def _write_user_failure_log(
     title: str,
     content: List[str],
     logs_directory: Path,
-    job_id: Any,
+    job_id: int,
     filename_suffix: str,
 ) -> Optional[Path]:

206-219: Use logger.exception, catch narrower errors, and make timestamp TZ‑aware.

  • Catch OSError for filesystem ops instead of blind Exception (BLE001).
  • Use logger.exception(...) to include stack traces (TRY400).
  • Write UTC timestamps to avoid naive datetimes (DTZ005).
-    except Exception as e:
-        logger.error("Failed to create user logs directory: '%s' - %s", user_logs_dir, e)
+    except OSError:
+        logger.exception("Failed to create user logs directory: '%s'", user_logs_dir)
         return None

     log_path = logs_directory / relative_log_path
     try:
         with log_path.open("w", encoding="utf-8") as f:
-            timestamp = datetime.datetime.now().isoformat(timespec="seconds")
+            timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
             f.write(f"{title}\nGenerated at {timestamp}.\n\n")
             for item in content:
                 f.write(f"{item.rstrip()}\n")
-    except Exception as e:
-        logger.error("Failed to write compression failure user log: '%s' - %s", log_path, e)
+    except OSError:
+        logger.exception("Failed to write compression failure user log: '%s'", log_path)
         return None

286-301: Don’t raise when writing the user log fails; fail the job gracefully.

Throwing here aborts the scheduler loop. Persist a compact fallback message and continue.

-                if user_log_relative_path is None:
-                    err_msg = "Failed to write user log for invalid input paths."
-                    raise RuntimeError(err_msg)
+                if user_log_relative_path is None:
+                    fallback_msg = (
+                        "At least one of your input paths could not be processed."
+                        " Additionally, writing a user error log failed. See the scheduler logs"
+                        " for details."
+                    )
+                    update_compression_job_metadata(
+                        db_cursor,
+                        job_id,
+                        {
+                            "status": CompressionJobStatus.FAILED,
+                            "status_msg": fallback_msg,
+                        },
+                    )
+                    db_conn.commit()
+                    continue

451-460: Avoid crashing on task‑error log‑write failure; mark FAILED and continue.

Mirrors the invalid‑path branch; keeps the scheduler alive.

-            if error_log_relative_path is None:
-                err_msg = "Failed to write user log for failed compression job."
-                raise RuntimeError(err_msg)
+            if error_log_relative_path is None:
+                fallback_msg = (
+                    "One or more compression tasks failed."
+                    " Additionally, writing the user error log failed. See worker logs for details."
+                )
+                update_compression_job_metadata(
+                    db_cursor,
+                    job_id,
+                    dict(
+                        status=CompressionJobStatus.FAILED,
+                        status_msg=fallback_msg,
+                    ),
+                )
+                db_conn.commit()
+                jobs_to_delete.append(job_id)
+                continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e08c7b and 97a06f8.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (7 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
🪛 Ruff (0.14.2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

187-187: Dynamically typed expressions (typing.Any) are disallowed in job_id

(ANN401)


206-206: Do not catch blind exception: Exception

(BLE001)


207-207: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: datetime.datetime.now() called without a tz argument

(DTZ005)


217-217: Do not catch blind exception: Exception

(BLE001)


218-218: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

549-549: LGTM: correct propagation of logs_directory into poll_running_jobs.

Signature and call site are aligned; enables relative‑path log reporting.

@quinntaylormitchell quinntaylormitchell changed the title fix: Write failed compression log to /var/log/user and store log path in status_msg (fixes #716). fix(clp-package): Write compression task failure errors to a log, and store the log path (instead of the errors) in status_msg (fixes #716). Oct 28, 2025
@quinntaylormitchell quinntaylormitchell merged commit 61349e9 into y-scope:main Oct 28, 2025
19 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
… store the log path (instead of the errors) in `status_msg` (fixes y-scope#716). (y-scope#1425)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants