Skip to content

fix(package): Fail job and report failure to user for compression jobs that encounter at least one invalid input path (fixes #308).#1125

Merged
kirkrodrigues merged 17 commits into
y-scope:mainfrom
quinntaylormitchell:fix-308
Sep 23, 2025

Conversation

@quinntaylormitchell

@quinntaylormitchell quinntaylormitchell commented Jul 25, 2025

Copy link
Copy Markdown
Collaborator

Description

This PR fixes issue #308 by changing the reporting system that provides the user with information about the status of their compression jobs.

Any compression job that encounters at least one invalid path will be failed. The user will be notified that at least one invalid path has been encountered, and they will be directed to a log file that communicates all these invalid paths.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Compressed three arrangements of log files:

  1. one real, one fake
  2. one real, two fake
  3. two fake

The correct log file was produced in all cases, and the user was notified correctly in all cases.

Summary by CodeRabbit

  • New Features

    • Per-job user-facing log created for invalid filesystem paths; job status includes a link to the log when available.
    • Scheduling now collects and surfaces path validation details to users.
  • Bug Fixes

    • Stronger validation of top-level and nested input paths to prevent partial scheduling when any path is invalid.
    • Clearer failure messages for S3 issues and unsupported inputs.
    • Jobs that produce no tasks are no longer auto-marked as failed.

@coderabbitai

coderabbitai Bot commented Jul 25, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

The scheduler’s filesystem input processing now validates and buffers paths while collecting invalid-path messages. A new helper writes these messages to a per-job user log. The main search-and-schedule flow handles FS invalid paths, writes the log, marks jobs FAILED, broadens S3 error handling, and adjusts buffer flush/sequence.

Changes

Cohort / File(s) Summary
FS input validation return + gating
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
_process_fs_input_paths(fs_input_conf, paths_to_compress_buffer) -> List[str] now returns a List[str] of invalid-path messages; introduces paths_ok gating to stop buffering after first invalid path and accumulates invalid internal-path errors; docstring and typing updated.
User-facing failed path log helper
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Adds _write_failed_path_log(invalid_path_messages: List[str], logs_directory: Path, job_id: Any) -> Optional[Path] to write invalid-path messages to logs_directory/user/failed_paths_<job_id>.txt, returning the path or None on error.
Scheduler error handling and flow
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
search_and_schedule_new_tasks collects invalid messages from FS processing, writes user log if present, updates job status to FAILED with a status_msg including the log path (or scheduler logs path if write failed), commits and continues. S3 processing retains broad try/except and marks FAILED on exception with “S3 Failure: {err}”. Buffer flush and task/partition derivation ordering adjusted; prior auto-FAIL on zero tasks removed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant S as Scheduler
  participant FS as _process_fs_input_paths
  participant L as _write_failed_path_log
  participant J as JobStore

  rect rgba(230,245,255,0.6)
  note over S: FS input handling (updated)
  S->>FS: Validate and buffer FS paths
  FS-->>S: invalid_path_messages (List[str])
  alt invalid_path_messages not empty
    S->>L: Write failed paths log (messages, logs_dir, job_id)
    L-->>S: log_path or None
    S->>J: Update job -> FAILED with status_msg (log_path or scheduler logs path)
    S->>J: Commit
    S-->>S: Continue to next job
  else no invalid messages
    S->>S: Flush buffer, collect tasks/partitions
    note over S: Proceed with scheduling flow
  end
  end
Loading
sequenceDiagram
  autonumber
  participant S as Scheduler
  participant S3 as _process_s3_input
  participant J as JobStore

  rect rgba(255,240,230,0.6)
  note over S: S3 input handling (broadened)
  S->>S3: Process S3 input
  alt success
    S-->>S: Continue normal scheduling
  else exception
    S3-->>S: throw Error
    S->>J: Update job -> FAILED with "S3 Failure: {err}"
    S->>J: Commit
    S-->>S: Continue to next job
  end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The title succinctly and accurately captures the PR's primary behaviour change — failing compression jobs and notifying the user when an invalid filesystem input path is encountered — and includes the issue reference. It is concise, specific, and directly related to the modifications in compression_scheduler that collect invalid-path messages, write a per-job failure log, and set job status to FAILED. The conventional prefix "fix(package):" is appropriate for this change.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@quinntaylormitchell

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jul 25, 2025

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@quinntaylormitchell quinntaylormitchell marked this pull request as ready for review July 25, 2025 19:17
@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner July 25, 2025 19:17
@haiqi96

haiqi96 commented Jul 28, 2025

Copy link
Copy Markdown
Contributor

Quick question: what would appear on webui if a job partially fail?

Edit: I guess this is irrelevant now since a job will either pass or fail.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20aef3e and cfd13a9.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: haiqi96
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
Learnt from: gibber9809
PR: y-scope/clp#504
File: components/core/src/clp_s/search/kql/CMakeLists.txt:29-29
Timestamp: 2024-10-22T15:36:04.655Z
Learning: When reviewing pull requests, focus on the changes within the PR and avoid commenting on issues outside the scope of the PR.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: in the clp-package compression flow, path validation and error handling is performed at the schedule...
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: in `fs_compression_task.py`, when refactoring the archive processing loop in python, the `src_archiv...
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)

100-100: Inconsistency between PR objectives and implementation.

The PR objectives state the goal is to "report all partial failures during otherwise-successful compression jobs" and mention adding an append parameter to update_compression_job_metadata() to concatenate error messages. However, the implementation now fails entire jobs when any path is invalid, contradicting the stated objective of allowing partial failures.

Likely an incorrect or invalid review comment.


127-127: Consistent error handling but may be overly aggressive.

Similar to line 112, this raises a ValueError for invalid internal paths within directories, causing the entire job to fail. While consistent with the top-level path handling, this approach prevents processing any valid paths when even one invalid path is encountered.

⛔ Skipped due to learnings
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-03T14:47:22.293Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:256-258
Timestamp: 2024-10-08T15:52:50.753Z
Learning: Error handling for file path validation is performed in the `JsonParser` constructor.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:654-656
Timestamp: 2024-10-07T21:18:02.510Z
Learning: In the `parse_kv_log_event` function, catch-all exception handlers are acceptable because exceptions are intentionally rethrown to be handled gracefully at a higher level.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

210-223: Verify partial-failure behaviour against PR objectives and issue #308

The current logic aborts the entire compression job on any invalid input path, but the PR’s stated goal is to record individual path failures and allow the overall job to succeed. We need to confirm the intended semantics before merging.

Please:

  • Review components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (lines 210–223) to ensure invalid paths are logged and marked as failed, without terminating the whole job.
  • Consult the CLP issue #308 (internal tracker) or related PR documentation to clarify whether “partial success” behaviour was specified.
  • Update the error-handling logic so that:
    • Each bad path is reported in job metadata.
    • The job status reflects a partial-success state rather than outright failure, if that aligns with the intended fix.

112-112: [file content truncated for brevity]
Let's focus on the function that handles compression scheduling. We'll search for def schedule_compression or similar.

@quinntaylormitchell quinntaylormitchell changed the title fix: Report all partial failures during otherwise-successful compression jobs (fixes #308) fix: Immediately fail any compression job that encounters an invalid input path (fixes #308) Jul 31, 2025
Comment on lines +110 to +112
except ValueError as ex:
logger.error(str(ex))
continue
raise ValueError(str(ex))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are duplicating the try-catch in both this method and it's caller, which is unnecessary. In addition, if we are going to log an exception, we can use logger.exception which will print some fancy stack trace for you.

@kirkrodrigues
Not an expert of error handling so want to get your opinion. I feel it's not a good practice to catch an error and re-raise it. We can go either

  1. Remove the try-except block in _process_fs_input_paths so it will throw the exception, and let caller handle the exception with try-catch. What Quinn currently have in the caller code should work.
  2. Update _process_fs_input_paths run try-catch, but return a True/False to indicate success or failure

I feel we should go with 2? but 1 isn't that bad either imo.

@haiqi96 haiqi96 Aug 1, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, now I realized we also need to report the path that causes the error. which gives a good reason to re-raise with a different error message.

Can you raise an error message saying something like "blabla file is invalid blabla" and store that as our status message? (it should be within the 512 char limit if we only report the first file that causes the error)

@kirkrodrigues wonder if you see any better way to propagate the error message? Alternatively, the method can return a tuple of [Bool, Optional[str]] (success_flag, failure_flag), but I don't feel it is any better.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it so that the first invalid file path is now reported; @kirkrodrigues let us know if you see a better way to propagate the error message.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I think about it again, we don't really need to catch in _process_fs_input_paths since we are just rethrowing the error?
we can simply let caller at line 211 to catch the exception (and I think you don't even need to change the caller try except)

I will ask Kirk to confirm

…ess/compression_scheduler.py

Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
Comment on lines +110 to +112
except ValueError as ex:
logger.error(str(ex))
continue
raise ValueError(str(ex))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I think about it again, we don't really need to catch in _process_fs_input_paths since we are just rethrowing the error?
we can simply let caller at line 211 to catch the exception (and I think you don't even need to change the caller try except)

I will ask Kirk to confirm

@quinntaylormitchell

Copy link
Copy Markdown
Collaborator Author

@haiqi96 and @kirkrodrigues, here are my thoughts on what I should change re. catching the error:

I could omit logger.error(str(ex)) from _process_fs_input_paths, and then it would be logged only once (at the caller around line 211)

OR

I could omit logger.exception(error_msg) (at the caller around line 211) and then it would still be logged only once (in process_fs_input_paths).

Not sure about the whole "double-raising" thing; I don't know how else to pass the error message from validate_path_and_get_info all the way to the caller around line 211.

@kirkrodrigues

Copy link
Copy Markdown
Member

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).

Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

@haiqi96

haiqi96 commented Aug 5, 2025

Copy link
Copy Markdown
Contributor

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).

Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).

Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

I am not against it. in that case, the _process_fs_input_paths method needs to somehow let caller knows what path has failed. I have the following proposal:

  1. _process_fs_input_paths() shall return a List[str] of failing path. an empty list means no failure. Alternatively, it can return a Tuple of (success, failing_path): [bool, List[str]]
  2. The caller checks if _process_fs_input_paths succeed. if there's any failing path:
    • It will locally generate a log file, where each line of the log file is a failing path.
    • it shall mark the job as failed without proceeding, with a short error message stating that (1) the job failed because of invalid input path(s), (2), point the user to the generated log file on the disk

@kirkrodrigues any comment?

@kirkrodrigues

Copy link
Copy Markdown
Member

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).
Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).
Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

I am not against it. in that case, the _process_fs_input_paths method needs to somehow let caller knows what path has failed. I have the following proposal:

1. _process_fs_input_paths() shall return a List[str] of failing path. an empty list means no failure. Alternatively, it can return a Tuple of (success, failing_path): [bool, List[str]]

2. The caller checks if _process_fs_input_paths succeed. if there's any failing path:
   
   * It will locally generate a log file, where each line of the log file is a failing path.
   * it shall mark the job as failed without proceeding, with a short error message stating that (1) the job failed because of invalid input path(s), (2), point the user to the generated log file on the disk

@kirkrodrigues any comment?

Sure

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

267-278: Revise redundant/ambiguous failure branch when no tasks are created.

Given early failure on invalid paths, this branch likely means “no compressible inputs found” (e.g., only empty dirs). Update the message for clarity, or remove the branch if it’s truly unreachable.

-                    "status_msg": "invalid input path",
+                    "status_msg": "No compressible inputs found",
♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

114-119: Avoid catch-and-rethrow; preserve original traceback and remove duplicate logging.

Catching ValueError, logging, and re-raising a new ValueError loses the original traceback and duplicates logs. Let the original exception propagate; the caller already logs with logger.exception. This mirrors prior feedback.

-        except ValueError as ex:
-            logger.error(str(ex))
-            raise ValueError(str(ex))
+        except ValueError:
+            # Let caller log and handle; preserve original traceback.
+            raise

127-134: Same as above for internal paths.

Propagate the original ValueError without re-wrapping and avoid duplicate logging.

-                except ValueError as ex:
-                    logger.error(str(ex))
-                    raise ValueError(str(ex))
+                except ValueError:
+                    raise
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d60b023 and f8cfae2.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)

@quinntaylormitchell quinntaylormitchell changed the title fix: Immediately fail any compression job that encounters an invalid input path (fixes #308) fix: For compression jobs that encounter at least one invalid input path, report path invalidity to user (fixes #308) Sep 2, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

128-137: Also collect invalid internal paths from directory traversal.

Same reasoning as above; don't just flip a flag.

Apply:

-                except ValueError as ex:
-                    logger.error(str(ex))
-                    paths_ok = False
+                except ValueError as ex:
+                    logger.error(str(ex))
+                    invalid_paths.append(str(internal_path))
                     continue

241-254: Bound status_msg to column size; preserve details.

Truncate to avoid DB errors and keep the salient part.

Apply:

-                update_compression_job_metadata(
+                update_compression_job_metadata(
                     db_cursor,
                     job_id,
                     {
                         "status": CompressionJobStatus.FAILED,
-                        "status_msg": f"S3 Failure: {err}",
+                        "status_msg": f"S3 Failure: {err}"[:512],
                     },
                 )
♻️ Duplicate comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

224-241: Fail the job early with a per-job invalid-paths log; don't proceed to schedule tasks.

Current code only sets a note and continues, which contradicts the PR objective and the reviewers’ consensus to fail on invalid inputs and point users to a generated log.

Apply:

-        if input_type == InputType.FS.value:
-            paths_ok = _process_fs_input_paths(input_config, paths_to_compress_buffer)
-            if not paths_ok:
-                error_msg = (
-                    "NOTE: At least one input path could not be processed. "
-                    "Check the compression_scheduler log for more details."
-                )
-                update_compression_job_metadata(
-                    db_cursor,
-                    job_id,
-                    {
-                        "status_msg": error_msg,
-                    },
-                )
-                db_conn.commit()
+        if input_type == InputType.FS.value:
+            invalid_paths = _process_fs_input_paths(input_config, paths_to_compress_buffer)
+            if invalid_paths:
+                logs_dir = Path(os.getenv("CLP_LOGS_DIR", "."))
+                invalid_log = logs_dir / f"compression_job_{job_id}_invalid_paths.txt"
+                try:
+                    invalid_log.write_text("\n".join(invalid_paths), encoding="utf-8")
+                except Exception:
+                    logger.exception("Failed to write invalid paths log")
+                error_msg = f"Invalid input path(s). See {invalid_log.name}."
+                update_compression_job_metadata(
+                    db_cursor,
+                    job_id,
+                    {
+                        "status": CompressionJobStatus.FAILED,
+                        "status_msg": error_msg[:512],
+                    },
+                )
+                db_conn.commit()
+                continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f8cfae2 and d12fd3c.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/compression.py (2)
  • validate_path_and_get_info (94-128)
  • add_file (20-25)
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (2)
  • add_file (52-57)
  • add_empty_directory (59-62)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

272-281: Message clarity LGTM; keep as a safety net.

“All provided paths were invalid.” is clear. If you adopt early-fail on first invalid detection, this block still protects against edge cases where none yielded tasks but no explicit invalids were captured.


144-145: Return the list of invalid paths.

Apply:

-    return paths_ok
+    return invalid_paths

Likely an incorrect or invalid review comment.

Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
@quinntaylormitchell

Copy link
Copy Markdown
Collaborator Author

@haiqi96 and @kirkrodrigues, I’ve pushed a new implementation. The core elements of the new strategy are a combination of your proposals, as follows:

  1. Don’t fail compression jobs upon encountering an invalid path; complete the compression job if there is at least one valid path
  2. Log the failing paths somewhere
  3. Inform the user to check the logs for which paths failed

That being said, I felt that part of the solution that Haiqi proposed, i.e.

_process_fs_input_paths() shall return a List[str] of failing path

could instead be fulfilled by the logging infrastructure already used in this function (the failed path is already being recorded to the compression_scheduler log, which can be viewed by the user). With that in mind, it makes more sense to me to return bool from _process_fs_input_paths; if we are simply reporting to the user that at least one path failed, and telling them to check the logs for the failed paths, then the caller of _process_fs_input_paths doesn’t need to know the path(s) that failed, as the failing path(s) have already been logged in a way that the user can access.

Let me know what you think.

@haiqi96

haiqi96 commented Sep 2, 2025

Copy link
Copy Markdown
Contributor

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).
Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

Sorry, just getting to this now. I feel failing on the first invalid path is a bit harsh since a user might be compressing a directory with files that change or get deleted (e.g., files in /var/log during log rotation).
Ideally, we should report a partial error with the job and what paths failed, so that the user could retry just those paths if they like (and maybe one day we can even support retrying the files at the click of a button). However, before then, I think a reasonable compromise is to log the failing paths and inform the user to check the logs for which paths failed. What do y'all think?

I am not against it. in that case, the _process_fs_input_paths method needs to somehow let caller knows what path has failed. I have the following proposal:

1. _process_fs_input_paths() shall return a List[str] of failing path. an empty list means no failure. Alternatively, it can return a Tuple of (success, failing_path): [bool, List[str]]

2. The caller checks if _process_fs_input_paths succeed. if there's any failing path:
   
   * It will locally generate a log file, where each line of the log file is a failing path.
   * it shall mark the job as failed without proceeding, with a short error message stating that (1) the job failed because of invalid input path(s), (2), point the user to the generated log file on the disk

@kirkrodrigues any comment?

Sure

@haiqi96 and @kirkrodrigues, I’ve pushed a new implementation. The core elements of the new strategy are a combination of your proposals, as follows:

  1. Don’t fail compression jobs upon encountering an invalid path; complete the compression job if there is at least one valid path
  2. Log the failing paths somewhere
  3. Inform the user to check the logs for which paths failed

That being said, I felt that part of the solution that Haiqi proposed, i.e.

_process_fs_input_paths() shall return a List[str] of failing path

could instead be fulfilled by the logging infrastructure already used in this function (the failed path is already being recorded to the compression_scheduler log, which can be viewed by the user). With that in mind, it makes more sense to me to return bool from _process_fs_input_paths; if we are simply reporting to the user that at least one path failed, and telling them to check the logs for the failed paths, then the caller of _process_fs_input_paths doesn’t need to know the path(s) that failed, as the failing path(s) have already been logged in a way that the user can access.

Let me know what you think.

I don't agree with the approach of logging the paths in compression_scheduer.

  1. It's easier to point user to a specific path, rather than saying "check compression_scheduer log" since most users (and even some people in our team who don't use package frequently) don't know where is compression_scheduer's log.
  2. Imagine if the compression scheduler has been running for a while and logged many different things, it would be hard for user to find the exact portion of logs that he/she should read.

Because of the points above, it would makes sense to generate a separate log file for clear/efficient error reporting.

Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
@quinntaylormitchell

Copy link
Copy Markdown
Collaborator Author

most users (and even some people in our team who don't use package frequently) don't know where is compression_scheduer's log

@haiqi96 that's true. But you also mention:

It's easier to point user to a specific path

... I could rig it so that it prints out the specific path of the compression_scheduler log for the user?

As for this point:

it would be hard for user to find the exact portion of logs that he/she should read

that's fair.

@haiqi96, I guess my objection to your proposal that "_process_fs_input_paths() shall return a List[str]" lies in the fact that the paths are already being logged in the compression_scheduler logs; my approach just directs the user there.

Also, see my comment re. the "immediate failure" vs "continue compression" discussion.

@quinntaylormitchell quinntaylormitchell changed the title fix: For compression jobs that encounter at least one invalid input path, report path invalidity to user (fixes #308) fix: For compression jobs that encounter at least one invalid input path, fail job and report failure to user (fixes #308) Sep 15, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9edda and 0bc05f5.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/compression.py (2)
  • validate_path_and_get_info (94-128)
  • add_file (20-25)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

121-126: Confirm whether validate_path_and_get_info already logs; avoid duplicate error entries.

If validate_path_and_get_info doesn’t log, keeping logger.error(error_msg) here is correct. If it does, this will double-log per invalid path. Please confirm and adjust.

Also applies to: 141-145

Comment on lines +324 to +335
# if len(tasks) == 0:
# logger.warning(f"No tasks were created for job {job_id}")
# update_compression_job_metadata(
# db_cursor,
# job_id,
# {
# "status": CompressionJobStatus.FAILED,
# "status_msg": "All provided paths were invalid.",
# },
# )
# db_conn.commit()
# continue

@quinntaylormitchell quinntaylormitchell Sep 15, 2025

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kirkrodrigues do you think that this commented section can be deleted? I have confirmed (best that I can) that this if statement will only be entered if all paths are invalid (there are no other conditions that would cause len(tasks) to be 0)

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

fix(package): Fail job and report failure to user for compression jobs that encounter at least one invalid input path (fixes #308).

@quinntaylormitchell can you re-test before I approve?

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0bc05f5 and 7016554.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
PR: y-scope/clp#1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
components/clp-py-utils/clp_py_utils/compression.py (2)
  • validate_path_and_get_info (94-128)
  • add_file (20-25)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)
  • CompressionJobStatus (27-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)

9-9: Typed imports expansion looks good.

This aligns with the new return types and helper signature.


111-151: Path gating + full error collection is consistent with fail‑on‑any‑invalid.

The paths_ok gate prevents unnecessary buffering once failure is inevitable, while still enumerating all invalid entries for the per‑job log. Good trade‑off.


175-205: Don’t silently swallow log‑write failures; emit warnings and dedupe messages.

If directory/file creation fails, we return None without any trace. Log at warning level and dedupe to keep user logs concise.

 def _write_failed_path_log(
     invalid_path_messages: List[str], logs_directory: Path, job_id: Any
 ) -> Optional[Path]:
@@
-    try:
-        user_logs_dir.mkdir(parents=True, exist_ok=True)
-    except Exception:
-        return None
+    try:
+        user_logs_dir.mkdir(parents=True, exist_ok=True)
+    except Exception as e:
+        logger.warning(f"Failed to create user logs dir {user_logs_dir}: {e}")
+        return None
@@
-    try:
-        with log_path.open("w", encoding="utf-8") as f:
+    try:
+        with log_path.open("w", encoding="utf-8") as f:
             timestamp = datetime.datetime.now().isoformat(timespec="seconds")
             f.write(f"Failed input paths log.\nGenerated at {timestamp}.\n\n")
-            for msg in invalid_path_messages:
+            for msg in dict.fromkeys(invalid_path_messages):  # preserve order; remove dups
                 f.write(f"{msg.rstrip()}\n")
-    except Exception:
-        return None
+    except Exception as e:
+        logger.warning(f"Failed to write user log {log_path}: {e}")
+        return None

265-289: Point users to the concrete scheduler log file in the fallback path.

The current message references a directory; link the actual file for better discoverability.

-                if user_log_path is None:
-                    error_msg = base_msg + (
-                        f" Check the compression scheduler logs in {clp_config.logs_directory} for"
-                        " more details."
-                    )
+                if user_log_path is None:
+                    scheduler_log = (Path(os.getenv("CLP_LOGS_DIR", str(clp_config.logs_directory)))
+                                     / "compression_scheduler.log")
+                    error_msg = f"{base_msg} Check {scheduler_log} for more details."

@quinntaylormitchell quinntaylormitchell changed the title fix: For compression jobs that encounter at least one invalid input path, fail job and report failure to user (fixes #308) fix(package): Fail job and report failure to user for compression jobs that encounter at least one invalid input path (fixes #308). Sep 23, 2025
@kirkrodrigues kirkrodrigues merged commit 774a106 into y-scope:main Sep 23, 2025
19 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…s that encounter at least one invalid input path (fixes y-scope#308). (y-scope#1125)

Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants