fix(clp-package): Add dataset to metadata database after input paths are processed for compression jobs (fixes #2091).#2092
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughDataset existence validation in search_and_schedule_new_tasks was moved from the function start to after per-job input path processing. The pre-fetch of Changes
Sequence Diagram(s)sequenceDiagram
participant Scheduler as Scheduler
participant Buffer as PathsBuffer
participant CLP as StorageEngine/CLP_S
participant DatasetSvc as DatasetService
participant Submitter as TaskSubmitter
Scheduler->>Buffer: enumerate & buffer input paths
Buffer->>Buffer: flush()
alt StorageEngine == CLP_S
Scheduler->>CLP: read job.clp_io_config.input.dataset
CLP->>DatasetSvc: _ensure_dataset_exists(dataset)
end
Scheduler->>Submitter: _batch_and_submit_tasks(...)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Ruff (0.15.6)components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.pyUnexpected Ruff issue shape at index 18 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py`:
- Around line 394-406: The code fetches the full dataset set inside each CLP_S
job block causing repeated DB reads; move the call to fetch_existing_datasets
out of the per-job branch so a single cached existing_datasets set is
initialized lazily once (before looping jobs when StorageEngine.CLP_S is
possible) and then reused; keep using _ensure_dataset_exists(clp_config,
db_context, table_prefix, dataset, existing_datasets) to update the cache in
place for cache misses, and retain table_prefix and dataset usage as-is
(symbols: StorageEngine.CLP_S, fetch_existing_datasets, existing_datasets,
_ensure_dataset_exists, clp_metadata_db_connection_config, clp_config,
db_context).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f46f1ad1-87cb-443a-b737-8afa0e30ff77
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
| existing_datasets: set[str] = fetch_existing_datasets( | ||
| db_context.cursor, clp_metadata_db_connection_config["table_prefix"] | ||
| ) |
There was a problem hiding this comment.
i agree with the rabbit's comment at https://github.com/y-scope/clp/pull/2092/changes#r2966540337
while not too expensive, fetch_existing_datasets() reads from the DB and should be avoided in the loop.
how about adding this before the for-loop instead
existing_datasets: set[str] = set()
if StorageEngine.CLP_S == clp_config.package.storage_engine:
existing_datasets = fetch_existing_datasets(...)i acknowledge there could be risk if the user deletes a dataset after we call fetch_existing_datasets() and this for job_row in jobs: run for long, but that's more of an issue in the dataset manager's implementation (we should invalidate / empty a dataset table than actually removing it to avoid such issues) than the scheduler's
There was a problem hiding this comment.
sure that's a good point. I will restore the section before the for-loop.
…are processed for compression jobs (fixes y-scope#2091). (y-scope#2092) Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Description
This PR addresses issue #2091 by calling
_ensure_dataset_exists()after input paths are processed for a compression job, not before.Note: My only concern with this implementation is that this fix only protects against path-processing failures. Datasets will still be added to the metadata database if the compression job fails in the core. I tested the idea of calling
_ensure_dataset_exists()from_complete_compression_job()instead, and while that did strictly fix the issue (in the sense that failed jobs were no longer adding their datasets to the metadata database), it also broke all compression, because compression jobs need the dataset to exist in the metadata database before the compression job starts.Checklist
breaking change.
Validation performed
Ran the replication steps described in issue #2091; datasets are only added to the metadata database if the paths in the compression command are valid.
Summary by CodeRabbit