[data] Add back _op_task_duration_stats for Issue Detection#56958
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request correctly restores the call to _op_task_duration_stats.add_duration(), which is essential for the hanging task issue detection mechanism. The change is simple and effective. I have one minor suggestion to make the code comment more specific for better clarity.
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…/fix-issue-detection-metrics
| self.task_completion_time += task_time_delta | ||
|
|
||
| # NOTE: This is used for Issue Detection | ||
| self._op_task_duration_stats.add_duration(task_time_delta) |
There was a problem hiding this comment.
I removed this metric because I thought it was only being tracked for task completion mean time, however, I didn't realize it was also being used for issue detection. Issue detection relies on the stdev to calculate long running tasks (outliers), so I added it back. Now _op_task_duration_stats is only used for issue detection
There was a problem hiding this comment.
Ok, landing this to unblock, but we should unify these metrics (create a ticket)
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ect#56958) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds. > > - **Runtime metrics**: > - Record task duration on completion in `op_runtime_metrics.on_task_finished` via `_op_task_duration_stats.add_duration(...)` for issue detection. > - **Tests**: > - Add `test_hanging_detector_detects_issues` in `python/ray/data/tests/test_issue_detection_manager.py` to verify hanging detection under different thresholds using `HangingExecutionIssueDetectorConfig` and log capture. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c016cf2. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
deprecated by accident in https://github.com/ray-project/ray/pull/55429/files, needed for issue detection. Added a test to not regress
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.Note
Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds.
op_runtime_metrics.on_task_finishedvia_op_task_duration_stats.add_duration(...)for issue detection.test_hanging_detector_detects_issuesinpython/ray/data/tests/test_issue_detection_manager.pyto verify hanging detection under different thresholds usingHangingExecutionIssueDetectorConfigand log capture.Written by Cursor Bugbot for commit c016cf2. This will update automatically on new commits. Configure here.