
[data] Add back _op_task_duration_stats for Issue Detection #56958

Merged
alexeykudinkin merged 7 commits into ray-project:master from
iamjustinhsu:jhsu/fix-issue-detection-metrics
Oct 6, 2025

Conversation

Contributor

@iamjustinhsu iamjustinhsu commented Sep 26, 2025

Why are these changes needed?

_op_task_duration_stats was removed by accident in https://github.com/ray-project/ray/pull/55429/files, but it is needed for issue detection. This PR adds it back, along with a test to guard against regressions.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Restore per-operator task duration tracking for issue detection and add a test validating hanging detector behavior with adaptive thresholds.

  • Runtime metrics:
    • Record task duration on completion in op_runtime_metrics.on_task_finished via _op_task_duration_stats.add_duration(...) for issue detection.
  • Tests:
    • Add test_hanging_detector_detects_issues in python/ray/data/tests/test_issue_detection_manager.py to verify hanging detection under different thresholds using HangingExecutionIssueDetectorConfig and log capture.
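
As a rough illustration of the runtime-metrics bullet above, here is a minimal sketch of recording a task's duration when it finishes. Only `on_task_finished`, `task_completion_time`, `task_time_delta`, and `_op_task_duration_stats.add_duration(...)` come from this PR; the `DurationStats` and `OpMetricsSketch` classes, the `on_task_submitted` signature, and the injectable `clock` are hypothetical stand-ins and not Ray's actual classes.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DurationStats:
    """Hypothetical stand-in for the object behind _op_task_duration_stats."""

    durations: List[float] = field(default_factory=list)

    def add_duration(self, duration: float) -> None:
        self.durations.append(duration)


class OpMetricsSketch:
    """Hypothetical, simplified per-operator metrics holder (not Ray's class)."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self._start_times: Dict[int, float] = {}
        self.task_completion_time = 0.0
        self._op_task_duration_stats = DurationStats()

    def on_task_submitted(self, task_index: int) -> None:
        # Remember when the task started so its duration can be computed later.
        self._start_times[task_index] = self._clock()

    def on_task_finished(self, task_index: int) -> float:
        task_time_delta = self._clock() - self._start_times.pop(task_index)
        self.task_completion_time += task_time_delta
        # This is the call the PR restores: without it, issue detection
        # never sees any per-task durations.
        self._op_task_duration_stats.add_duration(task_time_delta)
        return task_time_delta
```

Injecting the clock is only a convenience for deterministic tests; it is not something the PR describes.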

Written by Cursor Bugbot for commit c016cf2. This will update automatically on new commits. Configure here.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 26, 2025 16:21
@iamjustinhsu iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Sep 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly restores the call to _op_task_duration_stats.add_duration(), which is essential for the hanging task issue detection mechanism. The change is simple and effective. I have one minor suggestion to make the code comment more specific for better clarity.

@ray-gardener ray-gardener bot added the "data" (Ray Data-related issues) and "observability" (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Sep 26, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
self.task_completion_time += task_time_delta

# NOTE: This is used for Issue Detection
self._op_task_duration_stats.add_duration(task_time_delta)
Contributor

@iamjustinhsu didn't you add new metrics?

Contributor Author

I removed this metric because I thought it was only tracked for mean task completion time; I didn't realize it was also used for issue detection. Issue detection relies on the standard deviation to identify long-running tasks (outliers), so I added it back. Now _op_task_duration_stats is used only for issue detection.
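
To make the standard-deviation point concrete, here is a minimal sketch of how an outlier check over task durations could work. It keeps a running mean and sample standard deviation with Welford's algorithm and flags a task whose elapsed time exceeds the mean by a few standard deviations. The class name, the `is_possibly_hanging` helper, and the `num_stddevs` and `min_samples` parameters are assumptions for illustration; the actual detector and its thresholds in Ray Data may differ.

```python
import math


class RunningDurationStats:
    """Hypothetical running mean/stddev tracker (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the running mean.

    def add_duration(self, duration: float) -> None:
        self.count += 1
        delta = duration - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (duration - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; undefined for fewer than two samples.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def is_possibly_hanging(
    stats: RunningDurationStats,
    elapsed: float,
    num_stddevs: float = 3.0,  # Assumed threshold, not Ray's default.
    min_samples: int = 10,  # Assumed warm-up size, not Ray's default.
) -> bool:
    """Flag a running task whose elapsed time is an outlier."""
    if stats.count < min_samples:
        # Too few completed tasks to estimate a meaningful threshold.
        return False
    return elapsed > stats.mean + num_stddevs * stats.stddev
```

This also shows why the regression mattered: if `add_duration` is never called, `count` stays at zero and a check of this shape can never fire.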

Contributor

@alexeykudinkin alexeykudinkin Oct 6, 2025

Ok, landing this to unblock, but we should unify these metrics (create a ticket)

@alexeykudinkin alexeykudinkin merged commit 3a2dcfe into ray-project:master Oct 6, 2025
6 checks passed
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
Signed-off-by: Josh Kodi <joshkodi@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)

3 participants