[Data] Remove unstable `image_classification_chaos_no_scale_back` release test by bveeramani · Pull Request #58048 · ray-project/ray

bveeramani · 2025-10-23T17:03:34Z

Summary

This PR removes the image_classification_chaos_no_scale_back release test and its associated setup script (setup_cluster_compute_config_updater.py). This test has become non-functional and is no longer providing useful signal.

Background

The image_classification_chaos_no_scale_back release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time.

The test worked by:

Running on an autoscaling cluster with 1-10 nodes
Updating the compute config mid-test to downscale to 5 nodes
Asserting that there are dead nodes as a sanity check

Why This Test Is Broken

After the removal of Parquet metadata fetching in #56105 (September 2, 2025), the autoscaling behavior changed significantly:

Before metadata removal: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect.
After metadata removal: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail.

Why We're Removing It

Test is fundamentally broken: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal
Not actively monitored: The test is marked as unstable and isn't closely watched

Changes

Removed image_classification_chaos_no_scale_back test from release/release_data_tests.yaml
Deleted release/nightly_tests/setup_cluster_compute_config_updater.py (only used by this test)

Code Review

This pull request removes the unstable image_classification_chaos_no_scale_back release test and its associated setup script. The justification for this removal is well-explained in the pull request description, citing that the test is broken due to changes in autoscaling behavior and is no longer providing a useful signal. The changes are straightforward and correctly implement the removal. I've added one comment on the deleted script regarding a hardcoded URL as a note for future reference.

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: xgui <xgui@anyscale.com>

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Future-Outlier <eric901201@gmail.com>

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

…ease test (ray-project#58048) ## Summary This PR removes the `image_classification_chaos_no_scale_back` release test and its associated setup script (`setup_cluster_compute_config_updater.py`). This test has become non-functional and is no longer providing useful signal. ## Background The `image_classification_chaos_no_scale_back` release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time. The test worked by: 1. Running on an autoscaling cluster with 1-10 nodes 2. Updating the compute config mid-test to downscale to 5 nodes 3. Asserting that there are dead nodes as a sanity check ## Why This Test Is Broken After the removal of Parquet metadata fetching in ray-project#56105 (September 2, 2025), the autoscaling behavior changed significantly: - **Before metadata removal**: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect. - **After metadata removal**: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. This means there are no dead nodes to detect, causing the test to fail. ## Why We're Removing It 1. **Test is fundamentally broken**: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal 2. **Not actively monitored**: The test is marked as unstable and isn't closely watched ## Changes - Removed `image_classification_chaos_no_scale_back` test from `release/release_data_tests.yaml` - Deleted `release/nightly_tests/setup_cluster_compute_config_updater.py` (only used by this test) ## Related See ray-project#56105 Fixes ray-project#56528 Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: peterxcli <peterxcli@gmail.com>

bveeramani changed the title ~~[Data] Remove unstable image_classification_chaos_no_scale_back release test~~ [Data] Remove unstable image_classification_chaos_no_scale_back release test Oct 23, 2025

gemini-code-assist bot reviewed Oct 23, 2025

View reviewed changes

bveeramani assigned iamjustinhsu Oct 23, 2025

iamjustinhsu approved these changes Oct 23, 2025

View reviewed changes

bveeramani enabled auto-merge (squash) October 23, 2025 17:15

github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 23, 2025

bveeramani merged commit b8924aa into master Oct 23, 2025
7 checks passed

bveeramani deleted the remove-no-scale-back branch October 23, 2025 17:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Remove unstable `image_classification_chaos_no_scale_back` release test#58048

[Data] Remove unstable `image_classification_chaos_no_scale_back` release test#58048
bveeramani merged 1 commit intomasterfrom
remove-no-scale-back

bveeramani commented Oct 23, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bveeramani commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background

Why This Test Is Broken

Why We're Removing It

Changes

Related

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bveeramani commented Oct 23, 2025 •

edited

Loading