[Serve] Fix flaky test_router_queue_len_metric #60333
Merged
aslonnie merged 5 commits into ray-project:master on Jan 21, 2026
The test was flaky because the router queue length gauge has a 100ms throttle (RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S) that can skip updates when they happen too quickly. When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.

Fix by adding a fixture that disables throttling for this test.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
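The race described in this commit message can be reproduced in isolation. The following is a minimal sketch, not Ray Serve's actual gauge implementation: `ThrottledGauge`, `FakeClock`, and `simulate` are hypothetical names introduced for illustration, and the 10 ms gap between the two updates is an assumed value.

```python
import time


class ThrottledGauge:
    """Toy gauge (hypothetical) that drops updates arriving too soon after the last accepted one."""

    def __init__(self, throttle_s, clock=time.monotonic):
        self.throttle_s = throttle_s
        self._clock = clock
        self._last_update = None  # time of the last accepted update
        self.value = None

    def set(self, value):
        now = self._clock()
        too_soon = (
            self._last_update is not None
            and now - self._last_update < self.throttle_s
        )
        if too_soon:
            return False  # update skipped by the throttle
        self.value = value
        self._last_update = now
        return True


class FakeClock:
    """Deterministic clock so the sketch does not depend on real timing."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate(throttle_s, gap_s=0.01):
    """Replica init sets the gauge to 0; a request sets it to 1 after gap_s (assumed 10 ms)."""
    clock = FakeClock()
    gauge = ThrottledGauge(throttle_s, clock=clock)
    gauge.set(0)           # replica initialization
    clock.advance(gap_s)   # request arrives shortly afterwards
    gauge.set(1)           # may be skipped if inside the throttle window
    return gauge.value
```

Under these assumptions, `simulate(0.1)` leaves the gauge at 0 because the second update falls inside the 100 ms window, while `simulate(0.0)` leaves it at 1, which is the behavior the test expects.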
Code Review
This pull request addresses a flaky test, test_router_queue_len_metric, by disabling the router queue length gauge throttling during the test. The fix is implemented by introducing a new pytest fixture, disable_router_queue_len_throttle, that sets the RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S environment variable to "0". The change is correct, well-isolated, and follows good testing practices. The code is clean, and the new fixture is well-documented. Overall, this is a good fix for the reported flakiness.
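The effect of such a fixture, a scoped override of the environment variable that is undone afterwards, can be sketched with the standard library alone. This is an illustration under stated assumptions: it uses `unittest.mock.patch.dict` as a stand-in for pytest's `monkeypatch.setenv`, and `throttle_disabled` and `read_throttle_s` are hypothetical helpers, not part of Ray Serve. The 0.1 default mirrors the 100 ms throttle mentioned in the PR. How and when Ray Serve itself reads the variable is not covered here.

```python
import os
from contextlib import contextmanager
from unittest import mock

THROTTLE_ENV = "RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S"


@contextmanager
def throttle_disabled():
    """Set the throttle env var to "0" for the duration of the block, then restore it."""
    with mock.patch.dict(os.environ, {THROTTLE_ENV: "0"}):
        yield


def read_throttle_s(default=0.1):
    """Hypothetical reader: parse the throttle from the environment, in seconds."""
    return float(os.environ.get(THROTTLE_ENV, default))
```

Inside `with throttle_disabled():`, `read_throttle_s()` returns `0.0`; once the block exits, the previous environment is restored, so other tests are unaffected.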
abrarsheikh
reviewed
Jan 20, 2026
    The throttle can cause flakiness: if the gauge is set to 0 on replica init
    and then updated to 1 within the throttle window (100ms), the update is skipped.
    """
    monkeypatch.setenv("RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S", "0")
Let's set this in the Buildkite config, since it applies to all tests.
Per review feedback, set RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S=0 in Buildkite CI configuration instead of using a test-specific fixture, since the throttle can cause flakiness in any serve test that sends requests through the router.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Keep RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S=0 only in the main ":ray-serve: serve: tests" step that runs test_metrics.py.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Set env var on py_test_module_list in BUILD.bazel instead of buildkite config so all CI targets automatically use it. This addresses reviewer feedback to have a single source of truth for the env var rather than replicating it across buildkite steps.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Why are these changes needed?
The `test_router_queue_len_metric` test was flaky because the router queue length gauge has a 100ms throttle (`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates when they happen too quickly.

When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.
Related issue number
Fixes flaky test introduced in #59233 after #60139 added throttling.