[Serve] Fix flaky test_router_queue_len_metric #60333

Merged
aslonnie merged 5 commits into ray-project:master from eicherseiji:fix-flaky-router-queue-len-test
Jan 21, 2026

@eicherseiji
Contributor

Why are these changes needed?

The test_router_queue_len_metric test was flaky because the router queue length gauge has a 100ms throttle (RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S) that can skip updates when they happen too quickly.

When replica initialization sets the gauge to 0 and a request immediately updates it to 1, the second update may be throttled, causing the test to see 0 instead of 1.
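
The race described above can be reproduced with a small, self-contained sketch. `ThrottledGauge`, `FakeClock`, and `simulate` are hypothetical names used only for illustration; this is not Ray Serve's actual gauge implementation, just a model of a gauge that drops updates arriving inside the throttle window.

```python
import time
from typing import Callable, Optional


class ThrottledGauge:
    """Hypothetical gauge that skips updates arriving within `throttle_s`
    seconds of the last accepted update (illustration only, not Ray code)."""

    def __init__(self, throttle_s: float, clock: Callable[[], float] = time.monotonic):
        self.throttle_s = throttle_s
        self._clock = clock
        self._last_accepted: Optional[float] = None
        self.value: Optional[int] = None

    def set(self, new_value: int) -> bool:
        """Record `new_value`; return False if the update was throttled."""
        now = self._clock()
        throttled = (
            self.throttle_s > 0
            and self._last_accepted is not None
            and now - self._last_accepted < self.throttle_s
        )
        if throttled:
            return False
        self.value = new_value
        self._last_accepted = now
        return True


class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def simulate(throttle_s: float) -> Optional[int]:
    """Set the gauge to 0, then to 1 ten milliseconds later; return what is visible."""
    clock = FakeClock()
    gauge = ThrottledGauge(throttle_s, clock)
    gauge.set(0)         # stands in for replica initialization
    clock.advance(0.01)  # a request arrives 10 ms later
    gauge.set(1)         # stands in for the queue-length update
    return gauge.value
```

With a 100ms window, `simulate(0.1)` leaves the gauge at 0 because the second update is dropped; with the throttle set to 0, `simulate(0.0)` reports 1.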

Related issue number

Fixes the flakiness in the test introduced in #59233, which became flaky after #60139 added throttling.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests

The test was flaky because the router queue length gauge has a 100ms
throttle (RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S) that can skip
updates when they happen too quickly. When replica initialization
sets the gauge to 0 and a request immediately updates it to 1, the
second update may be throttled, causing the test to see 0 instead of 1.

Fix by adding a fixture that disables throttling for this test.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji requested a review from a team as a code owner January 20, 2026 19:18
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a flaky test, test_router_queue_len_metric, by disabling the router queue length gauge throttling during the test. The fix is implemented by introducing a new pytest fixture, disable_router_queue_len_throttle, that sets the RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S environment variable to "0". The change is correct, well-isolated, and follows good testing practices. The code is clean, and the new fixture is well-documented. Overall, this is a good fix for the reported flakiness.

@ray-gardener ray-gardener bot added the "serve" label (Ray Serve Related Issue) Jan 20, 2026
@eicherseiji eicherseiji added the "go" label (add ONLY when ready to merge, run all tests) Jan 20, 2026
The throttle can cause flakiness: if the gauge is set to 0 on replica init
and then updated to 1 within the throttle window (100ms), the update is skipped.
"""
monkeypatch.setenv("RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S", "0")
Contributor

let's set this in the buildkite since it applies to all tests
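
For context, the test-scoped override under discussion can be sketched with only the standard library. The fixture in the diff uses pytest's `monkeypatch.setenv`; the version below swaps in `unittest.mock.patch.dict` to get the same effect (set the variable, restore the environment on exit). `throttle_disabled` and `read_throttle_s` are hypothetical helpers, and the `0.1` default is an assumption taken from the 100ms figure in the PR description, not from Ray's source.

```python
import os
from contextlib import contextmanager
from typing import Iterator
from unittest import mock

THROTTLE_ENV_VAR = "RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S"


@contextmanager
def throttle_disabled() -> Iterator[None]:
    """Set the throttle env var to "0" and restore os.environ on exit."""
    with mock.patch.dict(os.environ, {THROTTLE_ENV_VAR: "0"}):
        yield


def read_throttle_s() -> float:
    """Hypothetical reader: parse the throttle in seconds.

    The 0.1 default is an assumption mirroring the 100ms throttle
    mentioned in the PR description.
    """
    return float(os.environ.get(THROTTLE_ENV_VAR, "0.1"))
```

An override like this only helps if the code under test reads the variable after it is set. If the value is read once at import time or in a separate process, it has to be in place before that point, which is one reason to set it in the CI or build configuration instead.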

eicherseiji and others added 3 commits January 20, 2026 23:35
Per review feedback, set RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S=0
in Buildkite CI configuration instead of using a test-specific fixture,
since the throttle can cause flakiness in any serve test that sends
requests through the router.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Keep RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S=0 only in the main
":ray-serve: serve: tests" step that runs test_metrics.py.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Set env var on py_test_module_list in BUILD.bazel instead of
buildkite config so all CI targets automatically use it.

This addresses reviewer feedback to have a single source of truth
for the env var rather than replicating it across buildkite steps.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@aslonnie aslonnie merged commit 80eb45f into ray-project:master Jan 21, 2026
4 of 6 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026