[ML] Fix flaky BWC rolling-upgrade ML DFA tests: increase stop_data_frame_analytics timeout#144926

Draft
valeriy42 wants to merge 1 commit into elastic:main from valeriy42:fix/bwc-dfa-stop-timeout
Conversation

@valeriy42 (Contributor)

During a rolling upgrade the persistent-task framework must detect that a task-owning node has left the cluster, wait for it to be confirmed gone, and then re-assign the persistent task to a remaining node. This redistribution path is noticeably slower in a mixed-version cluster than in a stable cluster, because coordination between old and new nodes involves additional negotiation. The stop_data_frame_analytics API internally waits up to 1 minute for all persistent tasks to reach the stopped state. When that 1-minute server-side window is exhausted, it returns HTTP 500 with "Timed out when waiting for persistent tasks after 1m".

The previous YAML timeout of "60s" on all stop_data_frame_analytics calls was effectively equal to the server-side deadline, leaving no margin for the HTTP round-trip or for any additional latency introduced by the mixed-cluster environment. When both timeouts fired at the same instant, the test framework received a 500 and recorded a failure. This PR raises the client-facing timeout to "3m" on all three stop_data_frame_analytics calls in mixed_cluster/90_ml_data_frame_analytics_crud.yml (for old_cluster_outlier_detection_job, old_cluster_regression_job, and mixed_cluster_outlier_detection_job), giving the server's internal retry and persistent-task reassignment logic enough time to complete before the request is considered failed.
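For illustration, the change in each affected test body would look roughly like the following sketch. This follows the Elasticsearch YAML REST test conventions for a `do` block; the job id shown is one of the three named above, and the surrounding test steps are omitted:

```yaml
# Sketch of the timeout change in mixed_cluster/90_ml_data_frame_analytics_crud.yml
# (abbreviated; only the stop call is shown)
- do:
    ml.stop_data_frame_analytics:
      id: "old_cluster_outlier_detection_job"
      timeout: "3m"   # previously "60s", equal to the server-side 1m persistent-task wait
```

The key point is that the client-facing `timeout` parameter now comfortably exceeds the server's internal 1-minute wait, so a slow mixed-cluster reassignment surfaces as a successful late response rather than a 500.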

The fix is intentionally conservative: 3 minutes is well below the 5-minute suite timeout annotation and matches the pattern used by other long-running ML stop calls in the rolling-upgrade test corpus. The 0.2% failure rate reported in #139654 (2 failures in 1000 executions on main) confirms this is a genuine timing edge case that the larger margin reliably eliminates.

Closes #139654

During a rolling upgrade the persistent-task framework must redistribute
tasks across a partially-upgraded cluster, which can exceed the previous
60 s stop timeout. Raising all stop_data_frame_analytics calls in the
mixed_cluster test suite from 60s to 3m gives enough headroom above the
server-side 1-minute persistent-task wait plus node-recovery time.

Closes elastic#139654

Made-with: Cursor
Linked issue: [CI] UpgradeClusterClientYamlTestSuiteIT test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Start and stop old regression job} failing