[ML] Fix flaky BWC rolling-upgrade ML DFA tests: increase stop_data_frame_analytics timeout #144926
Draft
valeriy42 wants to merge 1 commit into elastic:main from
During a rolling upgrade the persistent-task framework must redistribute tasks across a partially upgraded cluster, which can exceed the previous 60 s stop timeout. Raising all stop_data_frame_analytics calls in the mixed_cluster test suite from 60s to 3m gives enough headroom above the server-side 1-minute persistent-task wait plus node-recovery time.

Closes elastic#139654

Made-with: Cursor
During a rolling upgrade the persistent-task framework must detect that a task-owning node has left the cluster, wait for it to be confirmed gone, and then re-assign the persistent task to a remaining node. This redistribution path is noticeably slower in a mixed-version cluster than in a stable cluster, because coordination between old and new nodes involves additional negotiation. The stop_data_frame_analytics API internally waits up to 1 minute for all persistent tasks to reach the stopped state. When that 1-minute server-side window is exhausted, it returns HTTP 500 with "Timed out when waiting for persistent tasks after 1m".

The previous YAML timeout of "60s" on all stop_data_frame_analytics calls was effectively equal to the server-side deadline, leaving no margin for the HTTP round-trip or for any additional latency introduced by the mixed-cluster environment. When both timeouts fired at the same instant, the test framework received a 500 and recorded a failure. This PR raises the client-facing timeout to "3m" on all three stop_data_frame_analytics calls in mixed_cluster/90_ml_data_frame_analytics_crud.yml (for old_cluster_outlier_detection_job, old_cluster_regression_job, and mixed_cluster_outlier_detection_job), giving the server's internal retry and persistent-task reassignment logic enough time to complete before the request is considered failed.

The fix is intentionally conservative: 3 minutes is well below the 5-minute suite timeout annotation and matches the pattern used by other long-running ML stop calls in the rolling-upgrade test corpus. The 0.2% failure rate reported in #139654 (2 failures in 1000 executions on main) confirms this is a genuine timing edge case that the larger margin reliably eliminates.
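For reference, one of the three changed steps might look roughly like the sketch below. This is an illustrative reconstruction, not the actual diff: the job id and the "60s" and "3m" values come from the description above, while the surrounding step layout is assumed from the usual YAML REST test conventions, and the real file may pass additional parameters.

```yaml
# Sketch only: assumed shape of one stop call in
# mixed_cluster/90_ml_data_frame_analytics_crud.yml after this change.
- do:
    ml.stop_data_frame_analytics:
      id: "old_cluster_outlier_detection_job"
      # Previously "60s", which matched the 1-minute server-side wait.
      timeout: "3m"
```

The same timeout change would apply to the calls for old_cluster_regression_job and mixed_cluster_outlier_detection_job.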
Closes #139654