[ML] Fix flaky BWC rolling-upgrade ML DFA tests: increase stop_data_frame_analytics timeout#144926

Draft
valeriy42 wants to merge 1 commit into elastic:main from valeriy42:fix/bwc-dfa-stop-timeout
Conversation

@valeriy42 (Contributor)

During a rolling upgrade the persistent-task framework must detect that a task-owning node has left the cluster, wait for it to be confirmed gone, and then re-assign the persistent task to a remaining node. This redistribution path is noticeably slower in a mixed-version cluster than in a stable cluster, because coordination between old and new nodes involves additional negotiation. The stop_data_frame_analytics API internally waits up to 1 minute for all persistent tasks to reach the stopped state. When that 1-minute server-side window is exhausted, it returns HTTP 500 with "Timed out when waiting for persistent tasks after 1m".

The previous YAML timeout of "60s" on all stop_data_frame_analytics calls was effectively equal to the server-side deadline, leaving no margin for the HTTP round-trip or for any additional latency introduced by the mixed-cluster environment. When both timeouts fired at the same instant, the test framework received a 500 and recorded a failure. This PR raises the client-facing timeout to "3m" on all three stop_data_frame_analytics calls in mixed_cluster/90_ml_data_frame_analytics_crud.yml (for old_cluster_outlier_detection_job, old_cluster_regression_job, and mixed_cluster_outlier_detection_job), giving the server's internal retry and persistent-task reassignment logic enough time to complete before the request is considered failed.
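For illustration, the change in each affected test body would look roughly like the following sketch. This follows the Elasticsearch YAML REST test conventions for a `do` block; the job id shown is one of the three named above, and the surrounding test steps are omitted:

```yaml
# Sketch of the timeout change in mixed_cluster/90_ml_data_frame_analytics_crud.yml
# (abbreviated; only the stop call is shown)
- do:
    ml.stop_data_frame_analytics:
      id: "old_cluster_outlier_detection_job"
      timeout: "3m"   # previously "60s", equal to the server-side 1m persistent-task wait
```

The key point is that the client-facing `timeout` parameter now comfortably exceeds the server's internal 1-minute wait, so a slow mixed-cluster reassignment surfaces as a successful late response rather than a 500.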

The fix is intentionally conservative: 3 minutes is well below the 5-minute suite timeout annotation and matches the pattern used by other long-running ML stop calls in the rolling-upgrade test corpus. The 0.2% failure rate reported in #139654 (2 failures in 1000 executions on main) confirms this is a genuine timing edge case that the larger margin reliably eliminates.

Closes #139654

During a rolling upgrade the persistent-task framework must redistribute
tasks across a partially-upgraded cluster, which can exceed the previous
60 s stop timeout. Raising all stop_data_frame_analytics calls in the
mixed_cluster test suite from 60s to 3m gives enough headroom above the
server-side 1-minute persistent-task wait plus node-recovery time.

Closes elastic#139654

Made-with: Cursor
Linked issue: [CI] UpgradeClusterClientYamlTestSuiteIT test {p0=mixed_cluster/90_ml_data_frame_analytics_crud/Start and stop old regression job} failing