Increase Default Value of TerminationDelay #199

Merged
unmarshall merged 2 commits into ai-dynamo:main from nvrohanv:nvrohanv/chang-default-termination-delay
Sep 23, 2025

@nvrohanv

Contributor

This PR increases the default value of terminationDelay from 30 seconds to 4 hours.

Rationale:
terminationDelay is tied to gang termination semantics, which are meant to prevent large amounts of resources from staying reserved when a system is non-functional (e.g., when minAvailable is breached). The delay provides a grace period for the system to heal before triggering termination.

The current default of 30 seconds is too short and often disruptive. Kubernetes users are accustomed to workloads continually trying to self-heal, so shutting things down after 30 seconds is jarring. In practice, we’ve also seen inference workloads terminate before they had time to fully initialize—especially when models are large. This usually stems from user or application error: worker pods mark themselves as Ready before the model has actually finished loading, either because startup probes are not configured or because the inference framework itself signals readiness too early, before completing all initialization.

Nevertheless, we want terminationDelay and gang termination semantics to stay out of the user’s way by default. The goal is for users to be deliberate about opting in and explicitly deciding how long a workload should be given to self-heal before being considered irreparably broken. By raising the default to 4 hours, we provide enough time for workloads with heavy initialization—such as loading very large models like DeepSeek-R1—to complete without being prematurely torn down. At the same time, the system will still eventually terminate if it remains in a broken state for an extended period, preserving the original intent of gang termination.

…in for most cases

Signed-off-by: Rohan Varma <rohanv@nvidia.com>
@unmarshall unmarshall force-pushed the nvrohanv/chang-default-termination-delay branch from 531199e to 8c5a51a Compare September 23, 2025 04:01
@unmarshall unmarshall merged commit 9b31e75 into ai-dynamo:main Sep 23, 2025
4 checks passed