Increase Default Value of TerminationDelay by nvrohanv · Pull Request #199 · ai-dynamo/grove

nvrohanv · 2025-09-22T02:16:25Z

This PR increases the default value of terminationDelay from 30 seconds to 4 hours.

Rationale:
terminationDelay is tied to gang termination semantics, which are meant to prevent large amounts of resources from staying reserved when a system is non-functional (e.g., when minAvailable is breached). The delay provides a grace period for the system to heal before triggering termination.

The current default of 30 seconds is too short and often disruptive. Kubernetes users are accustomed to workloads continually trying to self-heal, so shutting things down after 30 seconds is jarring. In practice, we've also seen inference workloads terminate before they had time to fully initialize, especially when models are large. This usually stems from user or application error: worker pods mark themselves as Ready before the model has actually finished loading, either because startup probes are not configured or because the inference framework itself signals readiness too early.

Nevertheless, we want terminationDelay and gang termination semantics to stay out of the user's way by default. The goal is for users to opt in deliberately and decide explicitly how long a workload should be given to self-heal before it is considered irreparably broken. Raising the default to 4 hours gives workloads with heavy initialization, such as loading very large models like DeepSeek-R1, enough time to complete without being prematurely torn down. At the same time, the system will still eventually terminate a workload that remains in a broken state for an extended period, preserving the original intent of gang termination.
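To make the effect of the new default concrete, here is a minimal sketch of the grace-period check described above. It is not Grove's implementation: the function `should_gang_terminate`, its parameters, and the two constants are hypothetical names chosen for illustration, and the sketch assumes termination triggers once a minAvailable breach has lasted at least terminationDelay.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical constants mirroring the two values named in this PR.
OLD_DEFAULT_TERMINATION_DELAY = timedelta(seconds=30)
NEW_DEFAULT_TERMINATION_DELAY = timedelta(hours=4)


def should_gang_terminate(
    breached_since: Optional[datetime],
    now: datetime,
    termination_delay: Optional[timedelta] = None,
) -> bool:
    """Return True once the breach has lasted at least the delay.

    breached_since is None while the workload is healthy (assumption:
    the caller tracks when minAvailable was first breached).
    """
    if breached_since is None:
        return False
    # Fall back to the new default when the user has not set a delay.
    if termination_delay is None:
        termination_delay = NEW_DEFAULT_TERMINATION_DELAY
    return now - breached_since >= termination_delay


if __name__ == "__main__":
    breach = datetime(2025, 9, 22, 2, 0, tzinfo=timezone.utc)
    # A model that is still loading 20 minutes after the breach was observed.
    now = breach + timedelta(minutes=20)
    print(should_gang_terminate(breach, now, OLD_DEFAULT_TERMINATION_DELAY))  # True
    print(should_gang_terminate(breach, now))  # False
```

With the old 30-second default, the workload in this example would already have been torn down. With the 4-hour default it keeps its grace period, and a user who wants stricter behavior can still pass a shorter delay explicitly.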

nvrohanv requested review from sanjaychatterjee and unmarshall as code owners September 22, 2025 02:16

nvrohanv added 2 commits September 23, 2025 09:30

increase default value of terminationDelay to make it more of an opt …

b18b3ad

…in for most cases Signed-off-by: Rohan Varma <rohanv@nvidia.com>

fix api reference

8c5a51a

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

unmarshall force-pushed the nvrohanv/chang-default-termination-delay branch from 531199e to 8c5a51a on September 23, 2025 04:01

unmarshall approved these changes Sep 23, 2025

unmarshall merged commit 9b31e75 into ai-dynamo:main Sep 23, 2025
4 checks passed

nvrohanv mentioned this pull request Oct 6, 2025

Update groveTerminationDelay to ensure it doesn't trigger for now ai-dynamo/dynamo#3437

Closed

unmarshall merged 2 commits into ai-dynamo:main from nvrohanv:nvrohanv/chang-default-termination-delay
