Wire SIGTERM-aware graceful shutdown in thv-proxyrunner entrypoint #4207
Description
Wire signal.NotifyContext for SIGTERM, SIGQUIT, and os.Interrupt in cmd/thv-proxyrunner/main.go and propagate the resulting context through cobra.ExecuteContext so that Kubernetes pod termination triggers graceful connection drain rather than abrupt process exit. Currently, main.go calls app.NewRootCmd().Execute() with no signal-aware context, meaning SIGTERM from the pod lifecycle will terminate the process immediately without giving in-flight MCP proxy connections a chance to drain.
Context
This task implements RC-14 from RFC THV-0047 (Manual Horizontal Scaling for vMCP and Proxy Runner). When the proxyrunner is deployed as multiple StatefulSet replicas, Kubernetes sends SIGTERM to pods during rolling updates, scale-down events, and node drains. Without a signal-aware context, in-flight MCP sessions are dropped at the transport layer without going through the graceful drain path in transportHandler.Stop. The fix is a small, self-contained change to main.go that mirrors the existing pattern in cmd/vmcp/main.go. The operator is responsible for setting an appropriate terminationGracePeriodSeconds; the proxyrunner simply stops accepting new connections and waits for transportHandler.Stop to complete.
Dependencies: None
Blocks: None (TASK-003 is independent of all other tasks in this epic)
Acceptance Criteria
- `cmd/thv-proxyrunner/main.go` creates a `signal.NotifyContext` cancellable on `os.Interrupt`, `syscall.SIGTERM`, and `syscall.SIGQUIT`
- `defer cancel()` is called immediately after `signal.NotifyContext` to ensure the OS signal listener is released on exit
- `app.NewRootCmd().ExecuteContext(ctx)` is called instead of `app.NewRootCmd().Execute()`, passing the signal-aware context
- `runCmdFunc` in `cmd/thv-proxyrunner/app/run.go` already reads `cmd.Context()`; no changes are needed there, and the context flows through without modification
- On SIGTERM, the proxyrunner stops accepting new requests and waits for `transportHandler.Stop` to complete before exiting (exercised by the existing context-cancellation path in `runner.Run`)
- Error handling on `ExecuteContext` matches the vmcp pattern: log the error via `slog.Error` and call `os.Exit(1)`
- No signal channels or additional goroutines are created in `main.go`; `signal.NotifyContext` is the single mechanism
- Existing single-replica proxyrunner behaviour is unchanged (context cancellation on CTRL-C already works; this extends the same mechanism to SIGTERM/SIGQUIT)
- All existing tests pass
Technical Approach
Recommended Implementation
Replace the current app.NewRootCmd().Execute() call in cmd/thv-proxyrunner/main.go with the signal.NotifyContext + ExecuteContext pattern taken verbatim from cmd/vmcp/main.go. The context flows through the existing chain: ExecuteContext → runCmdFunc (which reads cmd.Context()) → mcpRunner.Run(ctx) → the select { case <-ctx.Done() } block that calls stopMCPServer, which calls transportHandler.Stop. No changes to the app package are required.
Patterns & Frameworks
- `signal.NotifyContext` from `os/signal` (stdlib): creates a context that is cancelled when the specified OS signals arrive; this is the idiomatic Go pattern for signal-driven graceful shutdown since Go 1.16
- `syscall.SIGTERM` and `syscall.SIGQUIT` from `syscall` (stdlib): standard Kubernetes termination signals; SIGQUIT additionally triggers a goroutine dump useful for debugging stuck drain scenarios
- `cobra.Command.ExecuteContext`: passes the context to all subcommand `RunE` functions via `cmd.Context()`; `runCmdFunc` already calls `cmd.Context()` on line 77 of `run.go`, so no changes to the command handler are needed
- Operator controls drain duration via `terminationGracePeriodSeconds` on the proxyrunner Deployment/Pod spec; the proxyrunner itself does not set an internal exit deadline
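For illustration, the operator-side knob is the standard Kubernetes pod-spec field shown below. The surrounding manifest for the proxyrunner lives in the operator and may differ; the value `60` is an arbitrary example, not a recommendation from the RFC:

```yaml
# Pod template fragment (illustrative). Kubernetes waits this many
# seconds between SIGTERM and SIGKILL, which bounds how long
# transportHandler.Stop has to drain in-flight connections.
spec:
  terminationGracePeriodSeconds: 60
```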
Code Pointers
- `cmd/thv-proxyrunner/main.go`: the only file that needs to change; replace `Execute()` with `signal.NotifyContext` + `ExecuteContext`; add imports for `context`, `fmt`, `os/signal`, and `syscall`
- `cmd/vmcp/main.go`: reference implementation to follow exactly; lines 31–38 show the complete pattern
- `cmd/thv-proxyrunner/app/run.go:77`: `ctx := cmd.Context()` already reads the context from Cobra; no change needed here
- `pkg/runner/runner.go:529–531`: `select { case <-ctx.Done(): stopMCPServer("Context cancelled") }` is the existing drain path that triggers `transportHandler.Stop`; it fires automatically once the signal-aware context is cancelled
Component Interfaces
The change is entirely in `cmd/thv-proxyrunner/main.go`. The final `main` function should match this structure:

```go
func main() {
	// Bind TOOLHIVE_DEBUG env var early, before logger initialization.
	if err := viper.BindEnv("debug", "TOOLHIVE_DEBUG"); err != nil {
		slog.Error("failed to bind TOOLHIVE_DEBUG env var", "error", err)
	}

	// Initialize the logger
	var opts []logging.Option
	if viper.GetBool("debug") {
		opts = append(opts, logging.WithLevel(slog.LevelDebug))
	}
	l := logging.New(opts...)
	slog.SetDefault(l)

	// Create a context that will be canceled on signal
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM, syscall.SIGQUIT)
	defer cancel()

	// Execute the root command with context
	if err := app.NewRootCmd().ExecuteContext(ctx); err != nil {
		slog.Error(fmt.Sprintf("Error executing command: %v", err))
		os.Exit(1)
	}
}
```

New imports required: `"context"`, `"fmt"`, `"os/signal"`, and `"syscall"`, in addition to the existing imports already present in the file.
Testing Strategy
Unit Tests
- No unit tests are required for `main.go` directly (it is an entrypoint with no testable logic beyond the signal wiring); the existing runner tests cover the drain path
- Verify that `runCmdFunc` receives a non-nil context via `cmd.Context()`; this is already exercised by existing runner unit tests in `pkg/runner/runner_test.go`
Integration Tests
- Manual / e2e: send SIGTERM to a running proxyrunner pod; verify the process does not exit until in-flight proxy connections close or `terminationGracePeriodSeconds` elapses
- Manual: send `os.Interrupt` (CTRL-C) to a locally running proxyrunner; verify clean exit (existing behaviour should be preserved)
Edge Cases
- Double-signal: a second SIGTERM after the first is handled by `signal.NotifyContext` natively (the context is already cancelled; no additional action is needed)
- Signal received before `transportHandler.Start`: `ctx.Done()` fires before the transport is running; `stopMCPServer` is never called because the `select` is reached only after `Start` returns. Verify the process exits cleanly in this path
Out of Scope
- Operator-side `terminationGracePeriodSeconds` configuration: this is an operator concern, not a proxyrunner code change
- vMCP-side graceful shutdown: covered by the vMCP epic
- Any changes to `cmd/thv-proxyrunner/app/run.go`, `cmd/thv-proxyrunner/app/commands.go`, or `pkg/runner/runner.go`: the context chain and drain logic already exist
- Custom exit-deadline timeout inside the proxyrunner process: the operator controls the drain window via `terminationGracePeriodSeconds`
References
- RFC THV-0047: "RFC: Horizontal Scaling for vMCP and Proxy Runner" (toolhive-rfcs#47); the source RFC, where RC-14 defines the SIGTERM drain requirement
- Parent epic: https://github.com/stacklok/stacklok-epics/issues/263
- Reference implementation: `cmd/vmcp/main.go` in `stacklok/toolhive`; identical signal-context pattern to apply
- `signal.NotifyContext` Go documentation: https://pkg.go.dev/os/signal#NotifyContext