Skip to content

Wire SIGTERM-aware graceful shutdown in thv-proxyrunner entrypoint #4207

@yrobla

Description

@yrobla

Description

Wire signal.NotifyContext for SIGTERM, SIGQUIT, and os.Interrupt in cmd/thv-proxyrunner/main.go and propagate the resulting context through cobra.ExecuteContext so that Kubernetes pod termination triggers graceful connection drain rather than abrupt process exit. Currently, main.go calls app.NewRootCmd().Execute() with no signal-aware context, meaning SIGTERM from the pod lifecycle will terminate the process immediately without giving in-flight MCP proxy connections a chance to drain.

Context

This task implements RC-14 from RFC THV-0047 (Manual Horizontal Scaling for vMCP and Proxy Runner). When the proxyrunner is deployed as multiple StatefulSet replicas, Kubernetes sends SIGTERM to pods during rolling updates, scale-down events, and node drains. Without a signal-aware context, in-flight MCP sessions are dropped at the transport layer without going through the graceful drain path in transportHandler.Stop. The fix is a small, self-contained change to main.go that mirrors the existing pattern in cmd/vmcp/main.go. The operator is responsible for setting an appropriate terminationGracePeriodSeconds; the proxyrunner simply stops accepting new connections and waits for transportHandler.Stop to complete.

Dependencies: None
Blocks: None (TASK-003 is independent of all other tasks in this epic)

Acceptance Criteria

  • cmd/thv-proxyrunner/main.go creates a signal.NotifyContext cancellable on os.Interrupt, syscall.SIGTERM, and syscall.SIGQUIT
  • defer cancel() is called immediately after signal.NotifyContext to ensure the OS signal listener is released on exit
  • app.NewRootCmd().ExecuteContext(ctx) is called instead of app.NewRootCmd().Execute(), passing the signal-aware context
  • runCmdFunc in cmd/thv-proxyrunner/app/run.go already reads cmd.Context() — no changes needed there; the context flows through without modification
  • On SIGTERM, the proxyrunner stops accepting new requests and waits for transportHandler.Stop to complete before exiting (exercised by the existing context-cancellation path in runner.Run)
  • Error handling on ExecuteContext matches the vmcp pattern: log the error via slog.Error and call os.Exit(1)
  • No signal channels or additional goroutines are created in main.gosignal.NotifyContext is the single mechanism
  • Existing single-replica proxyrunner behaviour is unchanged (context cancellation on CTRL-C already works; this extends the same mechanism to SIGTERM/SIGQUIT)
  • All existing tests pass

Technical Approach

Recommended Implementation

Replace the current app.NewRootCmd().Execute() call in cmd/thv-proxyrunner/main.go with the signal.NotifyContext + ExecuteContext pattern taken verbatim from cmd/vmcp/main.go. The context flows through the existing chain: ExecuteContextrunCmdFunc (which reads cmd.Context()) → mcpRunner.Run(ctx) → the select { case <-ctx.Done() } block that calls stopMCPServer, which calls transportHandler.Stop. No changes to the app package are required.

Patterns & Frameworks

  • signal.NotifyContext from os/signal (stdlib) — creates a context that is cancelled when the specified OS signals arrive; this is the idiomatic Go pattern for signal-driven graceful shutdown since Go 1.16
  • syscall.SIGTERM and syscall.SIGQUIT from syscall (stdlib) — standard Kubernetes termination signals; SIGQUIT additionally triggers a goroutine dump useful for debugging stuck drain scenarios
  • cobra.Command.ExecuteContext — passes the context to all subcommand RunE functions via cmd.Context(); runCmdFunc already calls cmd.Context() on line 77 of run.go, so no changes to the command handler are needed
  • Operator controls drain duration via terminationGracePeriodSeconds on the proxyrunner Deployment/Pod spec — the proxyrunner itself does not set an internal exit deadline

Code Pointers

  • cmd/thv-proxyrunner/main.go — the only file that needs to change; replace Execute() with signal.NotifyContext + ExecuteContext; add imports for context, fmt, os/signal, and syscall
  • cmd/vmcp/main.go — reference implementation to follow exactly; lines 31–38 show the complete pattern
  • cmd/thv-proxyrunner/app/run.go:77ctx := cmd.Context() already reads the context from Cobra; no change needed here
  • pkg/runner/runner.go:529–531select { case <-ctx.Done(): stopMCPServer("Context cancelled") } is the existing drain path that triggers transportHandler.Stop; it fires automatically once the signal-aware context is cancelled

Component Interfaces

The change is entirely in cmd/thv-proxyrunner/main.go. The final main function should match this structure:

func main() {
	// Bind TOOLHIVE_DEBUG env var early, before logger initialization.
	if err := viper.BindEnv("debug", "TOOLHIVE_DEBUG"); err != nil {
		slog.Error("failed to bind TOOLHIVE_DEBUG env var", "error", err)
	}

	// Initialize the logger
	var opts []logging.Option
	if viper.GetBool("debug") {
		opts = append(opts, logging.WithLevel(slog.LevelDebug))
	}
	l := logging.New(opts...)
	slog.SetDefault(l)

	// Create a context that will be canceled on signal
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM, syscall.SIGQUIT)
	defer cancel()

	// Execute the root command with context
	if err := app.NewRootCmd().ExecuteContext(ctx); err != nil {
		slog.Error(fmt.Sprintf("Error executing command: %v", err))
		os.Exit(1)
	}
}

New imports required: "context", "fmt", "os/signal", "syscall" (in addition to the existing imports already present in the file).

Testing Strategy

Unit Tests

  • No unit tests are required for main.go directly (it is an entrypoint with no testable logic beyond the signal wiring); the existing runner tests cover the drain path
  • Verify that runCmdFunc receives a non-nil context via cmd.Context() — this is already exercised by existing runner unit tests in pkg/runner/runner_test.go

Integration Tests

  • Manual / e2e: send SIGTERM to a running proxyrunner pod; verify the process does not exit until in-flight proxy connections close or the terminationGracePeriodSeconds elapses
  • Manual: send os.Interrupt (CTRL-C) to a locally running proxyrunner; verify clean exit (existing behaviour should be preserved)

Edge Cases

  • Double-signal: sending a second SIGTERM after the first should be handled by signal.NotifyContext natively (the context is already cancelled; no additional action needed)
  • Signal received before transportHandler.Start: ctx.Done() fires before the transport is running; stopMCPServer is never called because the select is reached only after Start returns — verify the process exits cleanly in this path

Out of Scope

  • Operator-side terminationGracePeriodSeconds configuration — this is an operator concern, not a proxyrunner code change
  • vMCP-side graceful shutdown — covered by the vMCP epic
  • Any changes to cmd/thv-proxyrunner/app/run.go, cmd/thv-proxyrunner/app/commands.go, or pkg/runner/runner.go — the context chain and drain logic already exist
  • Custom exit-deadline timeout inside the proxyrunner process — the operator controls the drain window via terminationGracePeriodSeconds

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestgoPull requests that update go codekubernetesItems related to KubernetesscalabilityItems related to scalabilityvmcpVirtual MCP Server related issues

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions