fix: initc panic on pod update after termination by steved · Pull Request #547 · ai-dynamo/grove

steved · 2026-04-21T19:26:11Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

grove-initc panic: send on closed channel [recovered, repanicked]
grove-initc
grove-initc goroutine 201 [running]:
grove-initc k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x17d8688, 0xc0006f18f0}, {0x137e620, 0x17baf30}, {0x0, 0x0, 0x2?})
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/runtime/runtime.go:114 +0x1a9
grove-initc k8s.io/apimachinery/pkg/util/runtime.HandleCrashWithLogger({{0x17db960?, 0xc0006ab380?}, 0xc000935c38?}, {0x0, 0x0, 0x0})
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/runtime/runtime.go:91 +0x115
grove-initc panic({0x137e620?, 0x17baf30?})
grove-initc 	/usr/local/go/src/runtime/panic.go:783 +0x132
grove-initc github.com/ai-dynamo/grove/operator/initc/internal.(*ParentPodCliqueDependencies).registerEventHandler.func1({0x15b5f20?, 0xc0008e1208?})
grove-initc 	/go/src/github.com/ai-dynamo/grove/operator/initc/internal/wait.go:203 +0x186
grove-initc k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/controller.go:259
grove-initc k8s.io/client-go/tools/cache.(*processorListener).run.func1(0xc000545f5f, 0xc0006f6820, {0x142b900?, 0xc0003ba468?})
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/shared_informer.go:1076 +0xd1
grove-initc k8s.io/client-go/tools/cache.(*processorListener).run(0xc0006f6820)
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/shared_informer.go:1086 +0x3c
grove-initc k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/wait/wait.go:72 +0x4c
grove-initc created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 292
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/wait/wait.go:70 +0x73

If WaitForReady returns and allReadyCh is closed, but, before the process exits, a shared informer event happens that indicates a podclique is ready, it'll attempt to send on a closed channel.

I've gone the route of closing the factory, but we could also just not close allReadyCh and the process exit will clean everything up.

Which issue(s) this PR fixes:

Fixes: #548.

Special notes for your reviewer:

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

gflarity · 2026-04-21T21:33:42Z

Thanks for the fix @steved. One small nit, otherwise LGTM.

steved · 2026-04-21T21:58:18Z

Thanks for the fix @steved. One small nit, otherwise LGTM.

Thanks for the review! I think the comment might've been lost? I don't see any notes.

gflarity · 2026-04-21T23:38:36Z

@steved I forgot to submit, but I looked at it again and it's not worth the nit. Thanks!

renormalize

thanks for the fix @steved!

I was clearly mistaken when I assumed that ensuring the right order of canceling the context and then subsequently closing the channel would suffice.

Thanks for also changing the unnecessarily buffered channel to an unbuffered one. The

	select {
	case c.allReadyCh <- struct{}{}:
	default: // already notified
	}

pattern wouldn't have worked with a buffered channel.

I think my original intent was to make the channel buffered since simultaneous events might cause multiple informers to write to the channel, and I did not want them to be blocked. But that does not matter anyway, as long as one informer routine writes to allReadyCh.

sanjaychatterjee

Thanks for the fix. Merging.

fix: initc panic on pod update after termination

91fd6bc

steved requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners April 21, 2026 19:26

gflarity approved these changes Apr 21, 2026

View reviewed changes

renormalize approved these changes Apr 22, 2026

View reviewed changes

sanjaychatterjee approved these changes Apr 22, 2026

View reviewed changes

sanjaychatterjee merged commit a2183ea into ai-dynamo:main Apr 22, 2026
15 of 16 checks passed

SAY-5 mentioned this pull request Apr 22, 2026

fix(initc): drain informer goroutines before closing allReadyCh #549

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: initc panic on pod update after termination#547

fix: initc panic on pod update after termination#547
sanjaychatterjee merged 1 commit into
ai-dynamo:mainfrom
steved:steved/initc-panic

steved commented Apr 21, 2026 •

edited

Loading

Uh oh!

gflarity commented Apr 21, 2026

Uh oh!

steved commented Apr 21, 2026

Uh oh!

gflarity commented Apr 21, 2026

Uh oh!

renormalize left a comment

Uh oh!

sanjaychatterjee left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

steved commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a API change?

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Uh oh!

gflarity commented Apr 21, 2026

Uh oh!

steved commented Apr 21, 2026

Uh oh!

gflarity commented Apr 21, 2026

Uh oh!

renormalize left a comment

Choose a reason for hiding this comment

Uh oh!

sanjaychatterjee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

steved commented Apr 21, 2026 •

edited

Loading