Skip to content

fix: initc panic on pod update after termination#547

Merged
sanjaychatterjee merged 1 commit into
ai-dynamo:mainfrom
steved:steved/initc-panic
Apr 22, 2026
Merged

fix: initc panic on pod update after termination#547
sanjaychatterjee merged 1 commit into
ai-dynamo:mainfrom
steved:steved/initc-panic

Conversation

@steved

@steved steved commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

grove-initc panic: send on closed channel [recovered, repanicked]
grove-initc
grove-initc goroutine 201 [running]:
grove-initc k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x17d8688, 0xc0006f18f0}, {0x137e620, 0x17baf30}, {0x0, 0x0, 0x2?})
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/runtime/runtime.go:114 +0x1a9
grove-initc k8s.io/apimachinery/pkg/util/runtime.HandleCrashWithLogger({{0x17db960?, 0xc0006ab380?}, 0xc000935c38?}, {0x0, 0x0, 0x0})
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/runtime/runtime.go:91 +0x115
grove-initc panic({0x137e620?, 0x17baf30?})
grove-initc 	/usr/local/go/src/runtime/panic.go:783 +0x132
grove-initc github.com/ai-dynamo/grove/operator/initc/internal.(*ParentPodCliqueDependencies).registerEventHandler.func1({0x15b5f20?, 0xc0008e1208?})
grove-initc 	/go/src/github.com/ai-dynamo/grove/operator/initc/internal/wait.go:203 +0x186
grove-initc k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/controller.go:259
grove-initc k8s.io/client-go/tools/cache.(*processorListener).run.func1(0xc000545f5f, 0xc0006f6820, {0x142b900?, 0xc0003ba468?})
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/shared_informer.go:1076 +0xd1
grove-initc k8s.io/client-go/tools/cache.(*processorListener).run(0xc0006f6820)
grove-initc 	/go/pkg/mod/k8s.io/client-go@v0.34.2/tools/cache/shared_informer.go:1086 +0x3c
grove-initc k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/wait/wait.go:72 +0x4c
grove-initc created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 292
grove-initc 	/go/pkg/mod/k8s.io/apimachinery@v0.34.2/pkg/util/wait/wait.go:70 +0x73

If WaitForReady returns and allReadyCh is closed, but, before the process exits, a shared informer event happens that indicates a podclique is ready, it'll attempt to send on a closed channel.

I've gone the route of closing the factory, but we could also just not close allReadyCh and the process exit will clean everything up.

Which issue(s) this PR fixes:

Fixes: #548.

Special notes for your reviewer:

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@gflarity

Copy link
Copy Markdown
Contributor

Thanks for the fix @steved. One small nit, otherwise LGTM.

@steved

steved commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the fix @steved. One small nit, otherwise LGTM.

Thanks for the review! I think the comment might've been lost? I don't see any notes.

@gflarity

Copy link
Copy Markdown
Contributor

@steved I forgot to submit, but I looked at it again and it's not worth the nit. Thanks!

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the fix @steved!

I was clearly mistaken when I assumed that ensuring the right order of canceling the context and then subsequently closing the channel would suffice.

Thanks for also changing the unnecessarily buffered channel to an unbuffered one. The

	select {
	case c.allReadyCh <- struct{}{}:
	default: // already notified
	}

pattern wouldn't have worked with a buffered channel.

I think my original intent was to make the channel buffered since simultaneous events might cause multiple informers to write to the channel, and I did not want them to be blocked. But that does not matter anyway, as long as one informer routine writes to allReadyCh.

@sanjaychatterjee sanjaychatterjee left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Merging.

@sanjaychatterjee sanjaychatterjee merged commit a2183ea into ai-dynamo:main Apr 22, 2026
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: initc panic on pod update after termination

4 participants