Skip to content

refactor(mnnvl): replace two-annotation model with single mnnvl-group annotation#574

Merged
shmuel-runai merged 3 commits into
ai-dynamo:mainfrom
shmuel-runai:grove-417/single-annotation-refactor
May 6, 2026
Merged

refactor(mnnvl): replace two-annotation model with single mnnvl-group annotation#574
shmuel-runai merged 3 commits into
ai-dynamo:mainfrom
shmuel-runai:grove-417/single-annotation-refactor

Conversation

@shmuel-runai

Copy link
Copy Markdown
Contributor

/kind feature

Ref #417

Summary

  • Remove grove.io/auto-mnnvl annotation — use grove.io/mnnvl-group as the sole annotation controlling MNNVL enrollment and group assignment
  • Not backward compatible with Phase 1 (acceptable since the feature is still in alpha)
  • Simplify resolution logic, webhook validation, propagation, and CD naming

Changes

  • Constants: Remove AnnotationAutoMNNVL, AnnotationAutoMNNVLEnabled, AnnotationAutoMNNVLDisabled; add AnnotationMNNVLGroupOptOut ("none")
  • Helpers: Introduce groupStatus enum (groupEnrolled/groupWithdrawn/groupAbsent) replacing the two-boolean return from resolveGroupName; simplify ResolveGroupNameHierarchically; use case-sensitive matching per K8s convention
  • Webhook: Remove auto-mnnvl validation and conflict detection; immutability check now only covers mnnvl-group
  • Propagation: propagateMNNVLAnnotations only propagates mnnvl-group (PCS → PCSG)
  • ComputeDomain: generateComputeDomainName always uses {pcs}-{replica}-{group} format (single scheme)
  • Shared utility: Extract generic ValidateAnnotationsImmutability to utils/kubernetes/annotations.go with full test coverage
  • Tests: All unit tests updated for single-annotation semantics

Test plan

  • go build ./... passes
  • All unit tests pass (go test ./...)
  • E2E tests (deferred to follow-up PR)

Made with Cursor

Made with Cursor

@copy-pr-bot

copy-pr-bot Bot commented May 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shmuel-runai shmuel-runai self-assigned this May 3, 2026
… annotation

Remove grove.io/auto-mnnvl; use grove.io/mnnvl-group as the sole
annotation controlling MNNVL enrollment and group assignment.

- Remove AnnotationAutoMNNVL and related constants/helpers
- Add AnnotationMNNVLGroupOptOut ("none") for explicit opt-out
- Introduce groupStatus enum (enrolled/withdrawn/absent) for cleaner resolution
- Simplify webhook validation, PCSG propagation, and CD naming
- Extract generic ValidateAnnotationsImmutability to utils/kubernetes
- Use case-sensitive annotation matching per K8s convention
- Update all unit tests for single-annotation semantics

Co-authored-by: Cursor <cursoragent@cursor.com>
@shmuel-runai shmuel-runai force-pushed the grove-417/single-annotation-refactor branch from f52f0c5 to 8ff06a6 Compare May 3, 2026 16:16

athreesh commented May 3, 2026

Copy link
Copy Markdown

worth fixing: can we reject the legacy grove.io/auto-mnnvl annotation instead of silently ignoring it?

i buy the alpha-breaking-change argument, but silent acceptance is a bad failure mode: a user can submit an old manifest, think MNNVL is enabled, and just get worse performance with no obvious signal. even if we do not support backward compatibility, the webhook should probably fail fast with “use grove.io/mnnvl-group: default instead” so the migration is explicit.

@shmuel-runai

Copy link
Copy Markdown
Contributor Author

@athreesh

worth fixing: can we reject the legacy grove.io/auto-mnnvl annotation instead of silently ignoring it?

i buy the alpha-breaking-change argument, but silent acceptance is a bad failure mode: a user can submit an old manifest, think MNNVL is enabled, and just get worse performance with no obvious signal. even if we do not support backward compatibility, the webhook should probably fail fast with “use grove.io/mnnvl-group: default instead” so the migration is explicit.

I thought about this, however, in phase 1, the legacy grove.io/auto-mnnvl annotation wasn't added by the user; it was added automatically by grove using a webhook.
and the last thing I want for grove it to maintaine a "zombie" code to supprot for a rare case of a API that was depreacted after a month in alpha.

Comment thread operator/internal/mnnvl/computedomain/computedomain_test.go
Comment thread operator/internal/mnnvl/computedomain/computedomain_test.go Outdated
Comment thread operator/internal/mnnvl/constants.go
Comment thread operator/internal/mnnvl/helpers.go
Comment thread operator/internal/mnnvl/helpers.go
Comment thread operator/internal/mnnvl/webhook_test.go
- Reduce Test_generateComputeDomainName from 5 to 2 distinct cases
- Add TestSyncMultipleGroups and TestSyncMultipleGroupsScaleDown for
  multi-group CD reconciliation coverage
- Allow mnnvl-group "none" when feature is disabled (harmless opt-out)
- Add missing "none + disabled -> no error" webhook test case

Co-authored-by: Cursor <cursoragent@cursor.com>
@shmuel-runai shmuel-runai force-pushed the grove-417/single-annotation-refactor branch from 49dde61 to 4f7b760 Compare May 5, 2026 15:52
Comment thread operator/internal/mnnvl/computedomain/computedomain_test.go
@shmuel-runai shmuel-runai merged commit 56f3ea5 into ai-dynamo:main May 6, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants