Releases: ai-dynamo/grove
v0.1.0-alpha.10-rc1
First release candidate of the Coherent Update Strategy feature.
v0.1.0-alpha.9
What's Changed
- Refactor/e2e waiter package by @oleg-kushniriov in #526
- Add Annotations field to PodCliqueScalingGroupConfig with MNNVL propa… by @shmuel-runai in #541
- Refactor/e2e unify k8s clients by @oleg-kushniriov in #529
- Upgrade to `kindest/node:v1.34.3` by @renormalize in #544
- fix: initc panic on pod update after termination by @steved in #547
- Refactor/e2e centralize data by @oleg-kushniriov in #546
- feat: add /grove-grep AI agent skill for authoring GREPs by @shayasoolin in #522
- refactor(mnnvl): add group-aware ComputeDomain naming and lifecycle m… by @shmuel-runai in #551
- perf: reduce reconcile CPU on the 1000-pod scale test by @danbar2 in #545
- fix: avoid Docker Hub rate limits using GHCR registry image by @danbar2 in #558
- feat(mnnvl): add group-aware MNNVL injection into PodSpec by @shmuel-runai in #557
- add finalizer when build resource to reduce one patch call to api server by @kangclzjc in #477
- docs: use --version flag for helm install/upgrade by @yankay in #563
- docs: document immutable operator ConfigMap by @yankay in #564
- refactor(mnnvl): remove auto-MNNVL mutating webhook, switch to opt-in by @shmuel-runai in #559
- Update code owner list by @sanjaychatterjee in #569
- multi topology implementation by @enoodle in #496
- docs: add helm template / GitOps installation guidance by @yankay in #572
- refactor(mnnvl): replace two-annotation model with single mnnvl-group annotation by @shmuel-runai in #574
- fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC by @danbar2 in #556
- test(mnnvl): rewrite E2E tests for single-annotation model by @shmuel-runai in #583
- Add repo-managed git hooks for local validation by @enoodle in #555
- feat: add grove-user-guide AI agent skill for authoring user docs by @shmuel-runai in #589
- fix: propagate PodCliqueSet annotations to PodGang by @danbar2 in #573
- bound UpdateProgress status payload by @oleg-kushniriov in #576
- make linear complexity instead of embedded loops by @oleg-kushniriov in #579
- Fix decode test error for operator config defaults by @xianlubird in #581
- Scale Test CI by @shayasoolin in #575
- Fix gang termination when scheduled replicas regress below MinAvailable by @danbar2 in #561
- Fix KWOK stage manifest installation by @shayasoolin in #611
- Retry KWOK Stage CRD wait by @shayasoolin in #612
- Replace scheduler backend global mutable state with well defined interface and dependency injection by @kangclzjc in #512
- fix: PodCliqueSet webhook crashes on unknown scheduler name by @weizhoublue in #613
- docs(mnnvl): update GREP-417 to single-annotation model by @shmuel-runai in #568
- fix: tolerate missing ResourceClaim API during PCLQ reconcile and delete by @SAY-5 in #610
- Refine scale test delete measurements by @shayasoolin in #621
- use lodash func by @oleg-kushniriov in #618
- Use embedded etcd for scale test cluster by @shayasoolin in #624
- Rename ClusterTopology CRD to ClusterTopologyBinding by @enoodle in #616
- change TopologyConstraint.TopologyName to optional by @enoodle in #601
New Contributors
- @steved made their first contribution in #547
- @shayasoolin made their first contribution in #522
- @yankay made their first contribution in #563
- @xianlubird made their first contribution in #581
- @weizhoublue made their first contribution in #613
- @SAY-5 made their first contribution in #610
Full Changelog: v0.1.0-alpha.8...v0.1.0-alpha.9
v0.1.0-alpha.8
What's Changed
- GREP 369 - support multiple ClusterTopology by @enoodle in #413
- Crd upgrader impl by @danbar2 in #497
- [GREP-291] Introduce end-to-end tests for the `OnDelete` update strategy by @renormalize in #469
- feat: Expose pod index as a label to allow user-defined env vars to reference it via Downward API by @julienmancuso in #505
- Add Scheduler Backend framework by @kangclzjc in #293
- fix: use server-side apply in RU e2e tests by @enoodle in #504
- refactor: Use the scheduler backend to implement topology scheduling by @enoodle in #515
- Refactor/e2e reuse k8s clients by @oleg-kushniriov in #502
- [GREP] hierarchical resource sharing by @julienmancuso in #501
- remove utils package. move code to dedicated packages by @oleg-kushniriov in #521
- feat(e2e): scale test ergonomics with pprof and timeline tracking by @Ronkahn21 in #528
- chore: update kai version to non rc v0.14 by @enoodle in #467
- crd installer off by default by @danbar2 in #527
- add cache options for pods managed by grove only by @oleg-kushniriov in #533
- Add GREP-417: Selective MNNVL proposal by @shmuel-runai in #492
- Add mnnvl-group annotation validation and extend webhook to PodClique… by @shmuel-runai in #535
- feat: add hierarchical resource sharing by @julienmancuso in #507
New Contributors
- @enoodle made their first contribution in #413
- @oleg-kushniriov made their first contribution in #502
Full Changelog: v0.1.0-alpha.7...v0.1.0-alpha.8
v0.1.0-alpha.7
What's Changed
- Fix e2e diagnostic log artifact upload path by @gflarity in #392
- test: remaining topology e2e tests by @Ronkahn21 in #383
- Upgrade golang version and k8s deps patch version by @unmarshall in #412
- Split e2e cluster creation by @danbar2 in #371
- add pr validation - issue must exist by @danbar2 in #410
- unskip rolling update test 18 by @danbar2 in #414
- Release ci automation by @danbar2 in #418
- Correct API documentation in `PodCliqueSet` by @renormalize in #432
- bug fix: prevent auto-mnnvl annotation injection on update by @shmuel-runai in #420
- docs: add TAS version requirement for KAI-scheduler by @Ronkahn21 in #440
- GREP-375 add scheduler backend framework by @kangclzjc in #372
- Run make tidy by @sanjaychatterjee in #443
- fix: handle startup-probe-phase pods in rolling update categorization by @gflarity in #435
- Auto-MNNVL: Add autoMNNVL e2e tests and cluster setup scripts by @shmuel-runai in #421
- GREP-291: `OnDelete` update strategy for `PodCliqueSet` by @renormalize in #403
- Fix: let operator self-create webhook TLS secret to avoid GitOps overwrites by @gflarity in #454
- Record Delete in CreateExpectation by @kangclzjc in #458
- [e2e] [mnnvl] resolve flaky e2e test in Config2_UnsupportedButEnabled by @shmuel-runai in #451
- Addition of a `deletionTimestamp` on a `Pod` should trigger a `PodClique` reconciliation by @xulinfei1996 in #433
- fix: add podCliqueScalingGroup concurrentSyncs to Helm chart by @Ronkahn21 in #472
- fix: validate container env var names in PodCliqueSet webhook by @shmuel-runai in #470
- feat: add configurable pprof bind address by @Ronkahn21 in #464
- replace test image to busybox by @kangclzjc in #476
- Add NVIDIA PLC-OSS issue templates by @itayvallach in #481
- [GREP-291] `OnDelete` implementation for standalone `PodClique`s and `PodCliqueScalingGroup`s by @renormalize in #438
- refactor: add infra-manager CLI for cluster lifecycle by @Ronkahn21 in #465
- replace deprecated runner by @ranrubin in #447
- chore: migrate e2e cluster to KWOK nodes for faster CI by @Ronkahn21 in #489
- feat(e2e): add scale test measurement infrastructure by @Ronkahn21 in #484
- ci: setting up DCO validation for external contributors by @saturley-hall in #473
- Add stale issues workflow by @danbar2 in #494
- Remove duplicate and unused issue templates according to updated guidelines by @itayvallach in #486
- Add crd upgrade readme file by @danbar2 in #495
- Add auto MNNVL user guide documentation by @shmuel-runai in #437
New Contributors
- @itayvallach made their first contribution in #481
- @ranrubin made their first contribution in #447
- @saturley-hall made their first contribution in #473
Full Changelog: v0.1.0-alpha.6...v0.1.0-alpha.7
v0.1.0-alpha.6
What's Changed
- fix startup ordering checks to match implementation by @gflarity in #323
- Auto-MNNVL: add MNNVL configuration and startup validation by @shmuel-runai in #346
- only warn about MTU/PMTU issues by @gflarity in #360
- add tests for updateObservedGeneration functions by @gflarity in #354
- add command for creating k3d debug cluster identical to one used in E2E by @gflarity in #361
- retry cluster creation when it fails by @gflarity in #327
- fix typo in SO1 and SO2 tests by @gflarity in #365
- Introduce GREP template and a refactored TAS GREP by @unmarshall in #362
- Danbar/e2e rolling update 10 by @danbar2 in #356
- test: add TAS simple level and constraint tests by @Ronkahn21 in #349
- Auto-MNNVL: Update PCS's webhook support auto-mnnvl by @shmuel-runai in #370
- Auto-MNNVL: add ComputeDomain component for PCS controller by @shmuel-runai in #363
- doc: Add pod-naming and environment-variables docs by @nvrohanv in #355
- Fix tas label by @kangclzjc in #380
- run e2e only if code changed and build pass by @danbar2 in #381
- fix: use patch for topology label replay on replaced nodes by @Ronkahn21 in #384
- Auto-MNNVL: add PodSpec injection for MNNVL resourceClaims by @shmuel-runai in #385
- Gather all operator logs on e2e test failure by @gflarity in #358
- Enable ai-dynamo copy-pr-bot for Grove by @sanjaychatterjee in #382
- Auto-MNNVL: use correct ComputeDomain CRD field paths by @shmuel-runai in #391
- Support external certificate management for webhooks by @gflarity in #344
- Auto-MNNVL: validate annotation values and sync design doc by @shmuel-runai in #386
Full Changelog: v0.1.0-alpha.5...v0.1.0-alpha.6
v0.1.0-alpha.5
What's Changed
- skip patch ObservedGeneration if no change by @xulinfei1996 in #337
- handle clean up failures better by @gflarity in #326
- fix: add PCS topology constraints to scaled PodGangs by @Ronkahn21 in #347
- fix: correct PCSG topology constraint handling for scaled PodGangs by @Ronkahn21 in #357
- test: add TAS e2e test infrastructure and basic tests by @Ronkahn21 in #348
New Contributors
- @xulinfei1996 made their first contribution in #337
Full Changelog: v0.1.0-alpha.4...v0.1.0-alpha.5
v0.1.0-alpha.4
What's Changed
- document internal/utils by @gflarity in #211
- document internal/logger and internal/utils by @gflarity in #208
- document internal/controller by @gflarity in #205
- Changes for migration from @NVIDIA to @ai-dynamo by @renormalize in #225
- Introduce badges in `README.md`. by @renormalize in #227
- Remove Ask DeepWiki badge from README and add Go report badge by @unmarshall in #234
- bump CRD_REF_DOCS_VERSION by @gflarity in #232
- E2E Test Foundations by @gflarity in #207
- Grove proposal/topology by @Ronkahn21 in #224
- api: add Topology aware support by @Ronkahn21 in #235
- Bump github.com/docker/docker from 28.2.2+incompatible to 28.3.3+incompatible in /operator by @dependabot[bot] in #237
- added missed PR feedback by @gflarity in #238
- check for existing cluster and delete if it already exists by @gflarity in #239
- test coverage for internal/logger by @gflarity in #229
- Add Core Concepts Tutorial by @nvrohanv in #217
- Update Grove discord link with a permanent link. by @renormalize in #249
- Bump github.com/containerd/containerd from 1.7.28 to 1.7.29 in /operator by @dependabot[bot] in #253
- Fixed documentation links and formatting in README and installation by @sanjaychatterjee in #250
- test coverage for internal/webhooks by @gflarity in #230
- test coverage for internal/utils by @gflarity in #231
- Disallow reducing `PodCliqueSetTemplateSpec.PodCliqueScalingGroupConfig.Replicas` to `0`. by @renormalize in #256
- add support for prepulling images to speed up tests on slow networks by @gflarity in #241
- Fix indentation in `docs/designs/topology.md`. by @renormalize in #257
- Feat/Topology Configuration Infrastructure by @Ronkahn21 in #247
- Add validation webhook for ClusterTopology resource by @shmuel-runai in #251
- Dependency version upgrades and fixes by @unmarshall in #263
- fix(charts): correct webhook configuration scope and metadata by @shmuel-runai in #258
- docs: update topology configuration naming in design doc by @Ronkahn21 in #266
- Bump golang.org/x/crypto from 0.44.0 to 0.45.0 in /operator by @dependabot[bot] in #268
- e2e tests gang scheduling by @gflarity in #242
- prepend a g to the github hash in package version to avoid semver issues by @gflarity in #272
- Update Go image in `Dockerfile` to `1.25.3`, and other tools. by @renormalize in #254
- E2E tests for startup ordering by @gflarity in #269
- ci: add e2e tests to GitHub Actions by @shmuel-runai in #282
- E2E: Fix flaky Helm installation failures due to "cannot re-use a nam… by @shmuel-runai in #284
- improve test coverage for internal/controller by @gflarity in #252
- Stabilize E2E Tests by @gflarity in #287
- New TAS Design by @Ronkahn21 in #288
- feat: chart add ns by @ls-2018 in #290
- Cleaned and updated indirect go mod deps by @unmarshall in #295
- feat: remove useless branch condition by @ls-2018 in #289
- Feat/create cluster topology and KAI topology by @Ronkahn21 in #298
- Api/add name to topology constraint group by @Ronkahn21 in #299
- add mnnvl requirements GREP file by @danbar2 in #296
- Remove `YEAR` in generated files, adhering to the community convention. by @renormalize in #307
- Added more code owners for Grove by @sanjaychatterjee in #306
- E2e tests rolling updates by @gflarity in #280
- Added new code owner for Grove by @sanjaychatterjee in #308
- E2E stability fixes by @gflarity in #312
- Reconcile PodClique TopologyConstraints by @unmarshall in #302
- Introduce validations for TopologyConstraints in PodCliqueSet by @unmarshall in #317
- MNNVL support design doc by @shmuel-runai in #297
- cancel stale E2E runs by @gflarity in #322
- Fix get selector labels for pod by @gflarity in #318
- E2E Failure Diagnostics by @gflarity in #314
- Fixes for topology aware scheduling validation webhook by @unmarshall in #324
- Fixes TopologyConstraints for scaled PodGangs by @unmarshall in #340
Full Changelog: https://github.com/ai-dynamo/grove/commits/v0.1.0-alpha.4
v0.1.0-alpha.3
What's Changed
- Gflarity/allow kai by @gflarity in #219
- Allow system:kube-controller-manager to update init container secret by @unmarshall in #221
- fixed the owner name to be lower case nvidia by @unmarshall in #222
- Disable Authorizer webhook by default. by @renormalize in #223
Full Changelog: v0.1.0-alpha.2...v0.1.0-alpha.3
v0.1.0-alpha.2
What's Changed
- Add an attribution file for all the licenses used in Grove by @sanjaychatterjee in #189
- Remove LastOperation from CRDs and restructure component operators by @unmarshall in #192
- Remove scheduler development doc by @sanjaychatterjee in #197
- Remove validation which prevents setting NodeSelector on PodSpec by @unmarshall in #203
- Increase Default Value of TerminationDelay by @nvrohanv in #199
- Add unit-tests for initc and improve in-line doc strings by @gflarity in #204
- remove redundancy in initial grove readme paragraphs by @nvrohanv in #213
- Remove deadlock when deploying PCS with ComputeDomain by @unmarshall in #215
- document in internal/webhooks by @gflarity in #210
- Introduce the Authorizer Webhook. by @renormalize in #214
- Rename leftover `*podgangset*.go` to `*podcliqueset*.go` from #186. by @renormalize in #216
New Contributors
Full Changelog: v0.1.0-alpha.1...v0.1.0-alpha.2
v0.1.0-alpha.1
What's Changed
- Skeleton code and scripts for grove operator by @unmarshall in #5
- Adding the operator config api which got overwritten by @unmarshall in #6
- update license files and headers by @dmitsh in #7
- adapted license header to include The Grove Authors by @unmarshall in #8
- Added Dockerfile, Skaffold, Helm Charts and other misc changes by @unmarshall in #13
- allow setting object meta in PodClique; fix typos in types.go by @dmitsh in #14
- Removed PodGang CRD by @unmarshall in #15
- implement podgangset validating webhook by @dmitsh in #9
- Bump github.com/opencontainers/runc from 1.1.13 to 1.1.14 in /scheduler-plugins by @dependabot[bot] in #3
- Bump golang.org/x/crypto from 0.24.0 to 0.31.0 in /scheduler-plugins by @dependabot[bot] in #16
- Small fixes to hack scripts directory by @unmarshall in #20
- API changes and changes to validating webhook by @unmarshall in #23
- simplify PodCliqueSpec by @dmitsh in #24
- Introduced scheduler-api, modified PodGangSet API, re-generated code by @unmarshall in #25
- Bump golang.org/x/net from 0.26.0 to 0.33.0 in /scheduler-plugins by @dependabot[bot] in #26
- update podgangset crd by @dmitsh in #28
- Add Mutating Webhooks for PodGangSet by @ritikasrivastava in #17
- Configuration and deployment of webhooks by @unmarshall in #29
- Sample NIM LLM deployment specs using LWS and Grove by @sanjaychatterjee in #30
- update API by @dmitsh in #31
- implement validation for update operation by @dmitsh in #27
- Adds skeleton reconciler code and minor modifications to API by @dmitsh in #21
- Fixes for defaulting and validating webhooks by @unmarshall in #32
- fixed typos by @dmitsh in #33
- Add Default webhook unit test by @ritikasrivastava in #35
- Introduces miscellaneous changes by @unmarshall in #38
- Fixed helm charts and added default for podclique reconciler by @unmarshall in #39
- Fixes API, controller-runtime manager scheme and charts by @unmarshall in #42
- Bump k8s.io/kubernetes from 1.31.1 to 1.31.6 in /scheduler-plugins by @dependabot[bot] in #34
- implement basic reconciliation loop by @dmitsh in #40
- Bump golang.org/x/net from 0.33.0 to 0.36.0 in /scheduler-plugins by @dependabot[bot] in #46
- Refactor PodGangSet reconciler by @unmarshall in #51
- fixed roles for events and fixed test by @unmarshall in #52
- implement pclq status update by @dmitsh in #48
- move pclq status update to reconciler by @dmitsh in #53
- Added validation for pclq metadata by @unmarshall in #54
- Added changes to the PodGang API spec by @unmarshall in #55
- Update scheduler API by @unmarshall in #56
- Added TerminationDelay to PodGangTemplateSpec by @unmarshall in #58
- Ritika/headlessservice by @ritikasrivastava in #50
- Bump golang.org/x/net from 0.34.0 to 0.36.0 in /operator by @dependabot[bot] in #47
- Renamed PodClique to PodGroup in scheduler-api by @sanjaychatterjee in #61
- Refactoring operator by @unmarshall in #62
- Added ServiceAccount, Role and RoleBinding components by @unmarshall in #63
- Bump golang.org/x/net from 0.34.0 to 0.36.0 in /scheduler-api by @dependabot[bot] in #59
- Add stub functions for Grove scheduler plugin by @sanjaychatterjee in #60
- Fix broken test, upgrade to `golangci-lint@v2.1.1` and fix numerous lint errors, upgrade tool versions, remove `hack/tools.go`, etc. by @renormalize in #64
- Replaced types.NamespacedName with own NamespacedName by @unmarshall in #65
- Corrections to API by @unmarshall in #67
- Bump golang.org/x/net from 0.35.0 to 0.38.0 in /operator by @dependabot[bot] in #66
- Introduced scheduling policy configuration in PodGangTemplateSpec by @unmarshall in #69
- Introduces scheduler policy config and other changes in operator and scheduler API by @unmarshall in #71
- Fix broken build targets, `ld-flags` during docker builds, work-tree state during docker builds, etc. by @renormalize in #70
- Added test utilities and other misc changes by @unmarshall in #74
- Reorg scheduler-plugins dir by @sanjaychatterjee in #75
- Upgraded k8s dependencies by @unmarshall in #76
- Updated operator and scheduler-api by @unmarshall in #77
- Introduced PodCliqueScalingGroup and refactored API modules by @unmarshall in #78
- PodCliqueScalingGroupConfig enhancement and validations by @unmarshall in #79
- Introduce make targets and charts for the development, and deployment of `grove-kube-scheduler`. by @renormalize in #80
- Introduce the `reset-scheduler` target which resets the kube-scheduler running in kind with the default. by @renormalize in #82
- introducing generated scheduler client by @unmarshall in #81
- Corrections in PGS components by @unmarshall in #83
- MinReplicas moves out of AutoScalingConfig to PodCliqueSpec by @unmarshall in #84
- Enhancements to API and reconcilers by @unmarshall in #86
- Allow usage of hyphen in the pgs and pclq names by @unmarshall in #87
- Specify defaults using annotations, use validating functions exposed by `apimachinery`, etc. by @renormalize in #89
- Misc fixes and partial implementation of Pod component by @unmarshall in #90
- Add validation to ensure all `PodSpec`s specify the same `schedulerName`. by @renormalize in #91
- Implement the init container. by @renormalize in #88
- Fix scale-in issues when `PodGangSet` replicas are changed. by @renormalize in #94
- Update Readme by @nvrohanv in #93
- Enhance PGS and PCLQ reconcilers to support PodGang lifecycle management by @unmarshall in #95
- Docs and API updates by @unmarshall in #97
- Updated diagrams and docs by @unmarshall in #99
- Introduce `docs/getting-started.md`. by @renormalize in #100
- Refactor PodGangSet and PodGang APIs by @unmarshall in #101
- Add test and coverage targets to Makefile by @Ronkahn21 in #102
- Modify PodCliqueScalingGroup behavior to create new PodCliques for each replica by @unmarshall in #103
- Fix HPA selector labels for `PodClique`s created by `PodCliqueScalingGroup`s. by @renormalize in #104
- Enable GitHub Actions by @renormalize in #106
- API changes for Gang termination by @unmarshall in #107
- Pod name validation pgs by @Ronkahn21 in #105
- fix: fix example by @julienmancuso in #108
- Correct example in `docs/getting-started.md`. by @renormalize in #110
- Bump k8s.io/kubernetes from 1.33.1 to 1.33.2 in /scheduler by @dependabot[bot] in #92
- Add validation target to Makefile and update build-and-test.yaml by @Ronkahn21 in #111
- Gang Termination by @unmarshall in #114
- Pod discovery env vars by @Ronkahn21 in #119
- update release schedule to reflect dynamo alignment by @nvrohanv in #120
- feat: Add replicas and minAvailable fields for PodCliquesScalingGroups by @julienmancuso in #116
- Integrate `grove-initc...