doc: Add pod-naming and environment-variables docs #355

Merged
nvrohanv merged 12 commits into ai-dynamo:main from nvrohanv:nvrohanv/add-naming-and-service-discovery-docs
Feb 1, 2026

@nvrohanv nvrohanv commented Jan 20, 2026

Contributor

Add Pod Naming and Environment Variables Documentation

Summary

This PR adds comprehensive documentation for Grove's pod naming conventions and environment variables for pod discovery. The documentation is organized into two new user guide sections:

  1. Pod and Resource Naming Conventions (docs/user-guide/02_pod-and-resource-naming-conventions/) - Explains Grove's hierarchical naming scheme and best practices for naming resources
  2. Environment Variables for Pod Discovery (docs/user-guide/03_environment-variables-for-pod-discovery/) - Documents the environment variables Grove injects for pod discovery and coordination and shows examples of how to use them to construct pod fully qualified domain names (FQDN).

Changes

New Documentation Structure

docs/user-guide/02_pod-and-resource-naming-conventions/

  • 01_overview.md - Prerequisites and guide overview
  • 02_naming-conventions.md - Complete guide covering:
    • Hierarchical pod naming patterns for standalone PodCliques and PodCliqueScalingGroups
    • Kubernetes 63-character name length considerations with scaling headroom calculations
    • Best practices for naming resources (short prefixes like pleader/pworker and dleader/dworker)
    • Why PodClique names must be unique within a PodCliqueSet (reflects Grove's philosophy that each PodClique represents a distinct component with a unique role)
    • Worked example planning names for a complex system
  • 03_hands-on-example.md - Deployable multi-node disaggregated inference example demonstrating the naming hierarchy
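The 63-character consideration above can be checked before deploying. The following is a minimal sketch, assuming pod names take the form `<pcs>-<pcs-index>-<pclq>-<suffix>` for standalone PodCliques and `<pcs>-<pcs-index>-<pcsg>-<pcsg-index>-<pclq>-<suffix>` inside a PodCliqueScalingGroup, with a five-character random suffix. Those patterns, the suffix length, and the helper names are assumptions made for illustration and should be checked against the naming guide itself.

```python
K8S_NAME_LIMIT = 63    # length limit of a DNS label, which Kubernetes applies to many names
RANDOM_SUFFIX_LEN = 5  # assumption: length of the random pod-name suffix


def worst_case_pod_name_length(pcs, pclq, max_pcs_replicas,
                               pcsg=None, max_pcsg_replicas=1):
    """Length of the longest pod name under the assumed hierarchical pattern."""
    # The highest index is replicas - 1; its decimal width is what costs characters.
    parts = [pcs, str(max_pcs_replicas - 1)]
    if pcsg is not None:
        parts += [pcsg, str(max_pcsg_replicas - 1)]
    parts += [pclq, "x" * RANDOM_SUFFIX_LEN]
    return len("-".join(parts))


def headroom(pcs, pclq, max_pcs_replicas, pcsg=None, max_pcsg_replicas=1):
    """Characters left before the limit; a negative value means the name is too long."""
    return K8S_NAME_LIMIT - worst_case_pod_name_length(
        pcs, pclq, max_pcs_replicas, pcsg, max_pcsg_replicas
    )
```

Calling `headroom("my-inference-app", "pleader", 10, pcsg="prefill", max_pcsg_replicas=100)` reports how many characters remain if the system scales to 10 PodCliqueSet replicas and 100 scaling-group replicas, which is the kind of headroom calculation short names such as pleader and pworker are meant to protect.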

docs/user-guide/03_environment-variables-for-pod-discovery/

  • 01_overview.md - Prerequisites and guide overview
  • 02_env_var_reference.md - Reference guide covering:
    • Key distinction between pod names (random suffix) and hostnames (deterministic, DNS-resolvable)
    • How Grove automatically creates headless services for each PodCliqueSet replica
    • Complete reference of all Grove-injected environment variables (GROVE_PCS_NAME, GROVE_PCLQ_NAME, GROVE_PCSG_NAME, etc.)
  • 03_hands-on-examples.md - Two worked examples with deployable YAMLs:
    • Standalone PodClique environment variables
    • PCSG with leader-worker pod discovery
  • 04_common-patterns-and-takeaways.md - Practical patterns for constructing FQDNs, finding leaders, discovering peers, and key takeaways
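As an illustration of the FQDN pattern, here is a minimal sketch of how a container might build pod addresses from the injected variables. It assumes that a pod's hostname is `<GROVE_PCLQ_NAME>-<GROVE_PCLQ_POD_INDEX>` and that `GROVE_HEADLESS_SERVICE` holds the DNS name of the headless service. Both are assumptions not confirmed in this PR description, and the function names are hypothetical.

```python
import os


def pod_fqdn(pclq_name, pod_index, headless_service):
    """Join an assumed '<pclq>-<index>' hostname with the headless service domain."""
    return f"{pclq_name}-{pod_index}.{headless_service}"


def own_fqdn(env=os.environ):
    """FQDN of the current pod, built from the Grove-injected variables."""
    return pod_fqdn(
        env["GROVE_PCLQ_NAME"],
        int(env["GROVE_PCLQ_POD_INDEX"]),
        env["GROVE_HEADLESS_SERVICE"],
    )


def peer_fqdns(env, replicas):
    """FQDNs of all pods in this PodClique; the caller supplies the replica count."""
    return [
        pod_fqdn(env["GROVE_PCLQ_NAME"], index, env["GROVE_HEADLESS_SERVICE"])
        for index in range(replicas)
    ]
```

Passing the environment in as a mapping keeps the helpers easy to exercise outside a cluster: a plain dict with the three variables stands in for `os.environ`.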

New Sample Files

  • operator/samples/user-guide/02_pod-and-resource-naming-conventions/multinode-disaggregated-with-frontend.yaml
  • operator/samples/user-guide/03_environment-variables-for-pod-discovery/standalone-env-vars.yaml
  • operator/samples/user-guide/03_environment-variables-for-pod-discovery/pcsg-env-vars.yaml

Reorganized Existing Files

Documentation Directories - Added numeric prefixes for ordering:

  • docs/user-guide/core-concepts/ → docs/user-guide/01_core-concepts/
  • docs/user-guide/pod-and-resource-naming-conventions/ → docs/user-guide/02_pod-and-resource-naming-conventions/
  • docs/user-guide/environment-variables-for-pod-discovery/ → docs/user-guide/03_environment-variables-for-pod-discovery/

Core Concepts Files - Renamed with numeric prefixes:

  • overview.md → 01_overview.md
  • pcs_and_pclq_intro.md → 02_pcs_and_pclq_intro.md
  • pcsg_intro.md → 03_pcsg_intro.md
  • takeaways.md → 04_takeaways.md

Sample YAML Directories - Reorganized to match documentation structure:

  • operator/samples/user-guide/concept-overview/ → operator/samples/user-guide/01_core-concepts/
  • operator/samples/user-guide/pod-and-resource-naming-conventions/ → operator/samples/user-guide/02_pod-and-resource-naming-conventions/
  • operator/samples/user-guide/environment-variables-for-pod-discovery/ → operator/samples/user-guide/03_environment-variables-for-pod-discovery/

All sample YAMLs updated with:

  • Documentation reference comments pointing to new paths
  • Minimal resource requirements (cpu: 10m, memory: 32Mi) to work on a single real node in KIND clusters

Updated References

  • Updated README.md and docs/quickstart.md with links to new documentation structure
  • Updated all cross-references between documentation files
  • Updated all sample YAML documentation comments

Testing

All commands documented in the guides were tested end-to-end:

  • Created KIND cluster with make kind-up FAKE_NODES=40
  • Deployed Grove with make deploy
  • Successfully ran all kubectl apply, kubectl get pods/pclq/pcsg, kubectl logs, and kubectl exec -- nslookup commands
  • Verified environment variables are correctly injected
  • Verified DNS resolution works for pod discovery
  • Verified all internal documentation links resolve correctly

…s in user guide, update resource requirements in samples to work on one real node

Signed-off-by: Rohan Varma <rohanv@nvidia.com>
@nvrohanv nvrohanv requested a review from athreesh January 20, 2026 04:37
@nvrohanv nvrohanv added the area/documentation Categorizes the Issue/PR as related to documentation label Jan 20, 2026

@shayasoolin shayasoolin left a comment

Contributor

LGTM, added one comment

Comment thread docs/user-guide/pod-naming.md Outdated
Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread docs/user-guide/pod-naming.md Outdated
Comment thread docs/user-guide/pod-naming.md Outdated
@gflarity

Contributor

Just a few minor things, otherwise looks good.

Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
@athreesh


Docs are quite long, but that's not a bad thing. @nvrohanv, worth double-checking whether there's any room to consolidate.

Also, keep in mind the 63-character Kubernetes name limit.

otherwise LGTM!

athreesh previously approved these changes Jan 21, 2026

@renormalize renormalize left a comment

Contributor

A few nits, but overall the changes look good to me.

Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread docs/user-guide/environment-variables.md Outdated
Comment thread operator/samples/user-guide/naming-and-env-vars/standalone-env-vars.yaml Outdated
Comment thread docs/user-guide/pod-naming.md Outdated
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
…adability

Signed-off-by: Rohan Varma <rohanv@nvidia.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
Comment thread docs/user-guide/02_pod-and-resource-naming-conventions/03_hands-on-example.md Outdated
@gflarity

gflarity commented Jan 26, 2026

Contributor

For next update:

If I was looking to find the exact names of the various PCS components I'd probably just use labels rather than futz around with constructing a name:

Grove Labels Reference

This document summarizes the Kubernetes labels that Grove sets on pods and resources, enabling programmatic queries via kubectl label selectors.

Source: operator/api/common/labels.go

Grove-Specific Labels (grove.io/)

| Label Key | Description |
|---|---|
| grove.io/podclique | The name of the PodClique the pod belongs to |
| grove.io/podcliqueset-replica-index | The replica index of the PodCliqueSet (0-based) |
| grove.io/podcliquescalinggroup | The name of the PodCliqueScalingGroup (if applicable) |
| grove.io/podcliquescalinggroup-replica-index | The replica index of the PCSG (0-based) |
| grove.io/podgang | The name of the PodGang for gang scheduling |
| grove.io/base-podgang | Base PodGang name for scaled PodGangs (beyond MinAvailable) |
| grove.io/pod-template-hash | Hash of the PodSpec (used for rolling updates) |

Standard Kubernetes Labels (app.kubernetes.io/)

| Label Key | Value | Description |
|---|---|---|
| app.kubernetes.io/part-of | PodCliqueSet name | All resources belonging to a PodCliqueSet share this label |
| app.kubernetes.io/managed-by | grove-operator | Identifies resources managed by Grove |
| app.kubernetes.io/component | varies | Component type (see below) |
| app.kubernetes.io/name | resource name | Name of the specific resource |

Component Values (app.kubernetes.io/component)

| Value | Description |
|---|---|
| pcs-podclique | PodClique owned directly by PodCliqueSet (standalone) |
| pcsg-podclique | PodClique owned by a PodCliqueScalingGroup |
| pcs-podcliquescalinggroup | PodCliqueScalingGroup resource |
| pcs-headless-service | Headless service for a PodCliqueSet replica |
| podgang | PodGang resource |
| pod-role | Role for pods in the PodCliqueSet |
| pod-role-binding | RoleBinding for pods |
| pod-service-account | ServiceAccount for pods |
| pcs-hpa | HorizontalPodAutoscaler for scaling |

Usage Examples

Get All Pods in a PodCliqueSet

kubectl get pods -l app.kubernetes.io/part-of=my-inference-app

Get Pods in a Specific PodClique

kubectl get pods -l grove.io/podclique=prefill-0-pleader

Get Pods in a PodCliqueScalingGroup

kubectl get pods -l grove.io/podcliquescalinggroup=prefill

Get Pods in a Specific PCSG Replica

kubectl get pods -l grove.io/podcliquescalinggroup=prefill,grove.io/podcliquescalinggroup-replica-index=0

Get Pods in a Specific PodCliqueSet Replica

kubectl get pods -l app.kubernetes.io/part-of=my-inference-app,grove.io/podcliqueset-replica-index=0

Get All Grove-Managed Pods

kubectl get pods -l app.kubernetes.io/managed-by=grove-operator

Get All Standalone PodCliques (not in a PCSG)

kubectl get pclq -l app.kubernetes.io/component=pcs-podclique

Get All PodCliques in PodCliqueScalingGroups

kubectl get pclq -l app.kubernetes.io/component=pcsg-podclique
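
The selectors in the examples above all follow the same `key=value[,key=value]` shape, so a script can assemble them from a mapping. A minimal sketch; the helper names are hypothetical, and it only builds the argument list rather than running kubectl:

```python
def label_selector(labels):
    """Build a comma-separated equality selector such as 'k1=v1,k2=v2'."""
    return ",".join(f"{key}={value}" for key, value in labels.items())


def kubectl_get(resource, labels):
    """Return the argv list for `kubectl get <resource> -l <selector>`."""
    return ["kubectl", "get", resource, "-l", label_selector(labels)]
```

The returned list can be handed to `subprocess.run` as-is. Keeping it as a list rather than a single string avoids shell quoting issues with label values.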

Combining Labels with Environment Variables

Grove also injects environment variables into pods for programmatic discovery from within containers. See the Environment Variables documentation for details on:

  • GROVE_PCS_NAME - PodCliqueSet name
  • GROVE_PCS_INDEX - PodCliqueSet replica index
  • GROVE_PCLQ_NAME - PodClique name
  • GROVE_PCSG_NAME - PodCliqueScalingGroup name
  • GROVE_PCSG_INDEX - PodCliqueScalingGroup replica index
  • GROVE_HEADLESS_SERVICE - Headless service for DNS resolution
  • GROVE_PCLQ_POD_INDEX - Pod index within the PodClique

Labels vs Environment Variables:

  • Use labels for external queries (kubectl, monitoring, alerting)
  • Use environment variables for internal pod-to-pod discovery and coordination
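
For the in-container side, the variables listed above can be read once at startup into a typed structure. A minimal sketch, assuming the two PCSG variables are absent for pods in standalone PodCliques (an assumption; the labels table only says "if applicable") and that the index variables hold decimal integers. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GroveEnv:
    pcs_name: str
    pcs_index: int
    pclq_name: str
    pod_index: int
    headless_service: str
    pcsg_name: Optional[str] = None   # assumed unset for standalone PodCliques
    pcsg_index: Optional[int] = None  # assumed unset for standalone PodCliques

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "GroveEnv":
        """Parse the Grove-injected variables from an environment mapping."""
        raw_pcsg_index = env.get("GROVE_PCSG_INDEX")
        return cls(
            pcs_name=env["GROVE_PCS_NAME"],
            pcs_index=int(env["GROVE_PCS_INDEX"]),
            pclq_name=env["GROVE_PCLQ_NAME"],
            pod_index=int(env["GROVE_PCLQ_POD_INDEX"]),
            headless_service=env["GROVE_HEADLESS_SERVICE"],
            pcsg_name=env.get("GROVE_PCSG_NAME"),
            pcsg_index=None if raw_pcsg_index is None else int(raw_pcsg_index),
        )
```

Inside a pod this would be called as `GroveEnv.from_env(os.environ)`. A missing required variable raises `KeyError` immediately, which surfaces a misconfigured pod at startup rather than at first use.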

gflarity previously approved these changes Jan 26, 2026

@gflarity gflarity left a comment

Contributor

I think using the labels I mentioned might make it a lot easier for those who need to find/communicate with different components. So I threw them in there for consideration. Regardless this is a huge improvement 👍

@nvrohanv

Contributor Author

> I think using the labels I mentioned might make it a lot easier for those who need to find/communicate with different components. So I threw them in there for consideration. Regardless this is a huge improvement 👍

Yes, this is great. I'll add this in a separate PR since this one is already quite large.

renormalize previously approved these changes Jan 29, 2026
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
@nvrohanv nvrohanv dismissed stale reviews from renormalize and gflarity via 8a8aed2 January 30, 2026 23:34

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

Thanks for the PR. Looks good. One small update needed.

Comment thread docs/user-guide/02_pod-and-resource-naming-conventions/02_naming-conventions.md Outdated
nvrohanv and others added 2 commits January 30, 2026 18:38
…ng-conventions.md

Co-authored-by: Sanjay Chatterjee <sanjay.chatterjee@gmail.com>
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
@nvrohanv nvrohanv merged commit 6aa0806 into ai-dynamo:main Feb 1, 2026
9 of 11 checks passed
Labels

area/documentation Categorizes the Issue/PR as related to documentation

6 participants