test: add TAS e2e test infrastructure and basic tests#348

Merged: Ronkahn21 merged 5 commits into ai-dynamo:main from Ronkahn21:test/tas-e2e-infra on Jan 20, 2026
Conversation

Ronkahn21 (Contributor) commented Jan 18, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR establishes the foundational infrastructure for Topology Aware Scheduling (TAS) e2e tests and includes two basic test scenarios.

Infrastructure:

  • Adds 4-level topology hierarchy setup (zone → block → rack → host)
  • Implements topology label application on k3d cluster nodes
  • Adds KAI Topology verification utilities
  • Adds ClusterTopology verification helpers
  • Updates dependencies to KAI Scheduler v0.13.0-rc1
  • Adds topology-test skaffold profile with TAS configuration
  • Adds Makefile target for selective test execution (TEST_PATTERN support)
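
As a rough illustration of the zone → block → rack → host hierarchy listed above, the sketch below derives topology labels from a worker node's index. It is not the code from this PR: the label keys, the fan-out constants, and the `topologyLabels` helper are hypothetical, chosen only to show how a flat node index can be mapped onto nested domains with 0-based names such as `zone-0`, `block-0`, and `rack-0`.

```go
package main

import "fmt"

// Hypothetical label keys and fan-out values, used only for illustration.
// The constants actually used by the e2e setup may differ.
const (
	zoneLabel  = "example.com/zone"
	blockLabel = "example.com/block"
	rackLabel  = "example.com/rack"

	nodesPerRack  = 2
	racksPerBlock = 2
	blocksPerZone = 2
)

// topologyLabels returns the zone/block/rack labels for a 0-based worker
// node index. The host level gets no label here; it is assumed to be
// identified by the node's own hostname.
func topologyLabels(nodeIndex int) map[string]string {
	rack := nodeIndex / nodesPerRack
	block := rack / racksPerBlock
	zone := block / blocksPerZone
	return map[string]string{
		zoneLabel:  fmt.Sprintf("zone-%d", zone),
		blockLabel: fmt.Sprintf("block-%d", block),
		rackLabel:  fmt.Sprintf("rack-%d", rack),
	}
}

func main() {
	// Print the labels a few sample nodes would receive.
	for _, i := range []int{0, 3, 9} {
		fmt.Println(i, topologyLabels(i))
	}
}
```

Because each level is obtained by integer division of the level below it, every rack falls inside exactly one block and every block inside exactly one zone, which is the property a packing test relies on.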

Tests:

  • TI1 (Topology Infrastructure): Verifies ClusterTopology and KAI Topology CRs are created correctly with proper 4-level hierarchy and node labels
  • BP1 (Basic Pattern): Tests multiple cliques with different topology constraints (rack-level and block-level packing)
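
The packing checks in these tests boil down to one question: did every pod of a clique land on nodes that share the same value for a given topology label? The sketch below shows one way to express that check. It is an assumption-laden illustration, not the PR's verification helper: `packedInDomain`, the label keys, and the plain-map inputs are hypothetical and avoid any Kubernetes client dependency.

```go
package main

import "fmt"

// packedInDomain reports whether every pod is placed on a node carrying the
// same value for labelKey. podNodes maps pod name -> node name; nodeLabels
// maps node name -> that node's labels. An error is returned when a pod's
// node lacks the label, since packing cannot be judged in that case.
func packedInDomain(podNodes map[string]string, nodeLabels map[string]map[string]string, labelKey string) (bool, error) {
	domain := ""
	seen := false
	for pod, node := range podNodes {
		value, ok := nodeLabels[node][labelKey]
		if !ok {
			return false, fmt.Errorf("pod %s: node %s has no label %s", pod, node, labelKey)
		}
		if !seen {
			domain, seen = value, true
			continue
		}
		if value != domain {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Hypothetical cluster: two racks inside a single block.
	nodeLabels := map[string]map[string]string{
		"node-0": {"example.com/block": "block-0", "example.com/rack": "rack-0"},
		"node-1": {"example.com/block": "block-0", "example.com/rack": "rack-0"},
		"node-2": {"example.com/block": "block-0", "example.com/rack": "rack-1"},
	}
	podNodes := map[string]string{"pod-a": "node-0", "pod-b": "node-2"}

	rackPacked, _ := packedInDomain(podNodes, nodeLabels, "example.com/rack")
	blockPacked, _ := packedInDomain(podNodes, nodeLabels, "example.com/block")
	fmt.Println("rack-packed:", rackPacked, "block-packed:", blockPacked)
}
```

In this example the two pods sit in different racks of the same block, so a block-level constraint is satisfied while a rack-level constraint is not.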

This PR is part 1 of 4 in the TAS e2e test suite. Additional test scenarios will be added in follow-up PRs.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Dependencies:

Test Verification:

  • All infrastructure files compile successfully
  • Linter passes with 0 issues
  • Test binary builds successfully with -tags e2e
  • Two foundational tests included: TI1 (infrastructure) and BP1 (basic pattern)

What's Next:

File Summary:

  • Infrastructure: 7 modified files (Makefile, skaffold, dependencies, setup files, workflow)
  • Utilities: 3 new files (conversions, topology verification, KAI topology utilities)
  • Tests: 1 new file with 2 tests + 1 helper function
  • YAMLs: 1 test scenario file

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 18, 2026
Add 5 tests for simple topology constraint scenarios:
- SL1: PCS-only constraint (inherited by children)
- SL2: PCSG-only constraint
- SL3: No topology constraints (baseline)
- PC1: Host-level constraint (strictest packing)
- ZL1: Zone-level constraint

These tests verify constraint behavior at different
resource levels (PCS, PCSG, PCLQ) and topology domains
(zone, rack, host, none).

Builds on PR ai-dynamo#348 (infrastructure).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/e2e/setup/k8s_clusters.go Outdated
Comment thread operator/e2e/tests/setup.go
Comment thread operator/e2e/tests/topology_test.go Outdated
Comment thread operator/e2e/tests/setup.go Outdated
Comment thread operator/e2e/tests/topology_test.go Outdated
Comment thread operator/e2e/tests/topology_test.go Outdated
Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 19, 2026
- Add topology node configuration constants
- Restore cleanup failure marking
- Refactor label verification to use loop and label selector
- Remove redundant conversion wrapper
- Rename BP1 to TAS1 following convention
- Increase node count to 28 to strengthen test

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/e2e/dependencies.yaml
Comment thread operator/e2e/setup/skaffold.go
Comment thread operator/e2e/tests/topology_test.go
Comment thread operator/e2e/setup/topology.go Outdated
Comment thread operator/e2e/utils/topology.go Outdated
gflarity (Contributor) commented

Just a few comments; the duplicate is probably the most important one to fix. I'm leaving these as comments rather than requesting changes, to avoid blocking.

Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 20, 2026
- Remove duplicate WaitForPodsReady function from topology.go
- Update topology_test.go to use canonical WaitForPods
- Add debug logging to filterEnv in skaffold.go
- Extract GetWorkerNodeLabelSelector helper function
- Remove unused time import from topology.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/e2e/setup/topology.go Outdated
Comment thread operator/e2e/tests/topology_test.go Outdated
Comment thread operator/e2e/tests/topology_test.go Outdated
Comment thread operator/e2e/tests/topology_test.go Outdated
gflarity previously approved these changes Jan 20, 2026
shayasoolin previously approved these changes Jan 20, 2026
- Add 4-level topology hierarchy setup (zone/block/rack/host)
- Add KAI Topology verification utilities
- Add topology constraint verification helpers
- Include 2 foundational tests:
  * TI1: Topology infrastructure verification
  * BP1: Multiple cliques with different constraints
- Update dependencies to KAI Scheduler v0.13.0-rc1
- Add Makefile target for selective test execution
- Add topology-test skaffold profile

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Add topology node configuration constants
- Restore cleanup failure marking
- Refactor label verification to use loop and label selector
- Remove redundant conversion wrapper
- Rename BP1 to TAS1 following convention
- Increase node count to 28 to strengthen test

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Move topology constants and functions to dedicated topology.go
- Add GetZoneForNodeIndex() to complete helper function set
- Replace hard-coded topology label strings with constants
- Use label selector constants for worker node filtering

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove duplicate WaitForPodsReady function from topology.go
- Update topology_test.go to use canonical WaitForPods
- Add debug logging to filterEnv in skaffold.go
- Extract GetWorkerNodeLabelSelector helper function
- Remove unused time import from topology.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Change zone/block/rack indices from 1-based to 0-based
- Remove unused scenario names (TI-1, TAS-2) from test comments
- Update log messages to use correct test names (BP-1 → TAS2)
- Update documentation to reflect 0-based indexing

This ensures zone-0, block-0, rack-0 labels with no 1-based indexing.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Ronkahn21 dismissed stale reviews from shayasoolin and gflarity via 1b43b26 January 20, 2026 16:11
Ronkahn21 merged commit 6b23c22 into ai-dynamo:main Jan 20, 2026
7 checks passed
danbar2 pushed a commit to danbar2/grove that referenced this pull request Jan 21, 2026
* test: add TAS e2e test infrastructure and basic tests

- Add 4-level topology hierarchy setup (zone/block/rack/host)
- Add KAI Topology verification utilities
- Add topology constraint verification helpers
- Include 2 foundational tests:
  *  Topology infrastructure verification
  *   Multiple cliques with different constraints
- Update dependencies to KAI Scheduler v0.13.0-rc1
- Add Makefile target for selective test execution
- Add topology-test skaffold profile

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 21, 2026
Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 24, 2026
3 participants