
refactor(a2a): Simplify Service Discovery to use CRD-based approach #21

Merged
edenreich merged 10 commits into main from claude/issue-19-20250728-1301
Jul 28, 2025


@edenreich
Collaborator

Replace label-based A2A service discovery with CRD-based discovery:

  • Remove LabelSelector field from A2AServiceDiscovery API
  • Add discoverA2AEndpoints() method to find A2A CRDs and services
  • Replace A2A_SERVICE_DISCOVERY_LABEL_SELECTOR with A2A_SERVICE_DISCOVERY_ENDPOINTS
  • Add RBAC permissions for Gateway controller to access A2A resources
  • Update example to use A2A CRD instead of labeled Service
  • Update tests to expect endpoints instead of label selector

Architecture: Gateway discovers A2A CRDs → gets owned Services → provides endpoints
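
The discovery flow in the Architecture line can be sketched roughly as follows. This is a minimal, self-contained illustration and not the operator's actual `discoverA2AEndpoints()` implementation: the `A2A` and `Service` structs are simplified stand-ins for the real Kubernetes types, and the endpoint URL format and the comma-separated value are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// A2A is a hypothetical, simplified stand-in for the A2A custom resource.
type A2A struct {
	Name, Namespace, UID string
}

// Service is a simplified stand-in for a Kubernetes Service.
// OwnerUIDs mirrors the UIDs found in metadata.ownerReferences.
type Service struct {
	Name, Namespace string
	Port            int
	OwnerUIDs       []string
}

// discoverA2AEndpoints returns one URL per Service that is owned by any of
// the given A2A resources. Services without a matching owner are ignored.
func discoverA2AEndpoints(a2as []A2A, services []Service) []string {
	owners := make(map[string]bool, len(a2as))
	for _, a := range a2as {
		owners[a.UID] = true
	}
	endpoints := []string{}
	for _, svc := range services {
		for _, uid := range svc.OwnerUIDs {
			if owners[uid] {
				// Assumed URL format: in-cluster DNS name plus port.
				endpoints = append(endpoints, fmt.Sprintf(
					"http://%s.%s.svc.cluster.local:%d", svc.Name, svc.Namespace, svc.Port))
				break
			}
		}
	}
	// Sort so the resulting value is stable across reconciliations.
	sort.Strings(endpoints)
	return endpoints
}

// endpointsEnvValue joins the endpoints into a single string, assuming the
// A2A_SERVICE_DISCOVERY_ENDPOINTS variable takes a comma-separated list.
func endpointsEnvValue(endpoints []string) string {
	return strings.Join(endpoints, ",")
}

func main() {
	a2as := []A2A{{Name: "agent", Namespace: "default", UID: "uid-1"}}
	services := []Service{
		{Name: "agent", Namespace: "default", Port: 8080, OwnerUIDs: []string{"uid-1"}},
		{Name: "unrelated", Namespace: "default", Port: 80},
	}
	fmt.Println(endpointsEnvValue(discoverA2AEndpoints(a2as, services)))
}
```

Matching on owner references rather than labels means a Service only counts as an A2A endpoint when an A2A resource owns it, so no extra labeling convention is needed. With no A2A resources present, the sketch yields an empty value.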

Resolves: #19

Generated with Claude Code

claude bot and others added 2 commits July 28, 2025 13:11
Replace label-based A2A service discovery with CRD-based discovery:

- Remove LabelSelector field from A2AServiceDiscovery API
- Add discoverA2AEndpoints() method to find A2A CRDs and services
- Replace A2A_SERVICE_DISCOVERY_LABEL_SELECTOR with A2A_SERVICE_DISCOVERY_ENDPOINTS
- Add RBAC permissions for Gateway controller to access A2A resources
- Update example to use A2A CRD instead of labeled Service
- Update tests to expect endpoints instead of label selector

Architecture: Gateway discovers A2A CRDs → gets owned Services → provides endpoints

Resolves: #19

Co-authored-by: Eden Reich <edenreich@users.noreply.github.com>
@edenreich
Collaborator Author

@claude Can you fix the CI ?

@claude
Contributor

claude bot commented Jul 28, 2025

Claude finished @edenreich's task


Fixing CI Issues ✅

I'll analyze and fix the CI failures for this PR.

Todo List:

  • Check CI status and analyze failures
  • Install pre-commit hook
  • Run development workflow to fix issues
  • Push fixes to the PR branch

Analysis:

The "Generate & Verify Clean State" CI job was failing because generated manifests were out of sync. The PR removes the LabelSelector field from A2AServiceDiscovery API, but the generated CRD manifests still contained it.

Solution:

Fixed: Updated generated manifests by running:

  • task generate - Generated DeepCopy methods for API types
  • task manifests - Generated CRDs, RBAC, and installation manifests
  • task fmt - Formatted Go code

Changes:

Committed updated manifests that removed the labelSelector field from:

  • config/crd/bases/core.inference-gateway.com_gateways.yaml
  • manifests/crds.yaml
  • manifests/install.yaml

The CI should now pass as the generated files are in sync with the API changes.
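
To illustrate what a "verify clean state" check is looking for, here is a minimal sketch that compares committed files with freshly regenerated ones and reports any drift. It is an illustration only, not the project's CI job: the `driftedFiles` helper and the file contents are hypothetical, and a real check would typically regenerate the files and inspect the working tree instead.

```go
package main

import (
	"fmt"
	"sort"
)

// driftedFiles returns the sorted paths whose regenerated content differs
// from the committed content, including files present on only one side.
func driftedFiles(committed, regenerated map[string]string) []string {
	seen := make(map[string]bool, len(regenerated))
	var drifted []string
	for path, want := range regenerated {
		seen[path] = true
		if got, ok := committed[path]; !ok || got != want {
			drifted = append(drifted, path)
		}
	}
	for path := range committed {
		if !seen[path] {
			drifted = append(drifted, path)
		}
	}
	sort.Strings(drifted)
	return drifted
}

func main() {
	// Hypothetical contents: the committed CRD still carries labelSelector,
	// while the regenerated one no longer does.
	committed := map[string]string{
		"manifests/crds.yaml":    "endpoints: {}\nlabelSelector: {}\n",
		"manifests/install.yaml": "kind: Namespace\n",
	}
	regenerated := map[string]string{
		"manifests/crds.yaml":    "endpoints: {}\n",
		"manifests/install.yaml": "kind: Namespace\n",
	}
	fmt.Println(driftedFiles(committed, regenerated))
}
```

An empty result corresponds to the clean state the CI job expects. Any listed path is a generated file that needs to be regenerated and committed.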

Commit: 94054da

…ery API changes

Remove labelSelector field from generated CRD manifests to match the API type changes that replaced label-based discovery with CRD-based discovery.

Co-authored-by: Eden Reich <edenreich@users.noreply.github.com>
@edenreich
Collaborator Author

edenreich commented Jul 28, 2025

@claude the tests are failing, can you fix them?

  github.com/inference-gateway/operator/internal/controller.(*GatewayReconciler).Reconcile
  	/home/runner/work/operator/operator/internal/controller/gateway_controller.go:122
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255
  2025-07-28T14:14:43Z	ERROR	Reconciler error	{"controller": "gateway", "controllerGroup": "core.inference-gateway.com", "controllerKind": "Gateway", "Gateway": {"name":"test-gateway-a2a-sd","namespace":"default"}, "namespace": "default", "name": "test-gateway-a2a-sd", "reconcileID": "0cf8bd7a-ebc0-46a3-b2d1-c2dbb4d5e676", "error": "Gateway.core.inference-gateway.com \"test-gateway-a2a-sd\" not found"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:347
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
  	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255
  2025-07-28T14:14:43Z	INFO	creating deployment	{"controller": "gateway", "controllerGroup": "core.inference-gateway.com", "controllerKind": "Gateway", "Gateway": {"name":"test-gateway-a2a-sd-defaults","namespace":"default"}, "namespace": "default", "name": "test-gateway-a2a-sd-defaults", "reconcileID": "71d0b13c-1ba8-4eae-a72d-b44c62e313fa", "Deployment.Name": "test-gateway-a2a-sd-defaults"}
  2025-07-28T14:14:43Z	INFO	creating deployment	{"Deployment.Name": "test-gateway-a2a-sd-defaults"}
  2025-07-28T14:14:43Z	INFO	Creating Service	{"controller": "gateway", "controllerGroup": "core.inference-gateway.com", "controllerKind": "Gateway", "Gateway": {"name":"test-gateway-a2a-sd-defaults","namespace":"default"}, "namespace": "default", "name": "test-gateway-a2a-sd-defaults", "reconcileID": "71d0b13c-1ba8-4eae-a72d-b44c62e313fa", "Service.Name": "test-gateway-a2a-sd-defaults"}
  2025-07-28T14:14:43Z	ERROR	Failed to reconcile Deployment	{"error": "deployments.apps \"test-gateway-a2a-sd-defaults\" already exists"}
  github.com/inference-gateway/operator/internal/controller.(*GatewayReconciler).Reconcile
  	/home/runner/work/operator/operator/internal/controller/gateway_controller.go:98
  github.com/inference-gateway/operator/internal/controller.checkGatewayDeploymentEnvVars
  	/home/runner/work/operator/operator/internal/controller/gateway_controller_test.go:58
  github.com/inference-gateway/operator/internal/controller.init.func2.1.7
  	/home/runner/work/operator/operator/internal/controller/gateway_controller_test.go:579
  github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3
  	/home/runner/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.22.0/internal/node.go:475
  github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3
  	/home/runner/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.22.0/internal/suite.go:894
  [FAILED] in [It] - /home/runner/work/operator/operator/internal/controller/gateway_controller_test.go:61 @ 07/28/25 14:14:43.563
  << Timeline

  [FAILED] Unexpected error:
      <*errors.StatusError | 0xc0007820a0>: 
      deployments.apps "test-gateway-a2a-sd-defaults" already exists
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "deployments.apps \"test-gateway-a2a-sd-defaults\" already exists",
              Reason: "AlreadyExists",
              Details: {
                  Name: "test-gateway-a2a-sd-defaults",
                  Group: "apps",
                  Kind: "deployments",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  occurred

@claude
Contributor

claude bot commented Jul 28, 2025

Claude finished @edenreich's task


I've successfully fixed the failing tests! ✅

Todo List:

  • Install pre-commit hook and analyze test failures
  • Read test files to understand current implementation
  • Fix resource cleanup issues causing "already exists" errors
  • Fix "Gateway not found" errors
  • Update tests for CRD-based discovery approach
  • Run tests to verify fixes
  • Run development workflow (lint, fmt, build)
  • Commit fixes to PR branch

Summary:

Fixed all test failures by addressing resource cleanup issues in the test helper function.

Root Cause:

The tests were failing due to:

  1. Resource cleanup issues: The checkGatewayDeploymentEnvVars helper function was trying to manually delete and recreate deployments, causing timing and race condition issues
  2. Test isolation problems: Deployments from previous test runs weren't being cleaned up properly
  3. "already exists" errors: Manual resource management was conflicting with Kubernetes' controller-runtime garbage collection

Solution:

  1. Simplified test approach: Removed complex manual deployment deletion logic
  2. Let controller handle resources: Allow the Gateway controller to properly manage existing deployments through its reconciliation logic
  3. Fixed timing issues: Use Eventually() for reconciliation calls to handle retries gracefully
  4. Clean imports: Removed the unused errors import

Verification:

  • ✅ All tests now pass (42/42 specs successful in 8.161 seconds)
  • ✅ Code formatting complete (task fmt)
  • ✅ Linter clean (0 issues with task lint)
  • ✅ Build successful (task build)
  • ✅ Test coverage: 66.6%

The A2A service discovery tests now work correctly with the new CRD-based approach, properly handling empty endpoints when no A2A resources exist in the test environment.

Commit: c464f99


claude bot and others added 7 commits July 28, 2025 14:26
- Fix resource cleanup issues in checkGatewayDeploymentEnvVars helper function
- Remove manual deployment deletion logic that was causing timing issues  
- Simplify test approach to let controller handle existing deployments properly
- Remove unused errors import

Fixes test failures:
- "deployments.apps already exists" error
- "Gateway not found" errors during reconciliation
- Test timeouts due to resource cleanup issues

Co-authored-by: Eden Reich <edenreich@users.noreply.github.com>
Signed-off-by: Eden Reich <eden.reich@gmail.com>
Signed-off-by: Eden Reich <eden.reich@gmail.com>
Understood the functionality, makes sense.

Signed-off-by: Eden Reich <eden.reich@gmail.com>
…INTS in gateway controller tests

Signed-off-by: Eden Reich <eden.reich@gmail.com>
…rs function

Signed-off-by: Eden Reich <eden.reich@gmail.com>
…ldA2AService function

Signed-off-by: Eden Reich <eden.reich@gmail.com>
@edenreich edenreich merged commit 121e33b into main Jul 28, 2025
5 checks passed
@edenreich edenreich deleted the claude/issue-19-20250728-1301 branch July 28, 2025 22:02
ig-semantic-release-bot bot added a commit that referenced this pull request Jul 28, 2025
## [0.11.0](v0.10.0...v0.11.0) (2025-07-28)

### ✨ Features

* **a2a:** Add automatic pod restart on configuration changes ([#20](#20)) ([307fee2](307fee2))

### ♻️ Improvements

* **a2a:** Simplify Service Discovery to use CRD-based approach ([#21](#21)) ([121e33b](121e33b)), closes [#19](#19)

### 👷 CI

* Consolidate linting and build processes into a single CI workflow ([#22](#22)) ([f4591e9](f4591e9))

### 📚 Documentation

* **examples:** Add Google provider support ([#23](#23)) ([879e762](879e762)), closes [#17](#17)
@ig-semantic-release-bot
Contributor

🎉 This PR is included in version 0.11.0 🎉

The release is available on the GitHub releases page.

Your semantic-release bot 📦🚀

Development

Successfully merging this pull request may close these issues.

[TASK] Keep Service Discovery simple

1 participant