Skip to content

Azure IPAM: Replace subscription-wide VNet enumeration with targeted subnet queries to eliminate NRP throttling#41555

Merged
julianwiedmann merged 1 commit intocilium:mainfrom
yuecong:azure-ipam-vnet-optimization-main
Jan 29, 2026
Merged

Azure IPAM: Replace subscription-wide VNet enumeration with targeted subnet queries to eliminate NRP throttling#41555
julianwiedmann merged 1 commit intocilium:mainfrom
yuecong:azure-ipam-vnet-optimization-main

Conversation

@yuecong
Copy link
Copy Markdown
Contributor

@yuecong yuecong commented Sep 6, 2025

Summary

This PR implements a targeted subnet discovery optimization for Azure IPAM that eliminates Azure NRP throttling by replacing subscription-wide VNet enumeration with targeted subnet queries.

Key Changes

  • Three-phase subnet discovery strategy:

    1. Query all node instances (existing method)
    2. Extract unique subnet IDs from node network interfaces
    3. Query only the specific subnets that nodes actually use
  • Performance improvements:

    • Reduces API calls from O(n*m) to O(k) where k = unique subnets used by nodes
    • Eliminates subscription-wide VNet enumeration that causes throttling
    • Maintains backward compatibility with fallback to full VNet discovery

Test Results

  • All API tests pass including new parseSubnetID validation tests
  • Azure IPAM functionality preserved with enhanced logging
  • Existing IPAM allocation tests demonstrate continued functionality

Breaking Changes

None - maintains full backward compatibility with existing IPAM behavior.

Performance Impact

Significantly improved performance in large Azure environments with many VNets while maintaining the same IPAM functionality. The optimization shows "targeted_subnets" count in logs for visibility.

Test plan

  • Unit tests for parseSubnetID function with 8 validation scenarios
  • Azure API tests pass
  • Mock implementation updated for GetNodesSubnets
  • IPAM allocation tests demonstrate preserved functionality

I deployed this change to the main branch to our internal clusters and confirmed that this works. Also for the 1.16 version, we have used it in production for more than 2 months in ~100 clusters

Azure IPAM: Optimize subnet discovery to eliminate Azure NRP throttling by using targeted subnet queries instead of subscription-wide VNet enumeration, significantly improving performance in large Azure environments with many VNets.

@yuecong yuecong requested a review from a team as a code owner September 6, 2025 19:52
@maintainer-s-little-helper
Copy link
Copy Markdown

Commit addd599 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Sep 6, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Sep 6, 2025
@maintainer-s-little-helper
Copy link
Copy Markdown

Commits addd599, 6ecbdd0 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

yuecong added a commit to yuecong/cilium that referenced this pull request Sep 6, 2025
…xcept Azure SDK v2

This change ensures perfect alignment between PR cilium#41555 (main branch) and PR cilium#41554 (v1.16 branch):

- Fixed GetNodesSubnets signature to return only SubnetMap (not VNetMap + SubnetMap)
- Updated extractSubnetIDs to use memory-efficient map[string]struct{} instead of map[string]bool
- Aligned error handling in resyncInstances to continue with empty subnets on GetNodesSubnets failure
- Added missing TestExtractSubnetIDs test for deduplication validation
- Maintained three-phase subnet discovery optimization identical to PR cilium#41554

The only differences between the PRs are now Azure SDK version upgrades (v1 → v2).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@maintainer-s-little-helper
Copy link
Copy Markdown

Commits addd599, 6ecbdd0, 172f4ce do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

yuecong added a commit to yuecong/cilium that referenced this pull request Sep 6, 2025
Final alignment between PR cilium#41555 (main branch) and PR cilium#41554 (v1.16 branch):

ALIGNED FEATURES:
- Pre-compiled regex pattern with expectedCaptureGroups constant
- getSubnetWithPagination method for accurate IP configuration counting
- Enhanced parseSubnetID function with proper error handling
- Subscription ID tracking in Client struct
- Proper logrus-based logging in GetNodesSubnets
- Three-phase subnet discovery optimization identical to PR cilium#41554

SDK V2 ADAPTATIONS:
- Uses armnetwork.SubnetsClient instead of network.SubnetsClient
- Simplified pagination logic using SDK v2 built-in capabilities
- Updated ARM response handling for result.Subnet access
- Removed SDK v1-specific Azure API version detection (not applicable)

VERIFICATION:
- TestSubnetDiscovery: ✅ PASS
- TestExtractSubnetIDs: ✅ PASS
- TestParseSubnetID: ✅ PASS
- Azure API mock tests: ✅ PASS

Both PRs now implement identical optimization logic while using appropriate Azure SDK versions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@maintainer-s-little-helper
Copy link
Copy Markdown

Commits addd599, 6ecbdd0, 172f4ce, 0b4a995 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@joestringer
Copy link
Copy Markdown
Member

Thanks for the submission. Apologies for the delay in review, but at this time we cannot accept the PR due to provenance concerns. I think it is in the interests of the project to improve this area of the codebase, but we'll need to figure out the right approach. I'll mark the PR as draft for now.

@joestringer joestringer marked this pull request as draft September 25, 2025 16:45
@tamilmani1989
Copy link
Copy Markdown
Contributor

@yuecong Thanks for your PR. I would like to understand a bit more context on this PR. Please feel free to reach out to me over slack directly, we can discuss a bit more on this. Also fix your commits as its missing signed signature

@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 0b4a995 to 7b739d4 Compare October 17, 2025 23:08
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 17, 2025
@yuecong yuecong marked this pull request as ready for review October 17, 2025 23:22
@tamilmani1989
Copy link
Copy Markdown
Contributor

Talked with @yuecong offline. Overall intention lgtm as it tries to avoid expensive listall virtual networks in a subscription and iterate through all subnets. Instead his PR tries to get subnet info for subnet id already retrieved from getInstances api call. Few things to make sure if this is supported construct from azure and azure sdk will populate subnetID in getInstance API in all cases.. cc @hemanthmalla

@yuecong yuecong requested a review from tamilmani1989 October 21, 2025 17:30
@maintainer-s-little-helper
Copy link
Copy Markdown

Commit e703a97 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 23, 2025
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from e703a97 to b35d525 Compare October 23, 2025 22:52
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 23, 2025
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from b35d525 to d60623b Compare October 23, 2025 22:56
@yuecong yuecong requested a review from tamilmani1989 October 23, 2025 22:57
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from b1258c1 to 6fce46d Compare November 6, 2025 23:40
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 6fce46d to 50f277c Compare November 18, 2025 22:57
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 41fc38a to ca53852 Compare January 24, 2026 01:21
@yuecong
Copy link
Copy Markdown
Contributor Author

yuecong commented Jan 24, 2026

@yuecong It looks like there are a few issues with the linter and go related checks. Could you take a peek? Thank you!

could you help run /test again

@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from ca53852 to bc253a8 Compare January 24, 2026 18:50
@yuecong yuecong requested a review from a team as a code owner January 24, 2026 18:50
@yuecong yuecong requested a review from pippolo84 January 24, 2026 18:51
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from bc253a8 to 5c6ec23 Compare January 25, 2026 17:37
@yuecong yuecong requested a review from a team as a code owner January 25, 2026 17:37
@yuecong yuecong requested a review from rastislavs January 25, 2026 17:37
Copy link
Copy Markdown
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks!

Left a couple of nits and a question inline.

@pippolo84
Copy link
Copy Markdown
Member

/test

@yuecong
Copy link
Copy Markdown
Contributor Author

yuecong commented Jan 26, 2026

/test

The CI failures are unrelated to the Azure IPAM changes in this PR:

  1. cilium-downgrade-netkit-7-concurrent: Kubernetes leader election lease conflict

    • Source: vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:445
    • Known transient issue in K8s leader election
  2. cilium-downgrade-wireguard-3-concurrent: l7-proxy probe timeout

    • Source: pkg/status/status.go:197
    • Timing issue during downgrade test (Envoy slow startup)

Neither error touches Azure IPAM code (pkg/azure/*). could you help re-trigger a test?

@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 5c6ec23 to 65a29d7 Compare January 26, 2026 20:48
@yuecong yuecong requested a review from pippolo84 January 26, 2026 20:49
Copy link
Copy Markdown
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@pippolo84
Copy link
Copy Markdown
Member

/test

@rastislavs
Copy link
Copy Markdown
Contributor

@yuecong looks like you need to run go mod tidy && go mod vendor and submit your changes

@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 65a29d7 to 2bbdc2e Compare January 27, 2026 17:14
This PR implements a targeted subnet discovery optimization for Azure IPAM that eliminates Azure NRP throttling by replacing subscription-wide VNet enumeration with targeted subnet queries.

Key Changes:

- Three-phase subnet discovery strategy:
  1. Query all node instances (existing method)
  2. Extract unique subnet IDs from node network interfaces
  3. Query only the specific subnets that nodes actually use

- Performance improvements:
  - Reduces API calls from O(n*m) to O(k) where k = unique subnets used by nodes
  - Eliminates subscription-wide VNet enumeration that causes throttling
  - Maintains backward compatibility with fallback to full VNet discovery

Test Results:

- All API tests pass including new parseSubnetID validation tests
- Azure IPAM functionality preserved with enhanced logging
- Existing IPAM allocation tests demonstrate continued functionality

Breaking Changes:

None - maintains full backward compatibility with existing IPAM behavior.

Performance Impact:

Significantly improved performance in large Azure environments with many VNets while maintaining the same IPAM functionality. The optimization shows "targeted_subnets" count in logs for visibility.

Test plan:

- Unit tests for parseSubnetID function with 8 validation scenarios
- Azure API tests pass
- Mock implementation updated for GetNodesSubnets
- IPAM allocation tests demonstrate preserved functionality

I deployed this change to the main branch to our internal clusters and confirmed that this works. Also for the 1.16 version, we have used it in production for more than 2 months in ~100 clusters

Signed-off-by: yuecong <cong@databricks.com>
@yuecong yuecong force-pushed the azure-ipam-vnet-optimization-main branch from 2bbdc2e to 712e904 Compare January 29, 2026 00:14
@rastislavs
Copy link
Copy Markdown
Contributor

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 29, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

(I'll prime the PR for merging, but there's still an unresolved comment somewhere. Please make sure that it's addressed)

@yuecong
Copy link
Copy Markdown
Contributor Author

yuecong commented Jan 29, 2026

(I'll prime the PR for merging, but there's still an unresolved comment somewhere. Please make sure that it's addressed)

@tamilmani1989 could you check the 2 comments not resolved if that lgty? if so, can you please resolve those? thanks

@yuecong
Copy link
Copy Markdown
Contributor Author

yuecong commented Jan 29, 2026

The Failed Test

  • Test: K8s Network Policy E2E tests (dual)
  • What it tests: Kubernetes upstream NetworkPolicy conformance tests running on a Kind cluster
    (local Kubernetes, not Azure)
  • Relevance to Azure IPAM: None - these tests don't use Azure at all

can someone help trigger to rerun this test maybe it is just flaky

@julianwiedmann julianwiedmann added this pull request to the merge queue Jan 29, 2026
Merged via the queue into cilium:main with commit 9a46877 Jan 29, 2026
75 of 76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/azure Impacts Azure based IPAM. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants