pkg/azure/api: Add cleanup in case of provision fail in Azure public IP provisioning #43598

Merged: aanm merged 2 commits into cilium:main from DataDog:alex.melhem/azure-public-ip-provision-fail on Jan 14, 2026

41ks (Contributor) commented Jan 7, 2026

Observation

When using public IP assignment with Azure VMSS instances, we have observed a bug that causes instances to fail during provisioning due to exhausted public IP prefixes.

When a public IP prefix is assigned to a node, the prefix is first selected and then the VM update is performed. If multiple nodes are created and a prefix has, for example, only one available IP address, all nodes may see this address as available during the check phase and attempt to acquire it simultaneously.
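
The check-then-update interleaving described above can be illustrated with a small, deterministic simulation. This is only a sketch: the `prefix` type and `simulateRace` function are hypothetical names invented for illustration, and the code models the ideal behavior (extra updates are rejected), not what Azure actually does.

```go
package main

import "fmt"

// prefix is a hypothetical stand-in for an Azure public IP prefix.
type prefix struct {
	name      string
	capacity  int
	allocated int
}

// available reports how many addresses the prefix can still hand out.
func (p *prefix) available() int { return p.capacity - p.allocated }

// simulateRace runs the check phase for every node before any update is
// applied, which is the interleaving described in the PR text. It returns
// the nodes whose update was accepted and those whose update was rejected.
func simulateRace(p *prefix, nodes []string) (accepted, rejected []string) {
	// Check phase: every node observes the same free address.
	var candidates []string
	for _, n := range nodes {
		if p.available() > 0 {
			candidates = append(candidates, n)
		}
	}
	// Update phase: only as many updates as there are free addresses succeed.
	for _, n := range candidates {
		if p.available() > 0 {
			p.allocated++
			accepted = append(accepted, n)
		} else {
			rejected = append(rejected, n)
		}
	}
	return accepted, rejected
}

func main() {
	p := &prefix{name: "prefix-a", capacity: 2, allocated: 1}
	accepted, rejected := simulateRace(p, []string{"node-1", "node-2", "node-3"})
	fmt.Println("accepted:", accepted)
	fmt.Println("rejected:", rejected)
}
```

All three nodes pass the check phase, but only one update can be honored; the other two would need to be retried against a different prefix.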

This scenario should be handled by Azure: only one configuration should be accepted, while the others should be rejected and retried with a different prefix. However, in certain cases, Azure’s API behaves unexpectedly, and we observe the following error message:

--------------------------------------------------------------------------------
RESPONSE 200: 200 OK
ERROR CODE: PublicIpPrefixOutOfIpAddressesForVMScaleSet
--------------------------------------------------------------------------------
{
  "startTime": "...",
  "endTime": "...",
  "status": "Failed",
  "error": {
    "code": "PublicIpPrefixOutOfIpAddressesForVMScaleSet",
    "message": "IpPrefix /subscriptions/.../resourceGroups/.../providers/Microsoft.Network/publicIPPrefixes/... can provide 2 public ips at maximum, 1 of them are in use already. Current available public ip Count for this prefix is 1, which is smaller than required number of public ip count 2 for VMScaleSet on this prefix.",
    "target": "..."
  },
  "name": "..."
}
--------------------------------------------------------------------------------
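
The dump above pairs an HTTP 200 with a body whose status is "Failed", so a caller that looks only at the status code would treat the update as successful. A minimal sketch of checking the body as well, assuming only the JSON fields visible above; `operationResult` and `checkOperation` are hypothetical names and do not describe how the Azure SDK or Cilium handle this response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operationResult mirrors only the fields visible in the response body above.
type operationResult struct {
	Status string `json:"status"`
	Error  *struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// checkOperation returns an error when the body reports a failed operation,
// regardless of the HTTP status code that carried it.
func checkOperation(httpStatus int, body []byte) error {
	if httpStatus < 200 || httpStatus > 299 {
		return fmt.Errorf("unexpected HTTP status %d", httpStatus)
	}
	var res operationResult
	if err := json.Unmarshal(body, &res); err != nil {
		return fmt.Errorf("decoding operation result: %w", err)
	}
	if res.Status == "Failed" {
		if res.Error != nil {
			return fmt.Errorf("operation failed: %s", res.Error.Code)
		}
		return fmt.Errorf("operation failed without error details")
	}
	return nil
}

func main() {
	body := []byte(`{"status":"Failed","error":{"code":"PublicIpPrefixOutOfIpAddressesForVMScaleSet","message":"..."}}`)
	fmt.Println(checkOperation(200, body))
}
```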

The HTTP status code 200 suggests that the update was carried out successfully, which is consistent with what we observe on the VM instance: the IP configuration is registered in Azure, but the VM remains in a ProvisionFailed state.

"publicIPAddressConfiguration": {
  "name": "cilium-managed-public-ip",
  "properties": {
    "idleTimeoutInMinutes": 4,
    "ipTags": [],
    "publicIPAddressVersion": "IPv4",
    "publicIPPrefix": {
      "id": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Network/publicIPPrefixes/..."
    }
  }
}
"statuses": [
  {
    "code": "ProvisioningState/failed/PublicIpPrefixOutOfIpAddressesForVMScaleSet",
    "displayStatus": "Provisioning failed",
    "level": "Error",
    "message": "IpPrefix /subscriptions/.../resourceGroups/.../providers/Microsoft.Network/publicIPPrefixes/... can provide 2 public ips at maximum, 2 of them are in use already. Current available public ip Count for this prefix is 0, which is smaller than required number of public ip count 1 for VMScaleSet on this prefix.",
    "time": "2026-01-05T16:33:52.8945842Z"
  },
  ...
]

Since the current logic for assigning public IPs considers only the configuration state and not the provisioning status, Cilium assumes that the VM already has a public IP assigned and therefore does not attempt to assign a new one. As a result, the VM never reconciles successfully and never actually receives a public IP.
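
The gap between configuration state and provisioning status can be sketched as follows. The types `vmState` and `instanceStatus` and both functions are hypothetical and are not Cilium's actual code; only the status code string is taken from the output shown above.

```go
package main

import "fmt"

// prefixExhaustedCode is the status code reported in the instance view above.
const prefixExhaustedCode = "ProvisioningState/failed/PublicIpPrefixOutOfIpAddressesForVMScaleSet"

// instanceStatus and vmState are hypothetical, simplified views of a VMSS VM.
type instanceStatus struct {
	Code  string
	Level string
}

type vmState struct {
	PublicIPConfigName string // empty when no public IP configuration is set
	Statuses           []instanceStatus
}

// hasPublicIPConfigOnly models a check that looks at configuration alone.
func hasPublicIPConfigOnly(vm vmState) bool {
	return vm.PublicIPConfigName != ""
}

// hasUsablePublicIP also requires that provisioning did not fail because the
// selected prefix was exhausted.
func hasUsablePublicIP(vm vmState) bool {
	if vm.PublicIPConfigName == "" {
		return false
	}
	for _, s := range vm.Statuses {
		if s.Code == prefixExhaustedCode {
			return false
		}
	}
	return true
}

func main() {
	broken := vmState{
		PublicIPConfigName: "cilium-managed-public-ip",
		Statuses:           []instanceStatus{{Code: prefixExhaustedCode, Level: "Error"}},
	}
	fmt.Println("config-only check:", hasPublicIPConfigOnly(broken))
	fmt.Println("status-aware check:", hasUsablePublicIP(broken))
}
```

For the broken VM, the configuration-only check reports that a public IP is present, while the status-aware check reports that it is not usable.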

How does this fix the problem

To fix the problem, we updated the check logic to be aware of the provisioning status. It now detects when provisioning has failed due to ProvisioningState/failed/PublicIpPrefixOutOfIpAddressesForVMScaleSet. In this case, it deletes the erroneous configuration and then proceeds to reassign a new public IP prefix.
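
The detect, delete, and reassign flow described above might look roughly like this. It is a sketch under stated assumptions: the `publicIPClient` interface, the `reconcile` function, and the in-memory `fakeClient` are all hypothetical and are not the API of Cilium or the Azure SDK. Unlike the Azure behavior reported in this PR, the fake rejects assignments against an exhausted prefix outright.

```go
package main

import (
	"errors"
	"fmt"
)

// publicIPClient is a hypothetical interface used only for this illustration.
type publicIPClient interface {
	HasPublicIPConfig(vm string) bool
	PrefixExhausted(vm string) bool // true if provisioning failed on an exhausted prefix
	DeletePublicIPConfig(vm string) error
	AssignPrefix(vm, prefix string) error
}

// reconcile returns the prefix newly assigned to vm, or "" if nothing changed.
func reconcile(c publicIPClient, vm string, prefixes []string) (string, error) {
	if c.HasPublicIPConfig(vm) {
		if !c.PrefixExhausted(vm) {
			return "", nil // already assigned and healthy
		}
		// Remove the erroneous configuration before picking a new prefix.
		if err := c.DeletePublicIPConfig(vm); err != nil {
			return "", fmt.Errorf("cleaning up failed configuration: %w", err)
		}
	}
	for _, p := range prefixes {
		if err := c.AssignPrefix(vm, p); err == nil {
			return p, nil
		}
	}
	return "", errors.New("no prefix with available addresses")
}

// fakeClient is an in-memory stand-in for the cloud API.
type fakeClient struct {
	config map[string]string // VM -> prefix referenced by its configuration
	failed map[string]bool   // VMs whose provisioning failed
	free   map[string]int    // prefix -> free addresses
}

func (f *fakeClient) HasPublicIPConfig(vm string) bool { _, ok := f.config[vm]; return ok }
func (f *fakeClient) PrefixExhausted(vm string) bool   { return f.failed[vm] }
func (f *fakeClient) DeletePublicIPConfig(vm string) error {
	delete(f.config, vm)
	delete(f.failed, vm)
	return nil
}
func (f *fakeClient) AssignPrefix(vm, prefix string) error {
	if f.free[prefix] == 0 {
		return errors.New("prefix exhausted")
	}
	f.free[prefix]--
	f.config[vm] = prefix
	return nil
}

func main() {
	c := &fakeClient{
		config: map[string]string{"vm-1": "prefix-a"},
		failed: map[string]bool{"vm-1": true},
		free:   map[string]int{"prefix-a": 0, "prefix-b": 1},
	}
	p, err := reconcile(c, "vm-1", []string{"prefix-a", "prefix-b"})
	fmt.Println(p, err)
}
```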

We tested this new version of the VMSS public IP assignment on our clusters and it successfully reconciled broken VMs.

pkg/azure/api: Fixed an issue where public IP assignment would permanently fail on Azure VMSS VMs.

This addresses an issue with Azure where it will accept a public IP configuration even if a prefix is exhausted. This triggers a provisioning failure and Azure fails to reconcile the machine state. This change checks for such cases of provisioning failure and addresses them by deleting the public IP configuration and retrying.

Signed-off-by: Alex Melhem <alex.melhem@datadoghq.com>
41ks requested a review from a team as a code owner on January 7, 2026 11:16
41ks requested a review from tamilmani1989 on January 7, 2026 11:16
maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Jan 7, 2026
github-actions bot added the kind/community-contribution label (This was a contribution made by a community member.) on Jan 7, 2026

antonipp (Contributor) left a comment

Thanks, good find! Generally looks good, just one small comment to make sure we address all failure scenarios

41ks force-pushed the alex.melhem/azure-public-ip-provision-fail branch from 58a45f6 to 23a3866 on January 7, 2026 13:04

antonipp (Contributor) commented Jan 7, 2026

/test

antonipp (Contributor) left a comment

Thanks!

joestringer added the release-note/bug (This PR fixes an issue in a previous release of Cilium.) and needs-backport/1.18 (This PR / issue needs backporting to the v1.18 branch) labels on Jan 8, 2026
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Jan 8, 2026

joestringer (Member) commented

Heads up for @Azure reviewers (@antonipp in this case), you should set the release-note/xxx label and optionally backport labels as part of your review, per the review guide. For backport nomination, see the backport criteria. If in doubt, feel free to raise a discussion point in #development on Slack or in the community meeting.

antonipp (Contributor) commented Jan 9, 2026

Ah, I missed this, thanks for fixing this!

aanm added this pull request to the merge queue on Jan 14, 2026
aanm added the needs-backport/1.19 label (This PR / issue needs backporting to the v1.19 branch) on Jan 14, 2026
Merged via the queue into cilium:main with commit eb8de95 on Jan 14, 2026
75 of 76 checks passed

gandro (Member) commented Jan 15, 2026

Backporter reporting in:

AssignPublicIPAddressesVMSS doesn't seem to exist on the v1.18 branch. It seems to have been added in #42219.

I'm therefore removing the needs-backport/1.18 label. @antonipp if you feel that is a mistake, please say so.

gandro removed the needs-backport/1.18 label (This PR / issue needs backporting to the v1.18 branch) on Jan 15, 2026
gandro mentioned this pull request on Jan 15, 2026 (4 tasks)

gandro (Member) commented Jan 15, 2026

@aanm It also looks like this was already applied to v1.19. Removing the v1.19 label as well.

gandro removed the needs-backport/1.19 label (This PR / issue needs backporting to the v1.19 branch) on Jan 15, 2026
gandro mentioned this pull request on Jan 15, 2026 (3 tasks)
41ks deleted the alex.melhem/azure-public-ip-provision-fail branch on January 26, 2026 10:48
cilium-release-bot bot moved this to Released in cilium v1.19.0 on Feb 3, 2026
Labels

kind/community-contribution: This was a contribution made by a community member.
release-note/bug: This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released
