EksCreateNodegroupOperator leaks EKS nodegroup on failure with partial IAM permissions

### Apache Airflow Provider(s)

amazon

### Versions of Apache Airflow Providers

`apache-airflow-providers-amazon==9.20.0`

### Apache Airflow version

main

### Operating System

Debian GNU/Linux 12 (bookworm)

### Deployment

Other

### Deployment details

_No response_

### What happened

When using `EksCreateNodegroupOperator`, a managed nodegroup may be successfully created even when the AWS execution role has **partial EKS permissions**, for example lacking `eks:DescribeNodegroup`.

In this scenario, the operator's `CreateNodegroup` call succeeds, and the nodegroup (and its backing EC2 instances) is created in AWS. Subsequent steps, such as waiting for the nodegroup to become active when `wait_for_completion=True`, then fail due to insufficient permissions.

The Airflow task then fails, but the EKS managed nodegroup remains active in AWS, along with its EC2 instances, resulting in leaked infrastructure and ongoing cost.

This can occur, for example, when the execution role allows `eks:CreateNodegroup` but denies `eks:DescribeNodegroup`, which is required by the waiter used to monitor nodegroup provisioning.


### What you think should happen instead

If the operator fails after successfully creating a nodegroup (for example due to missing `DescribeNodegroup` or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by deleting the nodegroup.

Cleanup should be attempted opportunistically (i.e. only if the nodegroup name is known and the necessary permissions are available), and failure to clean up should not mask or replace the original exception.
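
The intended control flow can be sketched roughly as below. This is a minimal illustration, not the provider's implementation: `create_with_best_effort_cleanup` and its three callables are hypothetical names standing in for the operator's create, wait, and delete steps.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def create_with_best_effort_cleanup(
    create: Callable[[], None],
    wait: Callable[[], None],
    delete: Callable[[], None],
) -> None:
    """Run create, then wait; if wait fails, try delete and re-raise the original error.

    All three callables are hypothetical stand-ins for the operator's steps.
    """
    # A failure here is assumed to mean nothing was created, so no cleanup is attempted.
    create()
    try:
        wait()
    except Exception:
        try:
            delete()
        except Exception:
            # A cleanup failure is only logged; it must not replace the original error.
            log.exception("Best-effort cleanup failed; the resource may need manual deletion")
        # Bare raise re-raises the exception from wait(), not the one from delete().
        raise
```

With this shape, a denied `DescribeNodegroup` still surfaces as the task's failure reason, whether or not the follow-up delete succeeds.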


### How to reproduce

1. Create an IAM role that **allows** `eks:CreateNodegroup` but **denies** `eks:DescribeNodegroup`

2. Configure an AWS connection in Airflow using this role.
   (The connection ID `aws_test_conn` is used for this reproduction.)

3. Create an EKS cluster.
   (The cluster name `airflow-partial-auth-eks` is used for this reproduction.)

4. Create an IAM role for EKS managed nodegroups.
   (The role `AmazonEKSNodeRole` is used for this reproduction.)

5. Use the following DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksCreateNodegroupOperator


with DAG(
    dag_id="eks_partial_auth_nodegroup_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_nodegroup = EksCreateNodegroupOperator(
        task_id="create_nodegroup",
        aws_conn_id="aws_test_conn",
        cluster_name="airflow-partial-auth-eks",
        nodegroup_name="leaky-nodegroup",
        nodegroup_subnets=[
            "subnet-xxxxxxxxxxxxxxxxx",
            "subnet-yyyyyyyyyyyyyyyyy",
        ],
        nodegroup_role_arn="arn:aws:iam::123456789012:role/AmazonEKSNodeRole",
        wait_for_completion=True,  # triggers DescribeNodegroup via waiter
    )
```

6. Trigger the DAG.
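
For step 1, a policy along the following lines may be enough to express the allow/deny split. It is an assumed sketch, not the exact policy used for this report: the `Sid` values and wildcard resources are illustrative, and a real role will likely need further permissions (for example `iam:PassRole` on the node role) before `CreateNodegroup` succeeds.

```python
import json

# Hypothetical policy for the reproduction role; Sids and wildcard resources are assumptions.
partial_eks_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateNodegroup",
            "Effect": "Allow",
            "Action": "eks:CreateNodegroup",
            "Resource": "*",
        },
        {
            # An explicit Deny takes precedence over any Allow attached elsewhere.
            "Sid": "DenyDescribeNodegroup",
            "Effect": "Deny",
            "Action": "eks:DescribeNodegroup",
            "Resource": "*",
        },
    ],
}

# Serialize to the JSON form that IAM expects for a policy document.
policy_json = json.dumps(partial_eks_policy, indent=2)
print(policy_json)
```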

**Observed Result**

The task fails due to missing `eks:DescribeNodegroup` permissions, but the managed nodegroup is successfully created and remains active in AWS. The backing EC2 instances continue running and are not cleaned up automatically.
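
To confirm the leak after the task fails, the cluster can be checked with credentials other than the restricted role. The helper below is a sketch under that assumption; `nodegroup_exists` is a hypothetical name, and `eks_client` is expected to behave like a boto3 EKS client (only `list_nodegroups` is called).

```python
from typing import Any


def nodegroup_exists(eks_client: Any, cluster_name: str, nodegroup_name: str) -> bool:
    """Return True if nodegroup_name is listed in the cluster.

    eks_client is assumed to expose list_nodegroups like a boto3 EKS client.
    """
    token = None
    while True:
        kwargs = {"clusterName": cluster_name}
        if token:
            kwargs["nextToken"] = token
        page = eks_client.list_nodegroups(**kwargs)
        if nodegroup_name in page.get("nodegroups", []):
            return True
        # Results are paginated; keep going until no token is returned.
        token = page.get("nextToken")
        if not token:
            return False
```

For this reproduction, the call would be `nodegroup_exists(boto3.client("eks"), "airflow-partial-auth-eks", "leaky-nodegroup")`, run with credentials that allow `eks:ListNodegroups`.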

### Anything else

This is another instance of an AWS operator leaking resources when execution fails after partial success due to insufficient IAM permissions. Similar failure modes have already been identified across other AWS operators where resources are created successfully but not cleaned up if follow-up steps fail.

Apache Airflow is now introducing best-effort cleanup behavior for multiple AWS operators to address this class of issue. In particular, `EC2CreateInstanceOperator` now attempts cleanup on post-creation failures (PR #60904), and corresponding changes have been proposed for `EmrCreateJobFlowOperator` (PR #61010) and `EcsRunTaskOperator` (#61051).

Given this precedent, applying the same best-effort cleanup pattern to `EksCreateNodegroupOperator` would improve consistency across the Amazon provider's operators, reduce leaked infrastructure, and make operator behavior more predictable in environments with tightly scoped IAM roles.

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


EksCreateNodegroupOperator leaks EKS nodegroup on failure with partial IAM permissions #61142
