Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==9.20.0
Apache Airflow version
main
Operating System
Debian GNU/Linux 12 (bookworm)
Deployment
Other
Deployment details
No response
What happened
When using EksCreateNodegroupOperator, a managed nodegroup may be successfully created even when the AWS execution role has only partial EKS permissions, for example when it lacks eks:DescribeNodegroup.
In this scenario, the operator successfully calls CreateNodegroup, and the nodegroup and its backing EC2 instances are created in AWS. However, subsequent steps, such as waiting for the nodegroup to become active when wait_for_completion=True, fail due to insufficient permissions.
The Airflow task then fails, but the EKS managed nodegroup remains active in AWS, along with its EC2 instances, resulting in leaked infrastructure and ongoing cost.
This can occur, for example, when the execution role allows eks:CreateNodegroup but denies eks:DescribeNodegroup, which is required by the waiter used to monitor nodegroup provisioning.
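For illustration, a policy with this shape could be written as below. This is only a sketch: the Sid values, the wildcard Resource, and the helper name render_policy are assumptions made for the example, and a real role would likely need further permissions (for instance iam:PassRole on the node role) before CreateNodegroup succeeds.

```python
import json

# Hypothetical policy for the reproduction role: creating a nodegroup is
# allowed, but the describe call used by the waiter is explicitly denied.
PARTIAL_EKS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateNodegroup",  # illustrative Sid
            "Effect": "Allow",
            "Action": ["eks:CreateNodegroup"],
            "Resource": "*",  # assumption: not scoped to a specific cluster
        },
        {
            "Sid": "DenyDescribeNodegroup",  # illustrative Sid
            "Effect": "Deny",
            "Action": ["eks:DescribeNodegroup"],
            "Resource": "*",
        },
    ],
}


def render_policy(policy: dict) -> str:
    """Serialize the policy document to JSON text for pasting into IAM."""
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(render_policy(PARTIAL_EKS_POLICY))
```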
What you think should happen instead
If the operator fails after successfully creating a nodegroup (for example due to missing DescribeNodegroup or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by deleting the nodegroup.
Cleanup should be attempted opportunistically (i.e. only if the nodegroup name is known and the necessary permissions are available), and failure to clean up should not mask or replace the original exception.
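A minimal sketch of the proposed behavior, written against a boto3-style EKS client rather than the operator's actual internals. The function name create_nodegroup_with_cleanup and its parameters are hypothetical; the client methods used (create_nodegroup, get_waiter("nodegroup_active"), delete_nodegroup) are the standard boto3 EKS calls.

```python
import logging

log = logging.getLogger(__name__)


def create_nodegroup_with_cleanup(client, cluster_name, nodegroup_name, subnets, node_role_arn):
    """Create a nodegroup, wait for it, and try to delete it if waiting fails.

    `client` is assumed to behave like boto3.client("eks").
    """
    client.create_nodegroup(
        clusterName=cluster_name,
        nodegroupName=nodegroup_name,
        subnets=subnets,
        nodeRole=node_role_arn,
    )
    try:
        # The waiter polls DescribeNodegroup, so it fails if that action is denied.
        client.get_waiter("nodegroup_active").wait(
            clusterName=cluster_name, nodegroupName=nodegroup_name
        )
    except Exception:
        try:
            # Best effort: the role may also lack eks:DeleteNodegroup.
            client.delete_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
        except Exception:
            # Log the cleanup failure, but never let it replace the original error.
            log.exception("Best-effort cleanup of nodegroup %s failed", nodegroup_name)
        raise  # re-raises the original post-creation exception
```

The bare raise runs after the inner handler has finished, so the exception that reaches the caller is the original waiter failure, whether or not the delete call succeeded.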
How to reproduce
- Create an IAM role that allows eks:CreateNodegroup but denies eks:DescribeNodegroup.
- Configure an AWS connection in Airflow using this role.
  (The connection ID aws_test_conn is used for this reproduction.)
- Create an EKS cluster.
  (The cluster name airflow-partial-auth-eks is used for this reproduction.)
- Create an IAM role for EKS managed nodegroups.
  (The role AmazonEKSNodeRole is used for this reproduction.)
- Use the following DAG:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksCreateNodegroupOperator

with DAG(
    dag_id="eks_partial_auth_nodegroup_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_nodegroup = EksCreateNodegroupOperator(
        task_id="create_nodegroup",
        aws_conn_id="aws_test_conn",
        cluster_name="airflow-partial-auth-eks",
        nodegroup_name="leaky-nodegroup",
        nodegroup_subnets=[
            "subnet-xxxxxxxxxxxxxxxxx",
            "subnet-yyyyyyyyyyyyyyyyy",
        ],
        nodegroup_role_arn="arn:aws:iam::123456789012:role/AmazonEKSNodeRole",
        wait_for_completion=True,  # triggers DescribeNodegroup via waiter
    )
- Trigger the DAG.
Expected Result
The task fails due to missing eks:DescribeNodegroup permissions, but the managed nodegroup is successfully created and remains active in AWS. The backing EC2 instances continue running and are not cleaned up automatically.
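One way to confirm the leak after the task fails is to list the cluster's nodegroups with a separate, sufficiently privileged identity. The helper below is a sketch: the name leaked_nodegroups is hypothetical, and it assumes a boto3-style EKS client whose credentials allow eks:ListNodegroups.

```python
def leaked_nodegroups(client, cluster_name, expected_names):
    """Return the names from `expected_names` that still exist in the cluster.

    `client` is assumed to behave like boto3.client("eks").
    """
    existing = []
    kwargs = {"clusterName": cluster_name}
    while True:
        page = client.list_nodegroups(**kwargs)
        existing.extend(page.get("nodegroups", []))
        token = page.get("nextToken")
        if not token:
            break
        # Follow pagination until the service stops returning a token.
        kwargs["nextToken"] = token
    return sorted(set(existing) & set(expected_names))
```

For the reproduction above, calling leaked_nodegroups(client, "airflow-partial-auth-eks", ["leaky-nodegroup"]) with a real client would be expected to include leaky-nodegroup if the nodegroup was left behind.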
Anything else
This is another instance of an AWS operator leaking resources when execution fails after partial success due to insufficient IAM permissions. Similar failure modes have already been identified across other AWS operators where resources are created successfully but not cleaned up if follow-up steps fail.
Apache Airflow is now introducing best-effort cleanup behavior for multiple AWS operators to address this class of issue. In particular, EC2CreateInstanceOperator now attempts cleanup on post-creation failures (PR #60904), and corresponding changes have been proposed for EMRCreateJobFlowOperator (PR #61010) and EcsRunTaskOperator (#61051).
Given this precedent, applying the same best-effort cleanup pattern to EksCreateNodegroupOperator would improve consistency across AWS operators, reduce leaked infrastructure, and make operator behavior more predictable in environments with tightly scoped IAM roles.