discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery #17355

Merged
krajorama merged 1 commit into prometheus:release-3.7 from roidelapluie:ec2_discovery
Oct 22, 2025

Conversation

@roidelapluie
Member

@roidelapluie roidelapluie commented Oct 17, 2025

After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery stopped working when using the default AWS credential chain (environment variables, IAM roles, EC2 instance metadata, etc.).

The issue was that the code unconditionally created a StaticCredentialsProvider with empty credentials when access_key and secret_key were not configured. In AWS SDK v2, this causes a "static credentials are empty" error and prevents the SDK from falling back to its default credential chain.
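
The failure mode described above can be modelled with a small, self-contained sketch. Every type and function here (`credentials`, `provider`, `staticProvider`, `defaultChain`, `buggyProvider`, `fixedProvider`) is a hypothetical stand-in written for illustration, not the AWS SDK's API and not the code in this PR; only the error text is taken from the report. The actual patch may decide differently, for example on how it treats a config where only one of the two keys is set.

```go
package main

import (
	"errors"
	"fmt"
)

// credentials is a hypothetical stand-in for an SDK credential value.
type credentials struct {
	AccessKey string
	SecretKey string
	Source    string
}

// provider is a hypothetical stand-in for a credentials provider.
type provider interface {
	Retrieve() (credentials, error)
}

// staticProvider models a provider that can only return fixed keys.
type staticProvider struct {
	accessKey string
	secretKey string
}

func (p staticProvider) Retrieve() (credentials, error) {
	if p.accessKey == "" || p.secretKey == "" {
		// Same error text as quoted in this PR.
		return credentials{}, errors.New("static credentials are empty")
	}
	return credentials{AccessKey: p.accessKey, SecretKey: p.secretKey, Source: "static"}, nil
}

// defaultChain models the default credential chain
// (environment variables, IAM roles, instance metadata, ...).
type defaultChain struct{}

func (defaultChain) Retrieve() (credentials, error) {
	return credentials{Source: "default chain"}, nil
}

// buggyProvider always installs the static provider, even with empty keys,
// so the default chain is never consulted.
func buggyProvider(accessKey, secretKey string) provider {
	return staticProvider{accessKey: accessKey, secretKey: secretKey}
}

// fixedProvider installs the static provider only when both keys are set
// and otherwise leaves credential resolution to the default chain.
func fixedProvider(accessKey, secretKey string) provider {
	if accessKey != "" && secretKey != "" {
		return staticProvider{accessKey: accessKey, secretKey: secretKey}
	}
	return defaultChain{}
}

func main() {
	_, err := buggyProvider("", "").Retrieve()
	fmt.Println("buggy:", err)

	creds, err := fixedProvider("", "").Retrieve()
	fmt.Println("fixed:", creds.Source, err)
}
```

In this model, an empty configuration fails under `buggyProvider` and resolves through the default chain under `fixedProvider`, while explicitly configured keys keep working in both.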

Fixes #17343

Which issue(s) does the PR fix:

Does this PR introduce a user-facing change?

[BUGFIX] discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery

@roidelapluie roidelapluie changed the title from "discovery/ec2: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery" to "discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery" Oct 17, 2025
Contributor

@sysadmind sysadmind left a comment

LGTM. I wish we had a good environment to test with. Thanks!

Member

@bwplotka bwplotka left a comment

Thanks for jumping in!

Should we merge it to release branch first? cc @krajorama

@krajorama krajorama changed the base branch from main to release-3.7 October 17, 2025 13:25
@krajorama krajorama changed the base branch from release-3.7 to main October 17, 2025 13:25
@bwplotka
Member

bwplotka commented Oct 17, 2025

Is there a way you could test this out, @lyz-code? Ah, you mentioned a delay.

…ail discovery

After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery
stopped working when using the default AWS credential chain (environment
variables, IAM roles, EC2 instance metadata, etc.).

The issue was that the code unconditionally created a StaticCredentialsProvider
with empty credentials when access_key and secret_key were not configured. In
AWS SDK v2, this causes a "static credentials are empty" error and prevents
the SDK from falling back to its default credential chain.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
@roidelapluie roidelapluie changed the base branch from main to release-3.7 October 17, 2025 13:53
@krajorama
Member

#17343 (comment)

@krajorama
Member

krajorama commented Oct 17, 2025

I built the image from the PR and got these errors when running it in EKS:

time=2025-10-17T17:09:59.041Z level=DEBUG source=ec2.go:275 msg="Unable to describe availability zones" component="discovery manager scrape" discovery=ec2 config=testsd err="operation error EC2: DescribeAvailabilityZones, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded"
time=2025-10-17T17:10:04.043Z level=ERROR source=refresh.go:71 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded"

With 3.6.0 I didn't get errors. I don't have any resources in my little test env so I didn't see targets either.
With 3.7.1 I got this error:

time=2025-10-17T16:20:11.225Z level=ERROR source=refresh.go:71 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, static credentials are empty"

Config map:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: testsd
        ec2_sd_configs:
          - region: eu-north-1
            refresh_interval: 30s

@krajorama
Member

I think we'll need someone who knows what they are doing to test this; I was just trying to do a dumb smoke test, see #17355 (comment).

cc @sysadmind

@sysadmind
Contributor

LoadDefaultConfig should do the right thing for anything running in AWS. That seems to be all most projects need. The EBS CSI driver only uses that. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/cloud.go#L391

I think the only reason we have to do anything custom is because we have our own config options to be passed in.
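
A minimal sketch of that idea: start from whatever the default loader resolves, and add an override only for settings the user actually configured. All names below (`sdConfig`, `loadOptions`, `option`, `withRegion`, `withStaticCredentials`, `overridesFor`, `load`) are hypothetical and exist only for this illustration; they imitate the functional-options style but are not the SDK's or Prometheus's real types.

```go
package main

import "fmt"

// sdConfig is a hypothetical subset of the discovery settings a user may set.
type sdConfig struct {
	Region    string
	AccessKey string
	SecretKey string
}

// loadOptions is a hypothetical stand-in for what a default loader resolves.
type loadOptions struct {
	Region      string
	Credentials string
}

// option mutates loadOptions, in the functional-options style.
type option func(*loadOptions)

func withRegion(region string) option {
	return func(o *loadOptions) { o.Region = region }
}

func withStaticCredentials() option {
	return func(o *loadOptions) { o.Credentials = "static" }
}

// overridesFor returns options only for fields the user actually set,
// so unset fields keep their defaults.
func overridesFor(cfg sdConfig) []option {
	var opts []option
	if cfg.Region != "" {
		opts = append(opts, withRegion(cfg.Region))
	}
	if cfg.AccessKey != "" && cfg.SecretKey != "" {
		opts = append(opts, withStaticCredentials())
	}
	return opts
}

// load starts from placeholder defaults and applies the overrides in order.
func load(opts ...option) loadOptions {
	o := loadOptions{Region: "from environment", Credentials: "default chain"}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", load(overridesFor(sdConfig{})...))
	fmt.Printf("%+v\n", load(overridesFor(sdConfig{Region: "eu-north-1"})...))
}
```

With an empty `sdConfig`, no overrides are produced and the defaults stay untouched, which is the behaviour the comment above expects from a plain default load.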

@sysadmind
Contributor

OK, I was able to set up an AWS account to do some testing. TL;DR: this PR should resolve the issues.

I created an EC2 instance to run docker and set up security groups so that I could access SSH and Prometheus ports. I created the following config file:

scrape_configs:
  - job_name: testsd
    ec2_sd_configs:
      - refresh_interval: 30s
        access_key: aaaaaaa
        secret_key: bbbbbbbb

I ran this against v3.6.0 and everything worked correctly. Then I added an IAM role to the instance and removed the access_key and secret_key from the config file. Prometheus was still able to find the local EC2 instance in SD.

I changed the image to use v3.7.1 and started receiving an error that region was required. I opened #17375 to track this. After adding region to the config, I received the error that the original issue reported:

time=2025-10-21T01:06:42.186Z level=ERROR source=refresh.go:90 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, static credentials are empty"

I built a new docker image with this PR and uploaded it to my EC2 instance. Running that image, there is no more error and the EC2 instance shows up in SD. I was able to get SD working with and without the access_key and secret_key values. Unfortunately I wasn't able to test the role_arn config option.

Member

@krajorama krajorama left a comment

thank you @sysadmind for the tests

@krajorama krajorama merged commit 1195563 into prometheus:release-3.7 Oct 22, 2025
28 checks passed
Development

Successfully merging this pull request may close these issues.

AWS EC2 scraping not working fine on v3.7.0