discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery by roidelapluie · Pull Request #17355 · prometheus/prometheus

roidelapluie · 2025-10-17T12:19:26Z

After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery stopped working when using the default AWS credential chain (environment variables, IAM roles, EC2 instance metadata, etc.).

The issue was that the code unconditionally created a StaticCredentialsProvider with empty credentials when access_key and secret_key were not configured. In AWS SDK v2, this causes a "static credentials are empty" error and prevents the SDK from falling back to its default credential chain.

Which issue(s) does the PR fix:

Fixes #17343

Does this PR introduce a user-facing change?

[BUGFIX] discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery

sysadmind

LGTM. I wish we had a good environment to test with. Thanks!

bwplotka

Thanks for jumping in!

Should we merge it to the release branch first? cc @krajorama

bwplotka · 2025-10-17T13:28:46Z

~~Is there a way you could test this out @lyz-code?~~ ah you mentioned delay

…ail discovery

After the upgrade to AWS SDK v2, the EC2 and Lightsail service discovery stopped working when using the default AWS credential chain (environment variables, IAM roles, EC2 instance metadata, etc.).

The issue was that the code unconditionally created a StaticCredentialsProvider with empty credentials when access_key and secret_key were not configured. In AWS SDK v2, this causes a "static credentials are empty" error and prevents the SDK from falling back to its default credential chain.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>

krajorama · 2025-10-17T16:22:39Z

#17343 (comment)

krajorama · 2025-10-17T17:12:50Z

I've built the image from the PR, and I get these errors when running in EKS:

time=2025-10-17T17:09:59.041Z level=DEBUG source=ec2.go:275 msg="Unable to describe availability zones" component="discovery manager scrape" discovery=ec2 config=testsd err="operation error EC2: DescribeAvailabilityZones, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded"
time=2025-10-17T17:10:04.043Z level=ERROR source=refresh.go:71 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, canceled, context deadline exceeded"

With 3.6.0 I didn't get errors. I don't have any resources in my little test env so I didn't see targets either.
With 3.7.1 I got this error:

time=2025-10-17T16:20:11.225Z level=ERROR source=refresh.go:71 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, static credentials are empty"

Config map:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: testsd
        ec2_sd_configs:
          - region: eu-north-1
            refresh_interval: 30s

krajorama · 2025-10-17T17:17:45Z

I think we'll need someone who knows what they are doing to test this. I was just trying to do a dumb smoke test, see #17355 (comment).

cc @sysadmind

sysadmind · 2025-10-17T21:15:02Z

LoadDefaultConfig should do the right thing for anything running in AWS. That seems to be all most projects need. The EBS CSI driver only uses that. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/cloud/cloud.go#L391

I think the only reason we have to do anything custom is because we have our own config options to be passed in.

sysadmind · 2025-10-21T02:39:45Z

OK, I was able to set up an AWS account to do some testing. TL;DR: this PR should resolve the issues.

I created an EC2 instance to run Docker and set up security groups so that I could access the SSH and Prometheus ports. I created the following config file:

scrape_configs:
  - job_name: testsd
    ec2_sd_configs:
      - refresh_interval: 30s
        access_key: aaaaaaa
        secret_key: bbbbbbbb

I ran this against v3.6.0 and everything worked correctly. Then I added an IAM role to the instance and removed the access_key and secret_key from the config file. Prometheus was still able to find the local EC2 instance in SD.

I changed the image to use v3.7.1 and started receiving an error that region was required. I opened #17375 to track this. After adding region to the config, I received the error that the original issue reported:

time=2025-10-21T01:06:42.186Z level=ERROR source=refresh.go:90 msg="Unable to refresh target groups" component="discovery manager scrape" discovery=ec2 config=testsd err="could not describe instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, static credentials are empty"

I built a new Docker image with this PR and uploaded it to my EC2 instance. Running that image, there is no more error and the EC2 instance shows up in SD. I was able to get SD working both with and without the access_key and secret_key values. Unfortunately, I wasn't able to test the role_arn config option.

krajorama

thank you @sysadmind for the tests

roidelapluie changed the title ~~discovery/ec2: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery~~ discovery/aws: Fix AWS SDK v2 credentials handling for EC2 and Lightsail discovery Oct 17, 2025

roidelapluie mentioned this pull request Oct 17, 2025

AWS EC2 scraping not working fine on v3.7.0 #17343

Closed

roidelapluie force-pushed the ec2_discovery branch from db64311 to 4ed0c01 Compare October 17, 2025 12:24

sysadmind approved these changes Oct 17, 2025

bwplotka approved these changes Oct 17, 2025

krajorama changed the base branch from main to release-3.7 October 17, 2025 13:25

krajorama requested review from Nexucis, aknuds1, cstyan, jesusvazquez, juliusv and tomwilkie as code owners October 17, 2025 13:25

krajorama changed the base branch from release-3.7 to main October 17, 2025 13:25

roidelapluie force-pushed the ec2_discovery branch from 4ed0c01 to c40a574 Compare October 17, 2025 13:52

roidelapluie changed the base branch from main to release-3.7 October 17, 2025 13:53

Nexucis removed request for Nexucis, juliusv and tomwilkie October 21, 2025 12:40

sylr approved these changes Oct 21, 2025

krajorama approved these changes Oct 22, 2025

krajorama merged commit 1195563 into prometheus:release-3.7 Oct 22, 2025
28 checks passed

bwplotka mentioned this pull request Oct 28, 2025

feat: Unified AWS Service Discovery #17406

Merged

krajorama merged 1 commit into prometheus:release-3.7 from roidelapluie:ec2_discovery