Discovery for ECS #9310

@rakyll

In an earlier evaluation, ECS discovery was rejected because of the API rate limiting issues described in the discovery section. Today, some ECS users publish Prometheus metrics and scrape them with the CloudWatch Agent's Prometheus scraping capabilities. They configure the agent with a task selection mechanism to shard the load among multiple clusters. Influenced by what these users already do, we think we can tackle the problem in a couple of ways:

  • Asking users to configure the discovery to discover a set of matching tasks from a cluster, caching metadata in memory where possible.
  • Querying the initial data with the ECS API and then relying on ECS events to identify new and terminated tasks.
  • Asking users to run Prometheus as a sidecar in their ECS tasks as a last resort.

Given that we have this functionality in the CW Agent, not having a similar capability in Prometheus confuses ECS users. We would like to fill this gap by contributing an ECS discovery agent to Prometheus, and we want to switch to the discovery mechanism provided here in all our other collection agents (CW Agent, OpenTelemetry Prometheus Receiver, etc.).

Goals

  • Discovery will only discover metric endpoints from a single cluster.
  • We will allow users to filter the tasks by the Cluster Query language and ECS tags.
  • Users should be able to specify ports and metrics path where the Prometheus metrics are published from the task. (See the config for more.)
  • ECS discovery will support both ECS on EC2 and ECS on Fargate.

Config

Once implemented, ECS discovery will be supported in the Prometheus config. The example below will query the cluster to discover ECS tasks/containers matching the given task selectors.

scrape_configs:
  - job_name: ecs-job
    [ metrics_path: <string> ]
    ecs_sd_configs:
      - [ refresh_interval: <string> | default = 720s ]
        [ region: <string> ]
        cluster: <string>
        [ access_key: <string> ]
        [ secret_key: <secret> ]
        [ profile: <string> ]
        [ role_arn: <string> ]
        ports:
          - <int>
        task_selectors:
          - [ service: <string> ]
            [ family: <string> ]
            [ revisions: <int> ]
            [ launch_type: <string> ]
            [ query: <string> ]
            [ tags:
                - <string>: <string> ]
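
To make the task_selectors block more concrete, here is a minimal Python sketch of how selectors might be evaluated. The proposal does not specify the matching semantics, so the sketch assumes that fields within one selector are ANDed, that a task is kept if any selector matches, and that unset fields match everything. The `TaskSelector` and `Task` types and both function names are hypothetical. The `revisions` and `query` fields are left out because their exact meaning is not defined above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskSelector:
    """One entry under task_selectors. Unset fields match anything (assumption)."""
    service: Optional[str] = None
    family: Optional[str] = None
    launch_type: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Task:
    """Hypothetical, simplified view of a discovered ECS task."""
    service: str
    family: str
    launch_type: str
    tags: Dict[str, str]


def selector_matches(sel: TaskSelector, task: Task) -> bool:
    # Every field that is set on the selector must match (assumed AND semantics).
    if sel.service is not None and sel.service != task.service:
        return False
    if sel.family is not None and sel.family != task.family:
        return False
    if sel.launch_type is not None and sel.launch_type.lower() != task.launch_type.lower():
        return False
    # Every selector tag must be present on the task with the same value.
    return all(task.tags.get(k) == v for k, v in sel.tags.items())


def select_tasks(selectors: List[TaskSelector], tasks: List[Task]) -> List[Task]:
    # A task is kept if at least one selector matches it (assumed OR semantics).
    return [t for t in tasks if any(selector_matches(s, t) for s in selectors)]
```

With these assumptions, a selector such as `TaskSelector(family="web-app", tags={"env": "prod"})` keeps only tasks from the `web-app` family that carry the tag `env=prod`.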

Discovery

Discovery is done by periodically polling the ListTasks API. Discovery will only return the ACTIVE tasks.
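
One refresh cycle of this polling model could look roughly like the sketch below. It is not the proposed implementation: the two callables stand in for the ECS ListTasks and DescribeTasks calls and are injected so the code runs without AWS access, and all function names are hypothetical. The sketch relies on two properties of the ECS API: ListTasks is paginated through a next token, and DescribeTasks accepts at most 100 tasks per call.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-ins for the ECS calls, injected so the sketch runs offline.
# list_page(next_token) -> (task_arns, next_token or None)
ListPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]
# describe(task_arns) -> list of task descriptions
Describe = Callable[[List[str]], List[Dict]]

DESCRIBE_BATCH_SIZE = 100  # DescribeTasks accepts at most 100 tasks per call


def list_all_task_arns(list_page: ListPage) -> List[str]:
    """Follow the pagination token until every task ARN has been collected."""
    arns: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(token)
        arns.extend(page)
        if not token:
            return arns


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def refresh(list_page: ListPage, describe: Describe) -> List[Dict]:
    """One discovery cycle: list all task ARNs, then describe them in batches."""
    tasks: List[Dict] = []
    for batch in batches(list_all_task_arns(list_page), DESCRIBE_BATCH_SIZE):
        tasks.extend(describe(batch))
    return tasks
```

A caller would run `refresh` once per `refresh_interval`. Batching keeps the number of describe calls at roughly one per 100 tasks, which matters given the rate limiting concerns mentioned in the description.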

As a future improvement, we will switch to a model where we listen to ECS events to be notified about task starts and terminations. This will allow us to call ListTasks once and then rely on the events for changes.
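
The event-driven model can be sketched as a small reducer over an in-memory task set that was seeded by one initial listing. The sketch assumes events arrive in the shape ECS publishes for task state changes (a `detail-type` of `ECS Task State Change` and a `detail` object carrying `taskArn` and `lastStatus`). The function name and the choice to track only RUNNING and STOPPED are illustrative assumptions, and handling of out-of-order or duplicate events is deliberately left out.

```python
from typing import Dict


def apply_task_event(known: Dict[str, dict], event: dict) -> None:
    """Update the in-memory task set from a single ECS event (illustrative only)."""
    if event.get("detail-type") != "ECS Task State Change":
        return  # not a task event, ignore
    detail = event.get("detail", {})
    arn = detail.get("taskArn")
    if arn is None:
        return
    status = detail.get("lastStatus")
    if status == "RUNNING":
        # New or updated running task: remember its latest description.
        known[arn] = detail
    elif status == "STOPPED":
        # Terminated task: drop it so it is no longer offered as a target.
        known.pop(arn, None)
    # Intermediate states (for example PENDING) are ignored in this sketch.
```

In this model the `known` dictionary would be filled once from the initial API query and then mutated only by events, so the steady-state API call rate no longer grows with the number of tasks.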

Labels

Prometheus discovery can automatically add ECS task/container labels to the scraped metrics. The discovery will add the following labels:

| Label | Source | Type | Description |
| --- | --- | --- | --- |
| __meta_ecs_cluster | ECS Cluster | string | ECS cluster name. |
| __meta_ecs_task_launch_type | ECS Task | string | "ec2" or "fargate". |
| __meta_ecs_task_family | ECS Task | string | ECS task family. |
| __meta_ecs_task_family_revision | ECS Task | string | ECS task family revision. |
| __meta_ecs_task_az | ECS Task | string | Availability zone. |
| __meta_ecs_ec2_instance_id | EC2 | string | EC2 instance id for EC2 launch type. Otherwise "fargate". |
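
As an illustration of how these labels could be derived, the sketch below maps one task description to the label set. It assumes the task is a dictionary shaped like a DescribeTasks result, with `taskDefinitionArn`, `launchType`, and `availabilityZone` fields, and that the cluster name and the EC2 instance id have already been resolved by the caller. The function name `task_labels` is hypothetical.

```python
from typing import Dict, Optional


def task_labels(task: dict, cluster_name: str,
                ec2_instance_id: Optional[str] = None) -> Dict[str, str]:
    """Build the __meta_ecs_* labels for one task (illustrative sketch)."""
    # A task definition ARN ends in "task-definition/<family>:<revision>".
    family_and_revision = task["taskDefinitionArn"].rsplit("/", 1)[-1]
    family, _, revision = family_and_revision.rpartition(":")
    launch_type = task["launchType"].lower()  # "ec2" or "fargate"
    if launch_type == "ec2":
        # The instance id comes from the container instance and EC2 lookups,
        # which the caller is assumed to have done already.
        instance_id = ec2_instance_id or ""
    else:
        instance_id = "fargate"
    return {
        "__meta_ecs_cluster": cluster_name,
        "__meta_ecs_task_launch_type": launch_type,
        "__meta_ecs_task_family": family,
        "__meta_ecs_task_family_revision": revision,
        "__meta_ecs_task_az": task.get("availabilityZone", ""),
        "__meta_ecs_ec2_instance_id": instance_id,
    }
```

Because these are `__meta_` labels, users would typically copy the ones they want onto their series with relabeling rules in the scrape config.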

Authentication & IAM

We will use the default credential provider chain. The following permissions are required:

  • ec2:DescribeInstances
  • ecs:ListTasks
  • ecs:DescribeContainerInstances
  • ecs:DescribeTasks
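
For reference, the permissions above can be expressed as an IAM policy document. The sketch below builds one as a Python dictionary and prints it as JSON. Using `"Resource": "*"` is an assumption made to keep the example short; narrowing it to specific clusters is left to the reader. The function name is hypothetical.

```python
import json

# The four actions listed in this section.
DISCOVERY_ACTIONS = [
    "ec2:DescribeInstances",
    "ecs:ListTasks",
    "ecs:DescribeContainerInstances",
    "ecs:DescribeTasks",
]


def discovery_policy() -> dict:
    """Return an IAM policy document granting the read-only discovery actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": DISCOVERY_ACTIONS,
                "Resource": "*",  # assumption: not scoped to a specific cluster
            }
        ],
    }


print(json.dumps(discovery_policy(), indent=2))
```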
