Description
We are tracking down why our active subsets keep growing, without the unused ones being removed. Given the following subset load balancer config:
"lb_subset_config": {
"fallback_policy": "DEFAULT_SUBSET",
"default_subset": {
"default": "true"
},
"subset_selectors": [
{
"keys": [
"default"
]
},
{
"keys": [
"version"
]
},
{
"keys": [
"canary"
]
}
]
}
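For reference, here is a minimal sketch (in Python, with made-up host metadata mirroring our deployment) of how we understand these selectors to carve hosts into subsets: each selector produces one subset per distinct value combination seen in the host metadata.

```python
# Hypothetical host metadata, one dict per host (illustrative only).
hosts = [
    {"default": "true"},
    {"version": "v1"},
    {"version": "v2"},
    {"version": "v3"},
    {"version": "v4"},
    {"canary": "true"},
]

# The selector key sets from the lb_subset_config above.
selectors = [["default"], ["version"], ["canary"]]

# A subset exists for every (keys -> values) combination present in any host.
subsets = set()
for host in hosts:
    for keys in selectors:
        if all(k in host for k in keys):
            subsets.add(tuple((k, host[k]) for k in keys))

print(len(subsets))  # 6 distinct subsets for this metadata
```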
and given that we never have more than 6 possible subsets at a given time:
```
default/true
version/v1
version/v2
version/v3
version/v4
canary/true
```
we end up seeing a higher count of active subsets than we would expect:
```
$ curl -s localhost:9901/stats | grep subset | grep app
cluster.app.lb_subsets_active: 396
cluster.app.lb_subsets_created: 396
cluster.app.lb_subsets_fallback: 1171992
cluster.app.lb_subsets_removed: 0
cluster.app.lb_subsets_selected: 46201647
```
We run with a concurrency of 32, derived via std::thread::hardware_concurrency(). So if there is one load balancer per worker thread and subsets are created independently for each worker, we'd expect at most 32 × 6 = 192 subsets (or 32 × 7 = 224, if there's a duplicate subset per worker for the default subset).
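Spelling out that arithmetic (a trivial sketch; the per-worker duplication of subsets is our assumption about Envoy's threading model, not something we've confirmed in the code):

```python
workers = 32            # concurrency, from std::thread::hardware_concurrency()
subsets_per_worker = 6  # the six possible subsets listed above

# +1 per worker if the default subset is materialized separately.
upper_bound = workers * (subsets_per_worker + 1)

observed = 396          # cluster.app.lb_subsets_active
print(upper_bound, observed, observed > upper_bound)  # 224 396 True
```

Even with the most generous per-worker accounting, 396 active subsets exceeds the expected ceiling.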
Of course, versions keep rotating, but there are never more than four live at the same time. It looks like the old subsets are not being removed.
One important note: our endpoints (i.e. their (IP, port) addresses) remain constant, but their metadata changes (e.g. the version). So our EDS service implementation might serve the following successive versions of its view of the world:
1st EDS snapshot:
```yaml
---
endpoint:
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 8888
metadata:
  filter_metadata:
    envoy.lb:
      version: '1.0'
      stage: 'prod'
```
2nd EDS snapshot (only the version changes):
```yaml
---
endpoint:
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 8888
metadata:
  filter_metadata:
    envoy.lb:
      version: '2.0'
      stage: 'prod'
```
I wrote a test case to check whether host metadata gets updated, and it looks like it doesn't: rgs1@0fd3dd1. I am not sure if this is by design, or whether using EDS this way is related to the subset load balancer leaking inactive subsets.
This could also be related to the fact that endpoints aren't removed immediately while their health checks are still passing, which may mean the new metadata is ignored because the previous version of an endpoint never goes through a removal.
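To make that suspicion concrete, here is a toy model (pure Python, not Envoy code; all names are made up) of a subset map that creates a subset whenever a host update introduces new metadata, but only prunes subsets when a host is removed. If a host's metadata changes in place and the host is never removed, the old subset lingers:

```python
class ToySubsetLB:
    """Toy model: subsets keyed by the (key, value) pairs of host metadata."""

    def __init__(self):
        self.hosts = {}    # (ip, port) -> current metadata dict
        self.subsets = {}  # frozenset of (key, value) -> set of addresses

    def update_host(self, addr, metadata):
        # A subset is created for the *new* metadata...
        self.hosts[addr] = metadata
        key = frozenset(metadata.items())
        self.subsets.setdefault(key, set()).add(addr)
        # ...but nothing prunes the subset built from the *old* metadata,
        # because the host itself was never removed.

    def remove_host(self, addr):
        # Pruning only happens on host removal.
        self.hosts.pop(addr, None)
        for members in self.subsets.values():
            members.discard(addr)
        self.subsets = {k: v for k, v in self.subsets.items() if v}


lb = ToySubsetLB()
lb.update_host(("127.0.0.1", 8888), {"version": "1.0", "stage": "prod"})
lb.update_host(("127.0.0.1", 8888), {"version": "2.0", "stage": "prod"})
print(len(lb.subsets))  # 2: the version-1.0 subset is never removed
```

If Envoy's subset bookkeeping behaves anything like this under in-place metadata churn, it would explain `lb_subsets_removed: 0` alongside an ever-growing `lb_subsets_active`.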
Thoughts?
cc: @zuercher