Skip to content

network: perform post-DNS connection drain on a per-host basis#2240

Merged
goaway merged 5 commits intomainfrom
ms/drain-predicate
May 3, 2022
Merged

network: perform post-DNS connection drain on a per-host basis#2240
goaway merged 5 commits intomainfrom
ms/drain-predicate

Conversation

@goaway
Copy link
Copy Markdown
Contributor

@goaway goaway commented May 3, 2022

Description: Limits connection draining to be based on specific hosts as each host's DNS resolution occurs following a triggered DNS refresh.
Risk Level: Moderate
Testing: Additional Unit Coverage

Signed-off-by: Mike Schore mike.schore@gmail.com

goaway added 3 commits May 3, 2022 18:33
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
pending_drain_ = false;
if (enable_drain_post_dns_refresh_) {
// Check if the set of hosts pending drain contains the current resolved host.
if (hosts_to_drain_.erase(resolved_host) == 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my reading of the reference docs for erase, this line removes the resolved host from the hosts to drain vector and if it was the last host in the vector, we return early? That doesn't seem right to me since for example if we only had a single host, we'd still want to drain its connections. Unless I'm misunderstanding when erase returns zero.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct that this would be wrong if this was a vector, but hosts_to_drain_ is an absl::flat_hash_set.

Here is what is being called:
https://github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_set.h#L232

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to read this is "return if nothing was erased".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks for sharing that. I still haven't come around to appreciating this overall C++ pattern of returning iterators from functions, but this isn't the place to debate that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree working with iterators is a bit of a mental adjustment. That said, in this case, the function is simply returning the number of elements that were deleted as a size_type. :)

goaway added 2 commits May 4, 2022 03:41
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
@goaway goaway marked this pull request as ready for review May 3, 2022 19:50
Copy link
Copy Markdown
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with a comment/question

dns_cache->iterateHostMap(
[&](absl::string_view host,
const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&) {
hosts_to_drain_.emplace(host);
Copy link
Copy Markdown
Contributor

@Augustyniak Augustyniak May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we explicitly empty hosts_to_drain_ before we start adding hosts here? I think that we cannot end up in a state where we have dups in here but what if we had a host in the past that does not exist anymore (not sure whether this can happen)?

Imagine we had DNS refresh 1) that included host x and y and DNS refresh 2) that included y host only. We want hosts_to_drain_ to contain only y after the second DNS refresh starts

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure if hosts can ever cease to exist in Envoy's cache implementation, since at the end of their TTL, I think they're simply refreshed. That said, previous re-resolutions should still exist even if this is triggered again before prior resolutions complete - in other words, I think the race where a host might persist exists irrespective of whether we call clear here.

In the case you describe:
The resolutions from 1) should still exist and be outstanding even if 2) is triggered, and thus we should still observe resolutions for x and y.

Copy link
Copy Markdown
Contributor

@Augustyniak Augustyniak May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that 1) DNS refresh is cancelled and we never receive a completion callback associated with it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is the callback we're listening for should fire regardless of the result state of the resolution. @mattklein123 added it because other callbacks didn't necessarily.

dns_cache->iterateHostMap(
[&](absl::string_view host,
const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&) {
hosts_to_drain_.emplace(host);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we explicitly empty hosts_to_drain_ before we start adding hosts here? I think that we cannot end up in a state where we have dups in here but what if we had a host in the past that does not exist anymore (not sure whether this can happen)?

Imagine we had DNS refresh 1) that included host x and y and DNS refresh 2) that included y host only. We want hosts_to_drain_ to contain only y after the second DNS refresh is initialized

Copy link
Copy Markdown
Contributor Author

@goaway goaway May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hosts_to_drain_ is a set, and should not contain dups. I don't think emptying the set in here or above actually closes the window on the small window for a race where a host ceases to exist between us enumerating hosts and then triggering the refresh.

I think putting a time bound on outstanding resolutions would guard against a potential leak caused by that race, but that adds additional complexity, and given the narrow window, I'm not sure if it's really needed.

dns_cache->iterateHostMap(
[&](absl::string_view host,
const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&) {
hosts_to_drain_.emplace(host);
Copy link
Copy Markdown
Contributor

@Augustyniak Augustyniak May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that 1) DNS refresh is cancelled and we never receive a completion callback associated with it?

@goaway goaway merged commit 45f3fa8 into main May 3, 2022
@goaway goaway deleted the ms/drain-predicate branch May 3, 2022 21:42
jpsim added a commit that referenced this pull request May 5, 2022
…ol--fix-documentation-comment-issues

* origin/main:
  Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234)
  jni: fix mismatched return value types (#2252)
  Remove bintray as a maven source (#2248)
  test: refine connection drain test (#2245)
  cleanup: remove comment in Http::Client that no longer applies (#2203)
  cleanup: fix test with inaccurate description (#2202)
  network: perform post-DNS connection drain on a per-host basis (#2240)
  network: add enableDrainPostDnsRefresh to iOS (#2242)
  envoy: update to em-cherry (#2241)
  network: support draining connections after triggered DNS refresh (#2225)
  Bump rules_apple to 0.34.2 (#2236)
  Bump Lyft Support Rotation (#2232)

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 6, 2022
…on-6.0.0-pre.20220421.3

* origin/main:
  Upgrade rules_jvm_external from 4.1 -> 4.2 (#2254)
  CancelProofEnvoyStream (#2250)
  swift: add DrString tool & fix documentation comment issues (#2233)
  Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234)
  jni: fix mismatched return value types (#2252)
  Remove bintray as a maven source (#2248)
  test: refine connection drain test (#2245)
  cleanup: remove comment in Http::Client that no longer applies (#2203)
  cleanup: fix test with inaccurate description (#2202)
  network: perform post-DNS connection drain on a per-host basis (#2240)

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 6, 2022
* main:
  envoy: update to efbbb04 (#2258)
  Update rules_java, rules_detekt, rules_jvm_external (#2249)
  Upgrade rules_jvm_external from 4.1 -> 4.2 (#2254)
  CancelProofEnvoyStream (#2250)
  swift: add DrString tool & fix documentation comment issues (#2233)
  Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234)
  jni: fix mismatched return value types (#2252)
  Remove bintray as a maven source (#2248)
  test: refine connection drain test (#2245)
  cleanup: remove comment in Http::Client that no longer applies (#2203)
  cleanup: fix test with inaccurate description (#2202)
  network: perform post-DNS connection drain on a per-host basis (#2240)
  network: add enableDrainPostDnsRefresh to iOS (#2242)
  envoy: update to em-cherry (#2241)
  network: support draining connections after triggered DNS refresh (#2225)

Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants