Skip to content

dns: introduce ResolutionStatus for ResolveCb and fix #9927#10137

Merged
mattklein123 merged 15 commits intoenvoyproxy:masterfrom
junr03:dns-status
Feb 28, 2020
Merged

dns: introduce ResolutionStatus for ResolveCb and fix #9927#10137
mattklein123 merged 15 commits intoenvoyproxy:masterfrom
junr03:dns-status

Conversation

@junr03
Copy link
Copy Markdown
Member

@junr03 junr03 commented Feb 21, 2020

Description: this PR introduces ResolutionStatus to the ResolveCb. This status can be used to determine if DnsResolution was successful or not, and lets the callback target adjust accordingly. It is used to fix #9927. Additionally, it can be used to fix
envoyproxy/envoy-mobile#673 in a subsequent PR.
Risk Level: med. This PR changes the behavior of Strict Dns Cluster. The introduced argument is not used anywhere, so all behavior stays the same.
Testing: Updated tests to pass the new argument to the callback. But behavior stays the same.
Docs Changes: updated config docs and discovery docs.
Release Notes: added a release note.

Fixes #9927

Signed-off-by: Jose Nino jnino@lyft.com

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03 junr03 changed the title introduce status to the callback and maintain prior functionality dns: introduce status to the callback and maintain prior functionality Feb 21, 2020
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Feb 21, 2020

@mattklein123 I finally got time to come back to this. I started by just introducing the new argument for the callback, but leaving all behavior unchanged. I didn't want to conflate this with the two independent changes (don't update hosts on failure, allow different timeouts on failure) needed for the linked issues.

Let me know if you'd rather look at it all together.

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03 junr03 changed the title dns: introduce status to the callback and maintain prior functionality dns: introduce ResolutionStatus for ResolveCb Feb 21, 2020
Jose Nino added 2 commits February 21, 2020 16:19
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is great. Flushing out a few comments from my first pass. Thank you!

/wait

// can be done with this query and initiate a new one.
if (!cancelled_) {
callback_({});
callback_(ResolutionStatus::Failure, {});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you have another status like cancelled/destructing/etc.? This is not quite a failure? I'm not sure it matters to higher layer code but it might?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this when I first introduced the destruction behavior. My first opinion is that higher layers only pivot behavior based on success/failure so I was going to keep it at that level of complexity for this change, and the subsequent PR where I fix #9927.

Then I was going to explore the need to add a Destruction value that allow us to set tighter timers than on "regular failure" when fixing for envoyproxy/envoy-mobile#673. But I am not sure if we need different behavior based on what particular failure scenario the resolver experienced. By the same token, people might want even more granularity in the future?

In short: I did not want to commit to more granular status in this initial PR, and I don't think it is going to be that bad to add more granularity, if desired, in subsequent PRs.

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Feb 25, 2020

@mattklein123 updated. I have updated dns_impl_test for expectations around status in different conditions. I also updated higher level constructs to use status, but I didn't change any behavior yet based on this new information (other than adding more relevant stats where possible). I plan on addressing #10137 (comment) and hence #9927 in a subsequent PR. Let me know if you'd rather see it here.

Edit: fixing #9927 here.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03 junr03 changed the title dns: introduce ResolutionStatus for ResolveCb dns: introduce ResolutionStatus for ResolveCb and fix #9927 Feb 26, 2020
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10137 was synchronize by junr03.

see: more, trace.

Jose Nino added 2 commits February 25, 2020 17:45
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Network::Utility::getAddressWithPort(*(response.front().address_),
Network::Utility::portFromTcpUrl(dns_url_));

if (respect_dns_ttl_ && response.front().ttl_ != std::chrono::seconds(0)) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this for consistency with the strict dns cluster code. But happy to move back up.

tls_.shutdownThread();
}

TEST_F(LogicalDnsParamTest, TtlAsDnsRefreshRate) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding coverage that did not exist.

fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Just a test comment. Thank you!

/wait

Comment on lines +757 to +758
[&](DnsResolver::ResolutionStatus,
std::list<DnsResponse>&& results) -> void {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the best way would be but is there some way to simplify all of thee calls to make sure have a success/failure expectation on all of them? Perhaps some helper function/object that does the resolve, does the expectations, and stores the resolution list?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

folded everything into resolveWithExpectations

Signed-off-by: Jose Nino <jnino@lyft.com>
EXPECT_TRUE(hasAddress(address_list, "201.134.56.7"));
}

TEST_P(DnsImplTest, DnsIpAddressVersion) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved above

Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the awesome test cleanup. 1 question. Thank you!

/wait-any

[&](const std::list<DnsResponse> &
/*results*/) -> void { throw EnvoyException("Envoy exception"); }));
EXPECT_EQ(nullptr,
resolver_->resolve(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: Can't we use the new resolve with expectations everywhere?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I thought it might be overkill given that there were far fewer call sites. But if I'm cleaning up something why not go the entire way :). Now everything (for real) is folded into helper functions.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Feb 27, 2020

Looks like the circle machines have a funny hosts file?

[----------] 1 test from IpVersions/DnsImplTest
[ RUN      ] IpVersions/DnsImplTest.LocalLookup/IPv4
test/common/network/dns_impl_test.cc:470: Failure
Expected equality of these values:
  expected_results
    Which is: { "127.0.0.1" }
  address_as_string_list
    Which is: { "127.0.0.1", "127.0.0.1" }

@mattklein123
Copy link
Copy Markdown
Member

Thanks for the fix. If that's the CI error you are seeing it seems like it must have to do with your change?

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Feb 28, 2020

Yep, definitely has to do with my change. I just wanted to make sure that I was surfacing something latent in circle ci and not an unintended regression. Previously the LocalLookup test only asserted that 127.0.0.1 was contained in the resolved address list, not exact match like now. I opened #10197 to verify. It looks like our circle CI setup ends up with a hosts file with a double entry for localhost. https://app.circleci.com/jobs/github/envoyproxy/envoy/324116.

Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 64d91b2 into envoyproxy:master Feb 28, 2020
rebello95 pushed a commit to envoyproxy/envoy-mobile that referenced this pull request Mar 2, 2020
Description: updates envoy ref to include envoyproxy/envoy#10137. Exposes client builder function to add dns failure refresh rate.
Risk Level: low, new configuration option
Testing: local apps, CI
Docs Changes: created #715 to create a section in the docs for all the builder options.

Fixes #673
rebello95 pushed a commit to envoyproxy/envoy-mobile that referenced this pull request Mar 3, 2020
Description: updates envoy ref to include envoyproxy/envoy#10137. Exposes client builder function to add dns failure refresh rate.
Risk Level: low, new configuration option
Testing: local apps, CI
Docs Changes: created #715 to create a section in the docs for all the builder options.

Fixes #673 
Signed-off-by: Michael Rebello <me@michaelrebello.com>
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: updates envoy ref to include #10137. Exposes client builder function to add dns failure refresh rate.
Risk Level: low, new configuration option
Testing: local apps, CI
Docs Changes: created #715 to create a section in the docs for all the builder options.

Fixes #673

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: updates envoy ref to include #10137. Exposes client builder function to add dns failure refresh rate.
Risk Level: low, new configuration option
Testing: local apps, CI
Docs Changes: created #715 to create a section in the docs for all the builder options.

Fixes #673

Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DNS timeouts cause no healthy upstream

2 participants