Skip to content

redis: fix hostname resolution bug introduced in #19541#20217

Merged
mattklein123 merged 8 commits intoenvoyproxy:mainfrom
huntsman90:fix-redis-hostname-resolution-bug
Mar 8, 2022
Merged

redis: fix hostname resolution bug introduced in #19541#20217
mattklein123 merged 8 commits intoenvoyproxy:mainfrom
huntsman90:fix-redis-hostname-resolution-bug

Conversation

@huntsman90
Copy link
Copy Markdown
Contributor

Commit Message:
In #19541 , one of the latter commits changed the hostname_resolution_required_cnt from a simple uint64_t to a shared_ptr<uint64_t>. This meant that we didn't need to pass the variable by reference into the capture-list of the DNS resolution lambda. However, this was overlooked, and results in segfaults due to shared_ptr semantics.

Simply passing this by value solves the problem.

Testing:
Unsure why this didn't fail earlier, and what test scenarios would catch it. Happy to add a test as per reviewer suggestions.

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
removes debug statements added

This reverts commit 598e0ea.

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
@huntsman90 huntsman90 requested a review from mattklein123 as a code owner March 4, 2022 01:25
@mattklein123 mattklein123 self-assigned this Mar 4, 2022
@mattklein123
Copy link
Copy Markdown
Member

Hmm it makes me uncomfortable this was not caught with tests. I think the reason must be that the only tests you have either only resolve a single host and/or all resolutions take place inline? Can you make sure you have a test that covers > 1 host with resolution not happening inline? I think then you should hit this, at least in ASAN. Thank you.

/wait

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
@huntsman90
Copy link
Copy Markdown
Contributor Author

Hmm.. I did have a test earlier, but it was with one primary and two replicas. I just added a test with two primaries and two replicas each. However, I noticed that the new test I added passes even without the business logic change. The logs show that the resolution all take place serially. So, how can I force parallel resolution to reproduce the issue with old code?

This reverts commit ceea432.

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
@mattklein123
Copy link
Copy Markdown
Member

If your test resolves localhost, it will happen inline. You might need to use mocks and do it in a unit test? This is unfortunately an issue with using localhost for testing DNS.

EXPECT_EQ(0U, cluster_->info()->stats().update_failure_.value());
}

TEST_F(RedisClusterTest, AddressAsHostnameMultiplePrimaries) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 this is the unit test I've added. Do you have example of prior art on how to mock non-localhost machine in the response?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that if you look at expectResolveDiscovery you call the resolution callback inline with the resolve() call. This has the effect of testing "synchronous" resolution. If you want to test asynchronous resolution and hit this bug you will need to hold the resolve callback handle and call it later, after you begin all the resolutions. Take a look at this test for example:

EXPECT_CALL(*resolver_, resolve("foo.com", _, _))
.WillOnce(DoAll(SaveArg<2>(&resolve_cb), Return(&resolver_->active_query_)));
auto result = dns_cache_->loadDnsCacheEntry("foo.com", 80, callbacks);
EXPECT_EQ(DnsCache::LoadDnsCacheEntryStatus::Loading, result.status_);
EXPECT_NE(result.handle_, nullptr);
EXPECT_EQ(absl::nullopt, result.host_info_);
checkStats(1 /* attempt */, 0 /* success */, 0 /* failure */, 0 /* address changed */,
1 /* added */, 0 /* removed */, 1 /* num hosts */);
EXPECT_CALL(*timeout_timer, disableTimer());
EXPECT_CALL(update_callbacks_,
onDnsHostAddOrUpdate("foo.com", DnsHostInfoEquals("10.0.0.1:80", "foo.com", false)));
EXPECT_CALL(callbacks,
onLoadDnsCacheComplete(DnsHostInfoEquals("10.0.0.1:80", "foo.com", false)));
EXPECT_CALL(*resolve_timer, enableTimer(std::chrono::milliseconds(dns_ttl_), _));
resolve_cb(Network::DnsResolver::ResolutionStatus::Success,
TestUtility::makeDnsResponse({"10.0.0.1"}));

Thanks for working on this test, I think it's important to have coverage here.

/wait

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pointer! I've updated the test to reflect checks for actual asynchronous resolution. I simplified the setup a bit to reduce test complexity. The test now initializes two slots with just primaries, but both their resolutions happen asynchronously. I verified that the test fails on main branch, and reproduces the SEGFAULT that we'd seen in our manual testing.

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

@mattklein123 mattklein123 merged commit 153c825 into envoyproxy:main Mar 8, 2022
JuniorHsu pushed a commit to JuniorHsu/envoy that referenced this pull request Mar 17, 2022
…voyproxy#20217)

Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: kuochunghsu <kuochunghsu@pinterest.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants