redis: fix hostname resolution bug introduced in #19541#20217
redis: fix hostname resolution bug introduced in #19541#20217mattklein123 merged 8 commits intoenvoyproxy:mainfrom huntsman90:fix-redis-hostname-resolution-bug
Conversation
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
removes debug statements added This reverts commit 598e0ea. Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
|
Hmm it makes me uncomfortable this was not caught with tests. I think the reason must be that the only tests you have either only resolve a single host and/or all resolutions take place inline? Can you make sure you have a test that covers > 1 host with resolution not happening inline? I think then you should hit this, at least in ASAN. Thank you. /wait |
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
|
Hmm.. I did have a test earlier, but it was with one primary and two replicas. I just added a test with two primaries and two replicas each. However, I noticed that the new test I added passes even without the business logic change. The logs show that the resolution all take place serially. So, how can I force parallel resolution to reproduce the issue with old code? |
This reverts commit ceea432. Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
|
If your test resolves localhost, it will happen inline. You might need to use mocks and do it in a unit test? This is unfortunately an issue with using localhost for testing DNS. |
| EXPECT_EQ(0U, cluster_->info()->stats().update_failure_.value()); | ||
| } | ||
|
|
||
| TEST_F(RedisClusterTest, AddressAsHostnameMultiplePrimaries) { |
There was a problem hiding this comment.
@mattklein123 this is the unit test I've added. Do you have example of prior art on how to mock non-localhost machine in the response?
There was a problem hiding this comment.
The issue is that if you look at expectResolveDiscovery you call the resolution callback inline with the resolve() call. This has the effect of testing "synchronous" resolution. If you want to test asynchronous resolution and hit this bug you will need to hold the resolve callback handle and call it later, after you begin all the resolutions. Take a look at this test for example:
envoy/test/extensions/common/dynamic_forward_proxy/dns_cache_impl_test.cc
Lines 163 to 180 in f9b05bd
Thanks for working on this test, I think it's important to have coverage here.
/wait
There was a problem hiding this comment.
Thank you for the pointer! I've updated the test to reflect checks for actual asynchronous resolution. I simplified the setup a bit to reduce test complexity. The test now initializes two slots with just primaries, but both their resolutions happen asynchronously. I verified that the test fails on main branch, and reproduces the SEGFAULT that we'd seen in our manual testing.
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com>
…voyproxy#20217) Signed-off-by: Sai Teja Duthuluri <dsaiteja1908@gmail.com> Signed-off-by: kuochunghsu <kuochunghsu@pinterest.com>
Commit Message:
In #19541 , one of the latter commits changed the
hostname_resolution_required_cntfrom a simpleuint64_tto ashared_ptr<uint64_t>. This meant that we didn't need to pass the variable by reference into the capture-list of the DNS resolution lambda. However, this was overlooked, and results in segfaults due toshared_ptrsemantics.Simply passing this by value solves the problem.
Testing:
Unsure why this didn't fail earlier, and what test scenarios would catch it. Happy to add a test as per reviewer suggestions.