Fix a SIGSEGV when circuit breaker to upstream policy server is hit, fix a memory leak when client disconnects during upstream policy check, expose mixer client logging to Envoy, and add more mixer client counters. #2107
duderino wants to merge 1 commit into istio:release-1.1 from duderino:jblatt_8224_crash_fix_logging_and_counters
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: duderino. If they are not already assigned, you can assign the PR to them by writing. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing.
The counter changes in this PR will break the istio/istio build. Please merge PR istio/istio#11591 to fix.
kyessenov
left a comment
Left a couple of questions (for my education purposes). Feel free to disregard them.
```
// Whether this quota response is from cache.
bool is_quota_cache_hit{false};
virtual const ::google::protobuf::util::Status& status() const = 0;
```
question: do status and routeDirective need to be virtual?
The check context shared pointer with all the state now implements the check response. Before there was a copy into a result class which included a protobuf copy. What's worse, a virtual call or a protobuf copy? I figured the latter.
Ok. I still think we could avoid it, but that probably requires more refactoring to erase mixerclient/envoy boundary than desired.
I suppose I could create a wrapper object that delegates instead of making a virtual function call, but that would increment shared pointer ref counts. Still, this feels like premature optimization to me. Given the gazillion inefficiencies of the mixer filter, replacing a virtual function call that I introduced to avoid some deep protobuf copies doesn't seem worth it.
```
class AttributesBuilder {
 public:
  AttributesBuilder(RequestContext* request) : request_(request) {}
  AttributesBuilder(istio::mixer::v1::Attributes* attributes)
```
question: can we avoid a raw pointer here? what stops us from using references?
Yes, I'd love to get rid of the raw pointers.
Initially I tried reworking all this code to use the CheckContext shared pointer, but it was a massive diff. So I exercised restraint. And Mixer v2 is coming.
Ok, thanks for explaining. Whether or not v2 is coming soon, this is production code. I am a bit concerned about the loss of context for the lifetime of the parameter, but if we can validate that the filter never outlives the attributes with integration tests under dynamic xDS reloads, it's fine. dynamic_listener test in istio is supposed to test for that, so if it passes, then it's probably OK.
Completely agree on the use of raw pointers. I would love to do this, but I want to minimize this overly large PR and I think the current PR makes things no worse. I have also verified that none of these raw pointers are used beyond the lifetime of the calling function.
So I'll do it if you guys are willing to review all the changes. I think a separate PR on top of this would make that easier. Adding @mandarjog since he's also reviewing this.
@PiotrSikora came up with a similar criticism when we were discussing a breakout of this PR. See #2117
I will do this (and I really would love to!) if you agree to review the resulting changes. I think it would be good to do this in a new PR after #2117 is merged, or else it'll become even larger.
I created a mixer fault test framework using an Envoy-based load generator and upstream server (see #2105). A future PR, which I will link here, will apply that test framework to Mixer. These tests found two issues related to memory management in the MixerClientImpl.

Issue 1: If the upstream client disconnects during an upstream check to the policy server, memory is leaked. Example:

```
Leak of 375808 bytes in 367 objects allocated from:
    @ 1fef7cf google::protobuf::internal::ArenaImpl::NewBlock
    @ 1ff011e google::protobuf::internal::ArenaImpl::SerialArena::AllocateAlignedFallback
    @ 1ff09b6 google::protobuf::internal::ArenaImpl::SerialArena::AllocateAligned
    @ 1fefb78 google::protobuf::internal::ArenaImpl::AllocateAligned
    @ 61544f google::protobuf::Arena::CreateMaybeMessage
    @ 55c835 google::protobuf::MessageLite::CreateMaybeMessage
    @ 55b973 istio::mixer::v1::CheckRequest::mutable_attributes
    @ 55a0aa istio::mixerclient::MixerClientImpl::Check
```

Issue 2: If too many requests are sent to the upstream policy server, a circuit breaker may be hit (by default this limits to 1024 concurrent requests). The MixerClientImpl will free memory prematurely and SIGSEGV with a stack trace like:

```
[2019-01-26 23:15:47.525][28][debug][pool] [external/envoy/source/common/http/http2/conn_pool.cc:79] max requests overflow
[2019-01-26 23:15:47.525][28][debug][router] [external/envoy/source/common/router/router.cc:532] [C0][S6223237052999230964] upstream reset
[2019-01-26 23:15:47.526][28][debug][http] [external/envoy/source/common/http/async_client_impl.cc:93] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers'
'x-envoy-overloaded', 'true'
[2019-01-26 23:15:47.526][28][debug][grpc] [src/envoy/utils/grpc_transport.cc:78] Check failed with code: 14, upstream connect error or disconnect/reset before headers
[2019-01-26 23:15:47.526][28][warning][filter] [src/istio/mixerclient/client_impl.cc:136] Mixer policy check transport error: UNAVAILABLE:upstream connect error or disconnect/reset before headers
[2019-01-26 23:15:47.526][28][debug][filter] [src/istio/mixerclient/client_impl.cc:142] Mixer policy check status: UNAVAILABLE:upstream connect error or disconnect/reset before headers
[2019-01-26 23:15:47.526][28][debug][filter] [src/envoy/http/mixer/filter.cc:152] Called Mixer::Filter : check complete UNAVAILABLE:upstream connect error or disconnect/reset before headers
[2019-01-26 23:15:47.526][28][debug][filter] [src/envoy/http/mixer/filter.cc:132] Called Mixer::Filter : encodeHeaders 3
[2019-01-26 23:15:47.526][28][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1234] [C2230][S9774838916162700467] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '69'
'content-type', 'text/plain'
'date', 'Sat, 26 Jan 2019 23:15:46 GMT'
'server', 'envoy'
[2019-01-26 23:15:47.526][28][debug][filter] [src/envoy/http/mixer/filter.cc:207] Called Mixer::Filter : onDestroy state: 3
[2019-01-26 23:15:47.526][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:125] Caught Segmentation fault, suspect faulting address 0x8
[2019-01-26 23:15:47.526][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:94] Backtrace thr<28> obj</usr/lib/x86_64-linux-gnu/libstdc++.so.6> (If unsymbolized, use tools/stack_decode.py):
[2019-01-26 23:15:47.526][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:117] thr<28> #0 0x7ff123114c04 (unknown)
[2019-01-26 23:15:47.526][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] thr<28> obj</home/jblatt/.cache/bazel/_bazel_jblatt/b6258e8f3eabbcfebfab41d31f03811a/sandbox/linux-sandbox/4/execroot/__main__/bazel-out/k8-fastbuild/bin/test/integration/mixer_fault_test.runfiles/__main__/test/integration/mixer_fault_test>
[2019-01-26 23:15:47.553][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #1 0x209e113 google::protobuf::internal::GeneratedMessageReflection::HasBit()
[2019-01-26 23:15:47.571][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #2 0x20923d6 google::protobuf::internal::GeneratedMessageReflection::HasField()
[2019-01-26 23:15:47.587][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #3 0x2037f03 google::protobuf::TextFormat::Printer::PrintField()
[2019-01-26 23:15:47.603][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #4 0x2036d68 google::protobuf::TextFormat::Printer::Print()
[2019-01-26 23:15:47.621][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #5 0x20367e1 google::protobuf::TextFormat::Printer::Print()
[2019-01-26 23:15:47.638][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #6 0x20332f1 google::protobuf::TextFormat::Printer::PrintToString()
[2019-01-26 23:15:47.656][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #7 0x2033144 _ZNK6google8protobuf7Message11DebugStringB5cxx11Ev
[2019-01-26 23:15:47.673][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #8 0x50af1b Envoy::Utils::GrpcTransport<>::GrpcTransport()
[2019-01-26 23:15:47.690][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #9 0x50aaff Envoy::Utils::GrpcTransport<>::GetFunc()::{lambda()#1}::operator()()
[2019-01-26 23:15:47.707][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #10 0x50a857 std::_Function_handler<>::_M_invoke()
[2019-01-26 23:15:47.724][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #11 0x582383 std::function<>::operator()()
[2019-01-26 23:15:47.758][28][critical][backtrace] [bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<28> #12 0x580b9a istio::mixerclient::MixerClientImpl::Check()
```

Issues 1 and 2 were fixed by replacing the large number of objects dynamically allocated in the MixerClientImpl::Check function and passed to an async client lambda with two shared pointers that keep all memory alive for the lifetime of the Envoy filter (which is about as long as the lifetime of the upstream request). This also had the nice side effect of removing a few large protobuf copies. It will also make it *much* easier to retry failed mixer requests, because the shared pointers will preserve the memory in retry lambdas. See istio/istio#9992, which I will work on next.

This is a partial fix for istio/istio#8224. Note that the original bug described this as a problem with a slow telemetry upstream. I noticed that the policy check issues were often accompanied by log messages like:

```
libprotobuf ERROR src/istio/mixerclient/report_batch.cc:75] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers
```

The policy checks encountered similar issues but didn't log them. So in addition to the shared pointer bug fix for the leak and the SIGSEGV, I also added additional logging, a simple logging abstraction so errors in the mixer client could be relayed to Envoy's logger, and additional stats so we could track error rates, cache effectiveness, etc. This should handle the policy check side. No errors were found for the telemetry side, but I'll add additional counters and logging for that too.
@qiwzhang Apologies for asking for your review since you've moved on to another project, but could you take a look at this? I think it'd really benefit from your expertise and experience with this code. There are essentially three large changes here:
Thanks
lambdai
left a comment
Initial comment. Just nits.
BTW: I believe that if you extract the logger into a separate commit, it would be much easier to review.
```
#include "options.h"

#include <vector>
#include "src/istio/mixerclient/check_context.h"
```
nit: could you verify if check_context.h is needed?
Verified it's needed by this code in that file:

```
class MixerClient {
  virtual CancelFunc Check(istio::mixerclient::CheckContextSharedPtr& context,
                           TransportCheckFunc transport,
                           CheckDoneFunc on_done) = 0;
```
```
// total_remote_check_calls = total_check_misses
// total_remote_check_calls >= total_remote_check_accepts +
//                             total_remote_check_denies
// ^ Transport errors are responsible for the >=
```

```
 void Filter::completeCheck(const CheckResponseInfo& info) {
-  auto status = info.response_status;
+  auto status = info.status();
```
const auto&? I don't see why we need a copy, unless it's a small POD.
```
-      check_response;  // by default status is unknown
+  auto attributes = std::make_shared<::istio::mixerclient::SharedAttributes>();
+  ::istio::mixerclient::CheckContext check_response(
+      false, attributes);  // by default status is unknown
```
Suggested change:

```
-      false, attributes);  // by default status is unknown
+      false /* fail_open */, attributes);  // by default status is unknown
```
```
   // Use mixer client object to make a Report call.
-  void SendReport(const RequestContext& request);
+  void SendReport(istio::mixerclient::SharedAttributesPtr& attributes);
```
It's extremely uncommon to use a non-const shared_ptr. Only code that changes the raw pointer needs a non-const shared_ptr.
Could you confirm?
If the consuming code wanted to increment the ref count on attributes, could it do that if attributes were const?
Yes. A const shared_ptr restricts you from changing the raw pointer (reset() or swap()) but allows changing the control block (use count).
ACK. Will update all of these to use const. Thanks
```
  */

 #include "src/istio/control/http/attributes_builder.h"
+#include "google/protobuf/stubs/status.h"
```
nit: I believe we have a tool to reorder the header files
Would love to learn more about this tool. Until then, I manually fixed this on #2117
@duderino This PR is too big. It is mixing bug fixes with code refactoring. For the crash, it is not clear to me how it happened, and without that knowledge I am not sure the new code will fix it. Could you use issues to explain the root cause of each crash? I will review the issues. For the code refactoring, I'd like to see it broken into smaller PRs.
I'm closing this in favor of three smaller PRs.
Special notes for your reviewer:
MixerClientImpl is pretty much a rewrite. I recommend using the 'unified view' to review that one.