Skip to content

config: ensure Config::Utility::getAndCheckTransportVersion is called on main thread.#15548

Merged
htuch merged 10 commits intoenvoyproxy:mainfrom
htuch:v2-check-main
Mar 30, 2021
Merged

config: ensure Config::Utility::getAndCheckTransportVersion is called on main thread.#15548
htuch merged 10 commits intoenvoyproxy:mainfrom
htuch:v2-check-main

Conversation

@htuch
Copy link
Copy Markdown
Member

@htuch htuch commented Mar 18, 2021

During the v2 -> v3 migration, some uses of this were added inside filter factory lambdas and could
occur on worker threads, rejecting config on the data plane.

Relevant to #10943
Fixes #15083

Risk level: Low
Testing: New assertion in method caused existing tests to fail, fix addresses these. Also manually
audited via grep.

Signed-off-by: Harvey Tuch htuch@google.com

… on main thread.

During the v2 -> v3 migration, some uses of this were added inside filter factory lambdas and could
occur on worker threads, rejecting config on the data plane.

Relevant to envoyproxy#10943

Risk level: Low
Testing: New assertion in method caused existing tests to fail, fix addresses these. Also manually
  audited via grep.

Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123
mattklein123 previously approved these changes Mar 18, 2021
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123
Copy link
Copy Markdown
Member

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15548 (comment) was created by @mattklein123.

see: more, trace.

@mattklein123
Copy link
Copy Markdown
Member

@htuch I think this needs a main merge, there is a conflict.

/wait

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 22, 2021

@jmarantz filter_fuzz_test keeps flaking on a symbol table ASSERT in linux_x64 gcc, see https://dev.azure.com/cncf/envoy/_build/results?buildId=69835&view=logs&j=8bf29878-a4cc-50f7-4e84-2255e6fd4065&t=3267cd16-a387-57e0-0ef4-d7a4d1c6c089. This seems totally unrelated to this change, but keeps popping up when I kick CI. Any thoughts?

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Mar 22, 2021 via email

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Mar 22, 2021 via email

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 22, 2021

@jmarantz
Copy link
Copy Markdown
Contributor

That assert is just saying that there are outstanding symbols when the symbol table was destructed. The stack-trace tells us that the symbol-table was being destructed from the factory-context.

Is it possible that that this test was flaking with or without this PR's changes?

Signed-off-by: Harvey Tuch <htuch@google.com>
@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Mar 23, 2021

Is it possible that the fuzz test is holding onto a reference to a stat in a thread that persists pass the destruction of the factories?

Looking over the PR, I'm wondering what the lifetime of the LoggerCache is. Does that hold onto a Stats::Scope? Nothing obvious in this PR indicates to me that the lifetime of that has changed, but it looks like the thing that's been touched that might have an interesting issue with its lifetime in tests.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 23, 2021

The only test that I can believe might be affected is the rate limit HTTP filter, since this is an HTTP filter fuzz test. The change there seems innocuous. @asraa for an extra pair of eyes. I'll get it replicating and go deeper as next step.

@asraa
Copy link
Copy Markdown
Contributor

asraa commented Mar 24, 2021

The change there seems innocuous. @asraa for an extra pair of eyes.

About this and lifetime, can you repro it locally?
Trying to brainstorm... the filters are destroyed at the end of fuzz iteration, so I'd be skeptical if Rate limit filter was holding on to some stats.

ACTUALLY (fortunately i procrastinated pressing comment)! maybe something is exiting early from fuzz and not running that reset() I just mentioned above

what if you try adding a reset in the try/catch early return or adding it to the filter fuzzer destructor?

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 25, 2021

@asraa why is https://github.com/envoyproxy/envoy/blob/main/test/extensions/filters/http/common/fuzz/filter_fuzz_test.cc#L53 setup for persistence? I think for the unit test style fuzzers (rather than e2e ones), we would be better off with non-persistence if possible to allow earlier detection of these issues. I can repro locally and digging further into this one, the issue is with ratelimit and test/extensions/filters/http/common/fuzz/filter_corpus/clusterfuzz-testcase-minimized-filter_fuzz_test-6484279454466048.

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Mar 25, 2021 via email

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 30, 2021

Well, the good news is I have a fix, the bad news is I don't understand why it works; see 1e1ac2a. Basically, filter_config (a shared pointer) seems to be sensitive to the exception after having its reference count bumped for the lambda capture. I'm not sure what I'm missing here around behavior of lambda capture, initializer and exceptions, but it seems kind of magical.

@mattklein123
Copy link
Copy Markdown
Member

Well, the good news is I have a fix, the bad news is I don't understand why it works; see 1e1ac2a. Basically, filter_config (a shared pointer) seems to be sensitive to the exception after having its reference count bumped for the lambda capture. I'm not sure what I'm missing here around behavior of lambda capture, initializer and exceptions, but it seems kind of magical.

I'm no expert but I think lambda captures are not deterministic (C++ sigh), so I think the way to make it deterministic is to pre-capture into a struct and then pass that. I know @lambdai has been fixing similar issues so might be worth doing that here?

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Mar 30, 2021

Still not sure how that relates though; even if non-deterministic, if we take the ref count, it should safely destruct presumably and decrement, if we don't take it then there should not be an issue.

@mattklein123
Copy link
Copy Markdown
Member

Still not sure how that relates though; even if non-deterministic, if we take the ref count, it should safely destruct presumably and decrement, if we don't take it then there should not be an issue.

Honestly I think you are hitting a compiler bug. Given that lambda capture order is not deterministic I would probably just switch to a struct anyway and be done with it, but up to you.

@asraa
Copy link
Copy Markdown
Contributor

asraa commented Mar 31, 2021

@asraa why is https://github.com/envoyproxy/envoy/blob/main/test/extensions/filters/http/common/fuzz/filter_fuzz_test.cc#L53 setup for persistence?

MIssed this but worth the comment: this class just contains mock objects that could persist (and state's meant to be reset at the end for this class) for speed reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing transport_api_version in rate limiting leads to crash

4 participants