http: fix segfault discovered during Lyft's smoketest of #1267.#1282
http: fix segfault discovered during Lyft's smoketest of #1267.#1282mattklein123 merged 5 commits intoenvoyproxy:masterfrom
Conversation
|
@mattklein123 This should fix your segfault. I'd appreciate thoughts on whether we should also guard in |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for looking into this. I agree this whole thing is a bit sketchy. I'm flying to NY now but I should be online in the morning and we can figure out the right solution.
| }, | ||
| nullptr); | ||
| #endif | ||
| if (remote_closed_) { |
There was a problem hiding this comment.
How can this one happen? I can see how data can happen. I think trailers can also happen. We need to guard against that also I think.
There was a problem hiding this comment.
Sorry, this PR is stale, I think I had some dirty client state locally that had remove from encode{Headers,Trailers} the guard.
There was a problem hiding this comment.
IMO we want a guard in data/trailers, but not headers. I don't think there is any reasonable way headers can happen after reset with a substantial bug beyond what we have today.
| // return to Utility::sendLocalReply() it unconditionally sends a body with | ||
| // encodeData(). | ||
| // TODO(htuch): This is the only case we catch today, but conceivably we could | ||
| // see encodeTrailers() as well (or encodeHeaders()?) from router on its way |
There was a problem hiding this comment.
Per above I think we need to guard encodeTrailers() if we do the fix like this. I don't understand the comment about encodeHeaders() ? You guard that above, but I don't think that can happen?
In general, I agree this is sketchy. Arguably, this is a bug in Utility::secondLocalReply(). Basically, even local replies should adhere to a reset and stop sending data. Fixing that is probably out of scope for this change, but I might put in a TODO/NOTE that we should look at this more.
re: coverage you should be able to hit this via unit tests I think?
There was a problem hiding this comment.
Yeah, I think we should ship the fix and test, will add a TODO and look into fixing the broader problem today, it seems dodgy to leave it for much longer.
For coverage, to do unit tests I need to make the stream decoder callbacks on Http::AsycnStreamImpl public to poke at via the unit test. I notice we do this in various places, seems not so great though to have to change visibility of entire interfaces just for unit tests.
There was a problem hiding this comment.
re: testing, I don't follow. In the tests you control the async client that gRPC async client uses. You can mock this and do whatever you want? (Fire callbacks on the decoder, etc.). I think you can do whatever you want without any other changes? (We have lots of tests like this).
|
data point: new async client unary send works great for my filter so long as gRPC injection service can be connected to. As soon that connection cannot be established (was TLS issue), it bails at the assertion (even with the 3 async client changes in this PR) |
|
Yup @htuch is fixing. If you give us a few more days all of this code will be running in prod at Lyft and we can say with reasonable assurance that it works. |
|
Hmm sorry read that too quickly @msample. If it still fails with this PR there may be some other issue still. |
|
@msample I'll add an integration test for connection failure today as well, hopefully we can trigger this behavior as well in test. |
|
It's the HTTP client the needs the unit tests, I'll put out an update when
I get to the office to demonstrate (with callback visibility changes).
…On Wed, Jul 19, 2017 at 8:13 AM Matt Klein ***@***.***> wrote:
***@***.**** commented on this pull request.
In source/common/http/async_client_impl.cc
<#1282 (comment)>:
> @@ -87,6 +90,18 @@ void AsyncStreamImpl::encodeHeaders(HeaderMapPtr&& headers, bool end_stream) {
void AsyncStreamImpl::encodeData(Buffer::Instance& data, bool end_stream) {
ENVOY_LOG(trace, "async http request response data (length={} end_stream={})", data.length(),
end_stream);
+ // It's possible we have been reset by the client but the router is still
+ // sending data on its way out. This happens for example when
+ // Utility::sendLocalReply() tells the gRPC async client that it has a non-200
+ // via encodeHeaders(), gRPC async client resets the HTTP async stream but on
+ // return to Utility::sendLocalReply() it unconditionally sends a body with
+ // encodeData().
+ // TODO(htuch): This is the only case we catch today, but conceivably we could
+ // see encodeTrailers() as well (or encodeHeaders()?) from router on its way
re: testing, I don't follow. In the tests you control the async client
that gRPC async client uses. You can mock this and do whatever you want?
(Fire callbacks on the decoder, etc.). I think you can do whatever you want
without any other changes? (We have lots of tests like this).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1282 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AKaLv37O_96vV5LtGPzFjHjcd2GWgY_hks5sPfL_gaJpZM4OcH2N>
.
|
|
Ah I see. Ok. If it can't happen it can't happen. Local reply can only do
this with data so I would just fix that until we come up with a better
solution.
…On Jul 19, 2017 8:16 AM, "htuch" ***@***.***> wrote:
It's the HTTP client the needs the unit tests, I'll put out an update when
I get to the office to demonstrate (with callback visibility changes).
On Wed, Jul 19, 2017 at 8:13 AM Matt Klein ***@***.***>
wrote:
> ***@***.**** commented on this pull request.
>
> In source/common/http/async_client_impl.cc
> <#1282 (comment)>:
>
> > @@ -87,6 +90,18 @@ void AsyncStreamImpl::encodeHeaders(HeaderMapPtr&&
headers, bool end_stream) {
> void AsyncStreamImpl::encodeData(Buffer::Instance& data, bool
end_stream) {
> ENVOY_LOG(trace, "async http request response data (length={}
end_stream={})", data.length(),
> end_stream);
> + // It's possible we have been reset by the client but the router is
still
> + // sending data on its way out. This happens for example when
> + // Utility::sendLocalReply() tells the gRPC async client that it has a
non-200
> + // via encodeHeaders(), gRPC async client resets the HTTP async stream
but on
> + // return to Utility::sendLocalReply() it unconditionally sends a body
with
> + // encodeData().
> + // TODO(htuch): This is the only case we catch today, but conceivably
we could
> + // see encodeTrailers() as well (or encodeHeaders()?) from router on
its way
>
> re: testing, I don't follow. In the tests you control the async client
> that gRPC async client uses. You can mock this and do whatever you want?
> (Fire callbacks on the decoder, etc.). I think you can do whatever you
want
> without any other changes? (We have lots of tests like this).
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#1282 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AKaLv37O_
96vV5LtGPzFjHjcd2GWgY_hks5sPfL_gaJpZM4OcH2N>
> .
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1282 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGA17FTSs8GmT4oehYRRdy74d9BCk1LXks5sPfOlgaJpZM4OcH2N>
.
|
|
Actually, with suitable downcasting and upcasting, we can avoid visibility changes on the callbacks to do unit tests. |
|
@msample I just added two integration tests for initial connection issues, I don't see the assertion triggered. Do you think there is some code path that isn't being exercised here? |
|
@htuch I will give it another try with the 3 async_client_impl.cc changes in this PR changes later today. In my filter's case, if it can't talk to the gRPC service to get headers, it lets the original request continue upstream as-is. That may be a different flow than your test. |
|
Per offline convo given simplicity I would be in favor of just fixing #1283 when fixing this. But I can go either way. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks. This looks great.
**Description** This commit makes the Goose E2E test less flaky by not assuming any specific type on the response json. Previously, it assumed, for example, the flight Price field is in String but sometimes LLM returns Number as you see in the following ``` 2025-10-06T17:23:43.1493910Z === RUN TestMCPGooseRecipe/kiwi_recipe.yaml 2025-10-06T17:23:43.1494608Z goose_test.go:51: Executing goose recipe: kiwi_recipe.yaml 2025-10-06T17:34:36.2820059Z goose_test.go:107: Failed to unmarshal flight search results, retrying...: json: cannot unmarshal number into Go struct field kiwiFlightSearchResult.price of type string 2025-10-06T17:34:36.2822007Z goose_test.go:51: Executing goose recipe: kiwi_recipe.yaml 2025-10-06T17:43:56.8535882Z goose_test.go:107: Failed to unmarshal flight search results, retrying...: json: cannot unmarshal number into Go struct field kiwiFlightSearchResult.price of type string 2025-10-06T17:43:56.8547799Z goose_test.go:48: ``` Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com> Co-authored-by: Erica Hughberg <erica.sundberg.90@gmail.com>
Turns out that a sendLocalReply() will unconditionally send a body after errors such as 504, gRPC
client performs a reset before this and had self-destructed already.