Skip to content

http: fix segfault discovered during Lyft's smoketest of #1267.#1282

Merged
mattklein123 merged 5 commits intoenvoyproxy:masterfrom
htuch:ratelimit-segfault
Jul 19, 2017
Merged

http: fix segfault discovered during Lyft's smoketest of #1267.#1282
mattklein123 merged 5 commits intoenvoyproxy:masterfrom
htuch:ratelimit-segfault

Conversation

@htuch
Copy link
Copy Markdown
Member

@htuch htuch commented Jul 19, 2017

Turns out that a sendLocalReply() will unconditionally send a body after errors such as 504, gRPC
client performs a reset before this and had self-destructed already.

.

Turns out that a sendLocalReply() will unconditionally send a body after errors such as 504, gRPC
client performs a reset before this and had self-destructed already.
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 19, 2017

@mattklein123 This should fix your segfault. I'd appreciate thoughts on whether we should also guard in encode{Headers,Trailers}(), as the fix seems fragile (it depends entirely on what router is doing in its internal implementation). I originally had stream closed guards for these, but couldn't see an easy way to get code coverage via tests.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this. I agree this whole thing is a bit sketchy. I'm flying to NY now but I should be online in the morning and we can figure out the right solution.

},
nullptr);
#endif
if (remote_closed_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this one happen? I can see how data can happen. I think trailers can also happen. We need to guard against that also I think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this PR is stale, I think I had some dirty client state locally that had remove from encode{Headers,Trailers} the guard.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we want a guard in data/trailers, but not headers. I don't think there is any reasonable way headers can happen after reset with a substantial bug beyond what we have today.

// return to Utility::sendLocalReply() it unconditionally sends a body with
// encodeData().
// TODO(htuch): This is the only case we catch today, but conceivably we could
// see encodeTrailers() as well (or encodeHeaders()?) from router on its way
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per above I think we need to guard encodeTrailers() if we do the fix like this. I don't understand the comment about encodeHeaders() ? You guard that above, but I don't think that can happen?

In general, I agree this is sketchy. Arguably, this is a bug in Utility::secondLocalReply(). Basically, even local replies should adhere to a reset and stop sending data. Fixing that is probably out of scope for this change, but I might put in a TODO/NOTE that we should look at this more.

re: coverage you should be able to hit this via unit tests I think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should ship the fix and test, will add a TODO and look into fixing the broader problem today, it seems dodgy to leave it for much longer.

For coverage, to do unit tests I need to make the stream decoder callbacks on Http::AsycnStreamImpl public to poke at via the unit test. I notice we do this in various places, seems not so great though to have to change visibility of entire interfaces just for unit tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: testing, I don't follow. In the tests you control the async client that gRPC async client uses. You can mock this and do whatever you want? (Fire callbacks on the decoder, etc.). I think you can do whatever you want without any other changes? (We have lots of tests like this).

@msample
Copy link
Copy Markdown

msample commented Jul 19, 2017

data point: new async client unary send works great for my filter so long as gRPC injection service can be connected to. As soon that connection cannot be established (was TLS issue), it bails at the assertion (even with the 3 async client changes in this PR)

@mattklein123
Copy link
Copy Markdown
Member

Yup @htuch is fixing. If you give us a few more days all of this code will be running in prod at Lyft and we can say with reasonable assurance that it works.

@mattklein123
Copy link
Copy Markdown
Member

Hmm sorry read that too quickly @msample. If it still fails with this PR there may be some other issue still.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 19, 2017

@msample I'll add an integration test for connection failure today as well, hopefully we can trigger this behavior as well in test.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 19, 2017 via email

@mattklein123
Copy link
Copy Markdown
Member

mattklein123 commented Jul 19, 2017 via email

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 19, 2017

Actually, with suitable downcasting and upcasting, we can avoid visibility changes on the callbacks to do unit tests.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 19, 2017

@msample I just added two integration tests for initial connection issues, I don't see the assertion triggered. Do you think there is some code path that isn't being exercised here?

@msample
Copy link
Copy Markdown

msample commented Jul 19, 2017

@htuch I will give it another try with the 3 async_client_impl.cc changes in this PR changes later today. In my filter's case, if it can't talk to the gRPC service to get headers, it lets the original request continue upstream as-is. That may be a different flow than your test.

@mattklein123
Copy link
Copy Markdown
Member

Per offline convo given simplicity I would be in favor of just fixing #1283 when fixing this. But I can go either way.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This looks great.

@mattklein123 mattklein123 merged commit 5bec340 into envoyproxy:master Jul 19, 2017
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

This commit makes the Goose E2E test less flaky by not assuming any
specific type on the response json. Previously, it assumed, for example,
the flight Price field is in String but sometimes LLM returns Number as
you see in the following

```
2025-10-06T17:23:43.1493910Z === RUN   TestMCPGooseRecipe/kiwi_recipe.yaml
2025-10-06T17:23:43.1494608Z     goose_test.go:51: Executing goose recipe: kiwi_recipe.yaml
2025-10-06T17:34:36.2820059Z     goose_test.go:107: Failed to unmarshal flight search results, retrying...: json: cannot unmarshal number into Go struct field kiwiFlightSearchResult.price of type string
2025-10-06T17:34:36.2822007Z     goose_test.go:51: Executing goose recipe: kiwi_recipe.yaml
2025-10-06T17:43:56.8535882Z     goose_test.go:107: Failed to unmarshal flight search results, retrying...: json: cannot unmarshal number into Go struct field kiwiFlightSearchResult.price of type string
2025-10-06T17:43:56.8547799Z     goose_test.go:48: 
```

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Co-authored-by: Erica Hughberg <erica.sundberg.90@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants