Skip to content

config: change mistaken use of deltaGrpcMethod to sotw#8935

Closed
fredlas wants to merge 1 commit intoenvoyproxy:masterfrom
fredlas:FIX_silly_mistake
Closed

config: change mistaken use of deltaGrpcMethod to sotw#8935
fredlas wants to merge 1 commit intoenvoyproxy:masterfrom
fredlas:FIX_silly_mistake

Conversation

@fredlas
Copy link
Copy Markdown
Contributor

@fredlas fredlas commented Nov 7, 2019

Description: change mistaken use of deltaGrpcMethod to sotw
Risk Level: low
Testing:

Fixes #8927

Signed-off-by: Fred Douglas <fredlas@google.com>
@mattklein123
Copy link
Copy Markdown
Member

@fredlas please add a test that covers this failure? Thank you.

/wait

@mattklein123
Copy link
Copy Markdown
Member

re: fixing on master or reverting, I will defer to you and @wgallagher. Can we please get the other issue fixed also?

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

Working on adding a test; just wanted to make the change itself visible ASAP. I believe that the other issue (#8923) is obsolete, based on Bill's investigation. The big refactor itself is the fix to that problem.

@kyessenov
Copy link
Copy Markdown
Contributor

kyessenov commented Nov 7, 2019

The other issue is not obsolete, but it should be dealt with separately. We regularly see it in our test clusters. We can confirm if the refactor fixed it once we can fix the immediate issue.

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

I guess it depends on frame of reference, and what counts as "current" code. Certainly, because the big refactor introduced #8927, it would not yet have been possible to observe #8923 having been fixed at head of master. But, the big refactor very likely does fix it - the destructor that is the immediate cause of the segfault belongs to a class that was entirely deleted by the refactor, and replaced with something less entangled than what was there before. The root cause of the bug is the ownership structure, and a main motivation of that refactor was getting a cleaner ownership structure.

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

This is surprisingly difficult to test. The bug was due to SubscriptionFactoryImpl passing the wrong gRPC method descriptor into a GrpcMux. But, that is immediately swallowed into a GrpcSubscription, and since the factory just returns the Subscription interface, there's no way to refer to the GrpcMux. I think this might end up needing to be in the integration tests, to avoid having it just be completely artificial and not able to catch this problem.

@mattklein123
Copy link
Copy Markdown
Member

@fredlas huge +1 for an integration test please. LMK if you want to revert in the meantime. Up to you.

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

If anyone is blocked on this - i.e. just using a slightly older commit of master would be an issue - then I think the best thing to do is to commit this fix without the test. The problem is pretty easy to see, and to see the absence of, so I think we can be pretty confident it's a real fix. I am not confident that unreverting the change wouldn't be significant work, so the simpler way - which is in no way a hack, just will have its regression test coming later - seems like the best choice.

Also, since it seems likely that #8923 is fixed by the refactor, reverting to fix the one bug would just be reintroducing another.

@mattklein123
Copy link
Copy Markdown
Member

@fredlas apologies but we keep master at RC quality at all times and we can't let this commit sit with this type of issue. I'm also not willing to take a fix without a test. I'm going to revert your PR and we can re-apply when there is an integration test. If the other issue is fixed by this work and was already present, let's just resolve that issue also when you resubmit the PR.

@mattklein123
Copy link
Copy Markdown
Member

Revert in #8939. Please resubmit with a fix for this issue + integration test. Thank you!

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

@mattklein123 is it the regression test, or the actual "the behavior is fixed" test you are concerned about? With the bug, it is not possible to start an Envoy with a simple gRPC-xDS-using bootstrap config; this change makes it possible again. So, simply being able to start Envoy is the actual test. The regression test would be for something much more subtle, and in fact pre-existing to my change: the fact that Envoy is willing to declare a gRPC service method for a stream that does not match the request+response protos. The regression test is one that, if I hadn't accidentally copied the same string twice, no one would think to ask for.

Actually, I'm now not sure writing a regression test for just this particular breakage ("StreamXYZ" vs "DeltaXYZ") is even worth it - it would be a half-measure. The same problem could crop up anywhere else that Envoy uses gRPC. The only useful remedy would be to overhaul the whole gRPC client setup to get service method and request+response proto types declared just once, as an inextricable bundle.

@kyessenov
Copy link
Copy Markdown
Contributor

The test should simply run envoy with GRPC subscription and check that Stream methods are called and not Delta once. That's like half of xDS servers out there, so it deserves to have a regression test here.

@fredlas
Copy link
Copy Markdown
Contributor Author

fredlas commented Nov 7, 2019

What I'm saying is that there's not much point in testing specifically the xDS case, because it's extremely unlikely for that particular mistake to be made again. The real problem is that it was possible to make the mistake in the first place. Just checking that a certain usage of gRPC sends a certain string doesn't fix that problem. Making it so that it would have been impossible for me to make this mistake (while also making it impossible for future people to make a mistake with some FooBar and FooBaz services), would fix that problem.

@mattklein123
Copy link
Copy Markdown
Member

The test should simply run envoy with GRPC subscription and check that Stream methods are called and not Delta once. That's like half of xDS servers out there, so it deserves to have a regression test here.

Please add this integration test. If you don't want to do it, @wgallagher can take over your other PR and resubmit it in pieces and the missing test. Thanks!

@fredlas fredlas closed this Nov 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DELTA_GRPC used for GRPC api type

4 participants