
Service: address back to back execution race #118

Merged
htuch merged 13 commits into envoyproxy:master from oschaaf:service-back-to-back-execution-race-fix
Aug 29, 2019

Conversation

@oschaaf
Member

@oschaaf oschaaf commented Aug 19, 2019

Address a race discovered via repeated parallel execution of
`ServiceTest.BackToBackExecution` by adding a short timeout
before concluding that a previous future is still active and handling
the earlier test.

Theoretically it's still possible to run into this, but in practice I think this
is fair enough for actual usage, and it is sufficient to deflake the
test which detected it on my machine.

Thanks @htuch for discovering the issue.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>

@oschaaf oschaaf requested a review from htuch August 19, 2019 21:11
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
@oschaaf
Member Author

oschaaf commented Aug 19, 2019

Updated to eliminate the race altogether. Ready for another look!

/assign htuch

@htuch htuch mentioned this pull request Aug 19, 2019
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Member Author

@oschaaf oschaaf left a comment


Ready for another pass
/assign htuch

```cpp
// ServiceTest.BackToBackExecution.
if (busy_) {
if (future_.valid() &&
    future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready && busy_) {
```
Member


Looks good, thanks. One last question: why do we still need busy_? Isn't knowing that the future isn't ready sufficient to know that it's still in-flight?

Member Author


I think the thing is that the future may have sent the reply and still not be ready when the next request comes in. So we may even receive the next request before the future has been torn down.
To illustrate: the print I added in the following diff shows up in the output when reproducing the failure in parallel/repeated tests (note that the busy flag is no longer checked):

```diff
diff --git a/source/client/service_impl.cc b/source/client/service_impl.cc
index fca8290..853cb77 100644
--- a/source/client/service_impl.cc
+++ b/source/client/service_impl.cc
@@ -75,7 +75,8 @@ void ServiceImpl::writeResponse(const nighthawk::client::ExecutionResponse& resp
       // future has progressed in a state where we can do another one. This avoids the odd flake in
       // ServiceTest.BackToBackExecution.
       if (future_.valid() &&
-          future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready && busy_) {
+          future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
+        std::cerr << "expected to not be busy" << std::endl;
         return finishGrpcStream(false, "Only a single benchmark session is allowed at a time.");
       } else {
         busy_ = true;
```

Member


Yes, that seems entirely possible. I think busy_ is OK to unblock the import, but it's also super icky IMHO, since we're doing manual concurrency control via atomics. Ideally we would have something more like a giant select on the message loop, rather than a gRPC Read. Combining futures and gRPC APIs is kind of messy. Did we explore maybe using something like https://en.cppreference.com/w/cpp/thread/mutex/try_lock and a mutex?

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
- Uses mutexes and a condvar, removing the atomic bool.
- Adds handling for declining concurrent clients.
- Adds a test for concurrent clients.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
@oschaaf
Member Author

oschaaf commented Aug 28, 2019

To tackle this in a clean way, I investigated the async APIs for the gRPC server, but unfortunately I wasn't able to find any examples covering bidirectional streaming.

So instead, I updated the approach here to rely on CondVar & MutexBasicLockable (with a few comments to describe the workings and gotchas).

Also added handling and a test for concurrent clients. This is ready for another look!

/assign htuch

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Member

@htuch htuch left a comment


Looks good, but more complicated than it would ideally be; I really wish gRPC had better event loop or std::future interop.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
@oschaaf
Member Author

oschaaf commented Aug 28, 2019

@htuch re: complexity
One idea: maybe someday we could use Envoy's gRPC client implementation and run it on NH's main dispatcher to connect back to a new internal gRPC service method.
That client would then be running on our own event loop, and the gRPC service would slim down to just passing messages back and forth.

@oschaaf
Member Author

oschaaf commented Aug 28, 2019

Oh, also, this is ready for another look :)

```cpp
Envoy::Thread::MutexBasicLockable log_lock_;
::grpc::ServerReaderWriter<::nighthawk::client::ExecutionResponse,
                           ::nighthawk::client::ExecutionRequest>* stream_;
static ::grpc::ServerReaderWriter<::nighthawk::client::ExecutionResponse,
```
Member


Why is this one static?

Member Author


It's static so we can detect when multiple service instances are created because of multiple clients connecting; we assert when that happens. Any suggestions?

Member

@htuch htuch left a comment


LGTM except for the static stream. Re: using Envoy's gRPC implementation, that's a very definite option, you might want to file an issue. OTOH, if this works, we can leave as is for now.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Member

@htuch htuch left a comment


Thanks!

@htuch htuch merged commit 62fc3cf into envoyproxy:master Aug 29, 2019
