Service: address back-to-back execution race #118
Conversation
Address a race discovered via repeated parallel execution of `ServiceTest.BackToBackExecution` by adding a short timeout before concluding that there's still a previous future active and handling the earlier test. Theoretically it's still possible to run into this, but in practice I think this is fair enough for actual usage, and it is sufficient to deflake the test which detected it on my machine. Thanks @htuch for discovering the issue.
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Updated to eliminate the race altogether. Ready for another look! /assign htuch
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
oschaaf
left a comment
Ready for another pass
/assign htuch
source/client/service_impl.cc
Outdated
  // ServiceTest.BackToBackExecution.
- if (busy_) {
+ if (future_.valid() &&
+     future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready && busy_) {
Looks good, thanks. One last question; why do we still need busy_? Isn't knowing that the future isn't ready sufficient to know that it's still in-flight?
I think the thing is that the future may have sent the reply and still not be ready when the next request comes in. So we may even receive the next request before the future has torn down.
To illustrate: the print I added in the following diff will be seen in the output when reproducing the failure in parallel/repeated test runs (note that the busy flag is no longer checked):
diff --git a/source/client/service_impl.cc b/source/client/service_impl.cc
index fca8290..853cb77 100644
--- a/source/client/service_impl.cc
+++ b/source/client/service_impl.cc
@@ -75,7 +75,8 @@ void ServiceImpl::writeResponse(const nighthawk::client::ExecutionResponse& resp
   // future has progressed in a state where we can do another one. This avoids the odd flake in
   // ServiceTest.BackToBackExecution.
   if (future_.valid() &&
-      future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready && busy_) {
+      future_.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
+    std::cerr << "expected to not be busy" << std::endl;
     return finishGrpcStream(false, "Only a single benchmark session is allowed at a time.");
   } else {
     busy_ = true;
Yes, that seems entirely possible. I think `busy_` is OK to unblock the import, but it's also super icky IMHO, since we're doing manual concurrency control via atomics. Ideally we would have something more like a giant select on the message loop, rather than a gRPC Read. Combining futures and gRPC APIs is kind of messy. Did we explore maybe using something like https://en.cppreference.com/w/cpp/thread/mutex/try_lock and a mutex?
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
…ack-execution-race-fix
- Uses mutexes and a condvar, removing the atomic bool.
- Adds handling for declining concurrent clients.
- Adds a test for concurrent clients.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
To tackle this in a clean way, I investigated the async APIs for the gRPC server, but unfortunately I wasn't able to find any examples for bidirectional streaming. So instead, I updated the approach here to rely on mutexes and a condition variable. Also added handling and a test for concurrent clients. This is ready for another look! /assign htuch
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
htuch
left a comment
Looks good, but more complicated than it would be ideally. I really wish gRPC had better event loop or std::future interop.
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
@htuch re: complexity
Oh, also, this is ready for another look :)
source/client/service_impl.h
Outdated
  Envoy::Thread::MutexBasicLockable log_lock_;
  ::grpc::ServerReaderWriter<::nighthawk::client::ExecutionResponse,
                             ::nighthawk::client::ExecutionRequest>* stream_;
  static ::grpc::ServerReaderWriter<::nighthawk::client::ExecutionResponse,
It's static so we can assert when multiple service instances are created because of multiple clients connecting. Any suggestions?
htuch
left a comment
LGTM except for the static stream. Re: using Envoy's gRPC implementation, that's a very definite option; you might want to file an issue. OTOH, if this works, we can leave it as is for now.
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>