Globally limit the number of outstanding ALTS handshakes #20722
apolcyn merged 4 commits into grpc:master
Conversation
Labelling WIP since there are a few fixes to …
 * mutual client/server handshakes outstanding at the same time and
 * able to make progress. */
HandshakeQueue* g_client_handshake_queue;
HandshakeQueue* g_server_handshake_queue;
It seems weird to keep an unused data structure in non-test code. If we increase the queue size in the test, can the problem be solved?
Note that this data structure is used in the real code (i.e. outside the test) - g_client_handshake_queue is used by client side handshakes and g_server_handshake_queue is used by server side handshakes.
The problem being addressed here is basically the same problem that is currently addressed in the test, so the workaround isn't fundamentally new.
For the purpose of keeping alts_concurrent_connectivity_test passing, my reason for preferring the existing workaround over increasing the handshake queue size to a very large value is that the latter would IMO make the test less realistic of production situations, by using an artificial value for the queue size.
I should add that another approach could be to add some way to plumb down a bit to the ALTS handshake client to e.g. not use this queue, and then to e.g. set such a bit in the server-side ALTS creds within the alts_concurrent_connectivity_test.
That would require some new plumbing to the ALTS creds though, so I leaned away from this for that reason, but I could also change this PR to do that if preferred.
Understood. I meant that g_client_handshake_queue is not used when the code is running by the server side and vice versa. I agree that choosing a very large queue size is also hacky, and the current approach seems more reasonable.
arg->self = this;
GRPC_CLOSURE_INIT(&arg->closure, EnqueueHandshakeAndMaybeStartSomeLocked,
                  arg, nullptr);
combiner_->Run(&arg->closure, GRPC_ERROR_NONE);
Do we switch to grpc_core::Combiner from grpc_schedule_on_exec_ctx in ALTS code?
Because the only things we need to synchronize here are the fields of HandshakeQueue (which are only accessed within EnqueueHandshakeAndMaybeStartSomeLocked), this is the only thing that needs to use a combiner. Switching to use a combiner throughout the rest of the handshake isn't needed for this change.
void EnqueueHandshakeAndMaybeStartSomeInnerLocked(
    alts_grpc_handshaker_client* client) {
  if (client == nullptr) {
    outstanding_handshakes_--;
Please add a comment explaining when client is nullptr.
Done, I added more description on top of the EnqueueHandshakeAndMaybeStartSome API
EnqueueHandshakeAndMaybeStartSome(client, client->is_client);
  return TSI_OK;
} else {
  return continue_make_grpc_call(client, is_start);
Why do we treat the first (is_start = true) and second (is_start = false) ALTS handshake messages differently? i.e., why don't we insert the second request into the queue as well?
The reason for this is based on the intended meaning of the queue: it isn't meant to limit the number of outstanding handshake messages, but rather the number of outstanding handshake RPCs (note that this matches the queuing behavior of the http/2 MAX_CONCURRENT_STREAMS setting).
Thus, we want to gate the starting of new RPCs on available slots in the handshake queue. But once a handshake has been admitted by the queue, we proceed with it as fast as we can.
That makes sense to me (especially the part mimicking the http/2 MAX_CONCURRENT_STREAMS setting).
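To illustrate the gating semantics discussed above — only the first message of a handshake (is_start) goes through the queue, while follow-up messages of an already-admitted handshake are never queued — here is a small standalone sketch. This is not the actual gRPC code; the type and member names are illustrative only.

```cpp
#include <cstddef>
#include <list>

struct FakeHandshake {
  bool started = false;
  int messages_sent = 0;
};

// Caps the number of concurrently *admitted* handshakes, not the number of
// in-flight messages, mirroring MAX_CONCURRENT_STREAMS-style queuing.
class AdmissionQueue {
 public:
  explicit AdmissionQueue(size_t max) : max_(max) {}

  // Returns true if the message was sent, false if the handshake was queued.
  bool SendMessage(FakeHandshake* h, bool is_start) {
    if (!is_start) {
      // Follow-up message of an admitted handshake: never gated.
      h->messages_sent++;
      return true;
    }
    if (outstanding_ == max_) {
      queued_.push_back(h);  // at capacity: wait for a slot
      return false;
    }
    ++outstanding_;
    h->started = true;
    h->messages_sent++;
    return true;
  }

  // Called when a handshake RPC finishes: hand the slot to the next waiter,
  // or free it if nothing is queued.
  void HandshakeDone() {
    if (queued_.empty()) {
      --outstanding_;
      return;
    }
    FakeHandshake* next = queued_.front();
    queued_.pop_front();
    next->started = true;
    next->messages_sent++;
  }

  size_t outstanding() const { return outstanding_; }

 private:
  size_t max_;
  size_t outstanding_ = 0;
  std::list<FakeHandshake*> queued_;
};
```

Note that the count of outstanding handshakes never exceeds the cap even while waiters are queued, which is the invariant the PR's test checks at the fake handshake server.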
}
maybe_complete_tsi_next(client, true /* receive_status_finished */,
                        nullptr /* pending_recv_message_result */);
EnqueueHandshakeAndMaybeStartSome(nullptr, client->is_client);
I believe this serves to free up one slot in the outstanding queue?
Correct, added a comment to this function to explain.
GPR_ASSERT(alts_handshaker_client_start_client(nullptr) ==
           TSI_INVALID_ARGUMENT);
{
  grpc_core::ExecCtx exec_ctx;
Is there any reason to make these call sites a part of gRPC core (i.e., by creating grpc_core::ExecCtx)?
I unfortunately have trouble seeing a good way of doing that; the main thing is that these APIs are normally called from within core, and so can expect to have an ExecCtx already available.
A couple of ideas:
- We could change them to check whether an ExecCtx already exists and only instantiate one if not. This would add a decent amount of complication to the code, though, only for the purpose of this unit test's use case.
- We could add some wrappers around these, exposed with e.g. the rest of the _test_only utilities, that could in turn call these APIs with an ExecCtx active.
Sorry for the confusion. Let me phrase my question differently - what will happen if we do not create an ExecCtx before these call sites (e.g., alts_handshaker_client_start_client)?
In general, the problem this pattern addresses is that whenever we schedule a C-core closure, we need to have an ExecCtx active on the current thread; otherwise we'll crash.
Before this PR, it just happened that APIs such as alts_handshaker_client_start_client didn't schedule any closures internally. However, since this PR starts scheduling the EnqueueHandshakeAndMaybeStartSomeLocked call as a closure on a combiner, as a first step of creating a new handshake RPC (from within alts_handshaker_client_start), we now need to ensure that an ExecCtx is active.
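The pattern can be modeled with a simplified, standalone sketch (this is not gRPC's real ExecCtx; names and mechanics are illustrative only): scheduling appends a closure to a thread-local context, which drains its pending closures when it goes out of scope, and scheduling with no context active is the failure mode described above.

```cpp
#include <functional>
#include <vector>

// Illustrative stand-in for an execution context: closures scheduled while
// it is active run when the context goes out of scope.
class FakeExecCtx {
 public:
  FakeExecCtx() { current_ = this; }
  ~FakeExecCtx() {
    // Drain pending closures at scope exit.
    for (auto& fn : pending_) fn();
    current_ = nullptr;
  }

  // Scheduling requires an active context on the current thread. In the real
  // code path, scheduling without one would crash; here we just report it.
  static bool Schedule(std::function<void()> fn) {
    if (current_ == nullptr) return false;
    current_->pending_.push_back(std::move(fn));
    return true;
  }

 private:
  static thread_local FakeExecCtx* current_;
  std::vector<std::function<void()>> pending_;
};

thread_local FakeExecCtx* FakeExecCtx::current_ = nullptr;
```

This is why the test wraps calls like alts_handshaker_client_start_client in a scope that instantiates an ExecCtx: the API now schedules a closure internally, so the context must exist before the call and be drained afterward.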
yihuazhang
left a comment
The PR looks pretty good, and thanks for doing it!
markdroth
left a comment
Overall, this looks pretty good. Most of my comments are about ways to simplify the code.
Please let me know if you have any questions.
Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @apolcyn and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 345 at r1 (raw file):
class HandshakeQueue {
 public:
  HandshakeQueue(size_t max_outstanding_handshakes)

explicit
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 382 at r1 (raw file):
Previously, apolcyn wrote…
Done, I added more description on top of the EnqueueHandshakeAndMaybeStartSome API.
I don't see any description here. It's not clear to me why this would be null.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 384 at r1 (raw file):
outstanding_handshakes_--;
} else {
  Node* node = grpc_core::New<Node>();
It seems strange to have separate Arg and Node data structures and do two different allocations here. I think we can combine them into a single struct, so that when we bounce into the combiner, if we decide not to immediately start the call, we can just reuse the struct that we've already allocated.
This issue will go away if you take either of my suggestions below (using a mutex instead of a combiner and using std::list<> instead of managing the list by hand).
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 421 at r1 (raw file):
}

grpc_core::Combiner* combiner_ = grpc_combiner_create();
Using a combiner seems like overkill here. Combiners are fairly complex data structures and should be used only when the extra complexity is justified, either by performance or code simplification. In this case, unless there is something going on here that isn't clear to me, I think a simple mutex should suffice.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 422 at r1 (raw file):
grpc_core::Combiner* combiner_ = grpc_combiner_create();
Node* head_ = nullptr;
Please consider using std::list<> instead of doing this by hand.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r1 (raw file):
bool is_client) {
  gpr_once_init(&g_queued_handshakes_init, DoHandshakeQueuesInit);
  if (is_client) {
Could write this as:
HandshakeQueue* queue = is_client ? g_client_handshake_queue : g_server_handshake_queue;
queue->EnqueueHandshakeAndMaybeStartSome(client);
test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 447 at r1 (raw file):
}

static void schedule_request_grpc_call_failure_test() {
I don't actually see any test that proves that there won't be more than 40 outstanding handshaker clients at any given time.
apolcyn
left a comment
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 345 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
explicit
Done.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 382 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I don't see any description here. It's not clear to me why this would be null.
I moved the comment about nullptr down into the conditional that checks it.
PLMK (nullptr is used here as a way to signal completion of a handshake; so we don't add anything new to the queue and go straight to checking if we can resume any)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 384 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
It seems strange to have separate Arg and Node data structures and do two different allocations here. I think we can combine them into a single struct, so that when we bounce into the combiner, if we decide not to immediately start the call, we can just reuse the struct that we've already allocated. This issue will go away if you take either of my suggestions below (using a mutex instead of a combiner and using std::list<> instead of managing the list by hand).
Simplified code by taking this suggestion to use std::list and a mutex. Thanks for the suggestions!
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 421 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Using a combiner seems like overkill here. Combiners are fairly complex data structures and should be used only when the extra complexity is justified, either by performance or code simplification. In this case, unless there is something going on here that isn't clear to me, I think a simple mutex should suffice.
My motivation for using a combiner was originally just to be sure that we don't do a re-entrant mutex lock on the handshake queue mutex.
But now that I think of it, it's simple to guarantee that that can never happen, because the only thing we call into while holding the handshake queue mutex is effectively a call to call_start_batch_and_execute, which will complete with a schedule of the closure rather than an inline call.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 422 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please consider using std::list<> instead of doing this by hand.
Done, significantly simplified code
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Could write this as:
HandshakeQueue* queue = is_client ? g_client_handshake_queue : g_server_handshake_queue;
queue->EnqueueHandshakeAndMaybeStartSome(client);
Done.
test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 447 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I don't actually see any test that proves that there won't be more than 40 outstanding handshaker clients at any given time.
I originally omitted this since I was relying on a kind of load test that lives outside of this repo to verify that things were working as a whole.
I do think that adding a test here can make things easier though - added some checking in the fake handshake server to ensure this.
markdroth
left a comment
Just one noteworthy restructuring suggestion.
Reviewed 5 of 5 files at r3.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @apolcyn and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 382 at r1 (raw file):
Previously, apolcyn wrote…
I moved the comment about nullptr down into the conditional that checks it.
PLMK (nullptr is used here as a way to signal completion of a handshake; so we don't add anything new to the queue and go straight to checking if we can resume any)
The API of passing null here to signal completion is a little counter-intuitive. Instead, how about the following API:
void RequestHandshake(alts_grpc_handshaker_client* client) {
{
MutexLock lock(&mu_);
if (outstanding_handshakes_ == max_outstanding_handshakes_) {
// Max number already running, add to queue.
queued_handshakes_.push_back(client);
return;
}
// Start the handshake immediately.
++outstanding_handshakes_;
}
continue_make_grpc_call(client, true /* is_start */);
}
void HandshakeDone() {
alts_grpc_handshaker_client* client = nullptr;
{
MutexLock lock(&mu_);
if (queued_handshakes_.empty()) {
// Nothing more in queue. Decrement count and return immediately.
--outstanding_handshakes_;
return;
}
// Remove next entry from queue and start the handshake.
client = queued_handshakes_.front();
queued_handshakes_.pop_front();
}
continue_make_grpc_call(client, true /* is_start */);
}
Note that this structure also makes sure that we call continue_make_grpc_call() while not holding the lock, thus minimizing the chances of lock inversion.
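A self-contained, compilable version of the structure suggested above can be sketched as follows, with std::mutex and std::list standing in for the grpc_core types, an int client id standing in for alts_grpc_handshaker_client*, and a callback standing in for continue_make_grpc_call(). All names here are illustrative, not grpc's actual API.

```cpp
#include <functional>
#include <list>
#include <mutex>

class HandshakeQueue {
 public:
  using StartFn = std::function<void(int /*client_id*/)>;

  HandshakeQueue(size_t max, StartFn start)
      : max_outstanding_handshakes_(max), start_(std::move(start)) {}

  void RequestHandshake(int client_id) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (outstanding_handshakes_ == max_outstanding_handshakes_) {
        // Max number already running, add to queue.
        queued_handshakes_.push_back(client_id);
        return;
      }
      // Start the handshake immediately.
      ++outstanding_handshakes_;
    }
    start_(client_id);  // invoked with the lock released
  }

  void HandshakeDone() {
    int client_id;
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (queued_handshakes_.empty()) {
        // Nothing queued: just free the slot.
        --outstanding_handshakes_;
        return;
      }
      // Hand the slot directly to the next queued handshake, so the
      // outstanding count stays unchanged.
      client_id = queued_handshakes_.front();
      queued_handshakes_.pop_front();
    }
    start_(client_id);  // again, outside the lock
  }

 private:
  std::mutex mu_;
  size_t outstanding_handshakes_ = 0;
  const size_t max_outstanding_handshakes_;
  std::list<int> queued_handshakes_;
  StartFn start_;
};
```

The key design point survives the simplification: the start callback runs only after the lock is released, so a callee that re-enters the queue cannot deadlock on mu_.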
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 421 at r1 (raw file):
Previously, apolcyn wrote…
My motivation for using a combiner was originally just to be sure that we don't do a re-entrant mutex lock on the handshake queue mutex.
But now that I think of it, it's simple to guarantee that that can never happen, because the only thing we call into while holding the handshake queue mutex is effectively a call to
call_start_batch_and_execute, which will complete with a schedule of the closure rather than an inline call.
See my suggestion above for how to avoid holding the lock when starting the handshake, just for extra safety.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 371 at r3 (raw file):
queued_handshakes_.push_back(client);
}
while (outstanding_handshakes_ < max_outstanding_handshakes_ &&
I don't think we will ever start more than one handshake at a time here, because this function will be called only when requesting or finishing one handshake. So this can be an if instead of a while.
See my code suggestion elsewhere for a better way to structure this.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 387 at r3 (raw file):
private:
  gpr_mu mu_;
Please use grpc_core::Mutex instead of gpr_mu in C++ objects.
test/core/tsi/alts/fake_handshaker/fake_handshaker_server.cc, line 250 at r3 (raw file):
explicit ConcurrentRpcsCheck(FakeHandshakerService* parent) : parent_(parent) {
  grpc::internal::MutexLock lock(&parent->expected_max_concurrent_rpcs_mu_);
If parent->expected_max_concurrent_rpcs_ is 0, this object should do nothing. It does not even need to acquire the lock in that case.
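The behavior being asked for — an RAII guard that counts concurrent scopes against a configured maximum, and does nothing at all (not even acquiring the lock) when the maximum is 0 — can be sketched standalone. This is not the actual ConcurrentRpcsCheck from the fake handshaker server; the class and member names here are illustrative.

```cpp
#include <mutex>

// Counts concurrent live instances and records whether the configured
// maximum was ever exceeded. A maximum of 0 disables the check entirely.
class ConcurrentScopeCheck {
 public:
  ConcurrentScopeCheck(int* active, int expected_max, std::mutex* mu)
      : active_(active), mu_(mu) {
    if (expected_max == 0) {
      enabled_ = false;  // disabled: touch neither the lock nor the counter
      return;
    }
    std::lock_guard<std::mutex> lock(*mu_);
    ++*active_;
    ok_ = (*active_ <= expected_max);
  }

  ~ConcurrentScopeCheck() {
    if (!enabled_) return;
    std::lock_guard<std::mutex> lock(*mu_);
    --*active_;
  }

  bool ok() const { return ok_; }

 private:
  int* active_;
  std::mutex* mu_;
  bool enabled_ = true;
  bool ok_ = true;
};
```

In the PR, an analogous guard in the fake handshake server is what verifies that the client never keeps more than the configured number of handshaker RPCs outstanding.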
apolcyn
left a comment
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 382 at r1 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
The API of passing null here to signal completion is a little counter-intuitive. Instead, how about the following API:

void RequestHandshake(alts_grpc_handshaker_client* client) {
  {
    MutexLock lock(&mu_);
    if (outstanding_handshakes_ == max_outstanding_handshakes_) {
      // Max number already running, add to queue.
      queued_handshakes_.push_back(client);
      return;
    }
    // Start the handshake immediately.
    ++outstanding_handshakes_;
  }
  continue_make_grpc_call(client, true /* is_start */);
}

void HandshakeDone() {
  alts_grpc_handshaker_client* client = nullptr;
  {
    MutexLock lock(&mu_);
    if (queued_handshakes_.empty()) {
      // Nothing more in queue. Decrement count and return immediately.
      --outstanding_handshakes_;
      return;
    }
    // Remove next entry from queue and start the handshake.
    client = queued_handshakes_.front();
    queued_handshakes_.pop_front();
  }
  continue_make_grpc_call(client, true /* is_start */);
}

Note that this structure also makes sure that we call continue_make_grpc_call() while not holding the lock, thus minimizing the chances of lock inversion.
This looks great, thanks! I've changed the code to use this.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 387 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please use grpc_core::Mutex instead of gpr_mu in C++ objects.
Done.
test/core/tsi/alts/fake_handshaker/fake_handshaker_server.cc, line 250 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
If parent->expected_max_concurrent_rpcs_ is 0, this object should do nothing. It does not even need to acquire the lock in that case.
Whoops, good catch, fixed
markdroth
left a comment
Reviewed 2 of 2 files at r4.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @apolcyn and @yihuazhang)
Sorry, I had to make one post-review change to prevent a potentially flaky test: the most recent commit disables the checking of the number of active RPCs at the fake handshake server for the test case that involves handshake RPC cancellation. I believe this check was inherently racy for that test case, because there isn't explicit synchronization between a client's RECV_STATUS op completing after cancellation and the corresponding server's sync application method handler returning.
grpc_core::MutexLock lock(&mu_);
if (queued_handshakes_.empty()) {
  // Nothing more in queue. Decrement count and return immediately.
  --outstanding_handshakes_;
Shouldn't this statement be executed regardless of whether queued_handshakes_ is empty?
Ah, never mind. It is not necessary because we immediately start another handshake.
void DoHandshakeQueuesInit(void) {
  g_client_handshake_queue =
      new HandshakeQueue(40 /* max outstanding handshakes */);
Could you please define this constant (i.e., 40) as a named constant at the top of the file?
client = queued_handshakes_.front();
queued_handshakes_.pop_front();
}
if (client != nullptr) {
It is not clear to me how client becomes nullptr, because the ones that we push into the queue are never nullptr in RequestHandshake.
Good catch, I agree this isn't needed. Removed.
LGTM and Approval

Thanks for the reviews, squashing down commits before merging
Test failures: Basic Tests C++ iOS: #19627
Based on #20687
This is roughly a client-side max_concurrent_streams setting for ALTS handshakes. This is needed in order to reduce handshake service load in certain situations.
This PR has a couple of changes:
1) the handshake-limiting change in alts_handshaker_client code. This change is meant to be reverted when we find that we can safely rely on a "max concurrent streams" setting in ALTS gRPC servers.
2) mostly mechanical changes to alts_tsi_handshaker_test and alts_handshaker_client test to ensure that some of the ALTS handshaker object methods are called with an ExecCtx active.