Here is a sequence of events that can lead to the ADS stream level flow control blocking forever.
- T-1: Listener resource is subscribed to and the request has been sent out
- T0: `recv()` receives the Listener resource from the wire:
  `resources, url, version, nonce, err := s.recvMessage(stream)`
- T1: `recv()` sets the `pending` bit of the flow control to `true`, then invokes the response handler, passing it the `onDone` callback:
  `resourceNames, nackErr = s.eventHandler.onResponse(resp, s.fc.onDone)`
- T2: The response handler runs and, as part of handling the update, subscribes to a RouteConfiguration resource. This results in `subscribe()` being called on the ADS stream, which queues the request.
- T3: The response handler invokes the `onDone` callback to release flow control. The callback writes to `readyCh` to unblock goroutines waiting for flow control, but it has not yet set the `pending` bit to `false`:
  `case fc.readyCh <- struct{}{}:`
- T4: Meanwhile, the `send()` goroutine gets to run and processes the request for the RouteConfiguration. It calls `sendNew()` to send this request out:
  `if err := s.sendNew(stream, typ); err != nil {`
- T5: `sendNew()` checks the `pending` bit of the flow control, which the `onDone` callback has not yet set to `false`. It therefore decides to buffer the request, but it loses the CPU before it can do so.
- T6: Meanwhile, `recv()` is in the next iteration of its `for` loop and has been unblocked from its call to `fc.wait()`.
- T7: `recv()` attempts to send out any buffered requests by calling `sendBuffered()`, but that method finds none, because `sendNew()` has not yet written to the `bufferedRequests` channel:
  `func (s *adsStreamImpl) sendBuffered(stream clients.Stream) error {`
- T8: `sendNew()` now writes to the `bufferedRequests` channel.
- Any time after T5: the `onDone` callback sets the `pending` bit to `false`.

The request buffered at T8 never gets sent out: `recv()` is blocked waiting for a response from the management server, but no response is expected because the ADS stream has not requested any new resource. Eventually the RDS resource watch timer expires, and this is reported to the watcher as a resource-not-found error.
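
The interleaving above can be replayed deterministically with a small model. This is only a sketch, not the grpc-go code: the `stream` and `flowControl` types, the `replayInterleaving` function, and the buffered `readyCh` are hypothetical simplifications, and the steps that run on separate goroutines in the real client are executed here on one goroutine in the T1 to T8 order so the outcome is reproducible.

```go
package main

import "fmt"

// flowControl is a simplified, hypothetical stand-in for the ADS stream's
// flow control state: a pending bit plus a channel that wakes up recv().
type flowControl struct {
	pending bool
	readyCh chan struct{}
}

// stream is a stripped-down model of the ADS stream (not the real type).
type stream struct {
	fc       flowControl
	buffered chan string // stands in for the bufferedRequests channel
	sent     []string    // requests that actually went out on the wire
}

// sendBuffered sends a buffered request if one exists; otherwise it does nothing.
func (s *stream) sendBuffered() {
	select {
	case req := <-s.buffered:
		s.sent = append(s.sent, req)
	default:
	}
}

// replayInterleaving runs the steps in the problematic order and returns the
// requests that were sent and the number left sitting in the buffer.
func replayInterleaving() (sent []string, stuck int) {
	s := &stream{buffered: make(chan string, 1)}
	// Buffered so a single goroutine can play both sides (an assumption of
	// this model, not a statement about the real channel).
	s.fc.readyCh = make(chan struct{}, 1)

	s.fc.pending = true        // T1: recv() marks the update as pending
	s.fc.readyCh <- struct{}{} // T3: onDone signals readyCh; pending is still true
	mustBuffer := s.fc.pending // T4/T5: sendNew() observes pending == true, then stalls
	<-s.fc.readyCh             // T6: recv() returns from its wait on flow control
	s.sendBuffered()           // T7: nothing has been buffered yet
	if mustBuffer {
		s.buffered <- "RouteConfiguration" // T8: sendNew() buffers the request
	} else {
		s.sent = append(s.sent, "RouteConfiguration")
	}
	s.fc.pending = false // any time after T5: onDone clears the pending bit

	// From here on, recv() would block on the wire and never drain the buffer.
	return s.sent, len(s.buffered)
}

func main() {
	sent, stuck := replayInterleaving()
	fmt.Printf("sent=%v stuck=%d\n", sent, stuck) // prints "sent=[] stuck=1"
}
```

In this model the window is the gap between signalling `readyCh` (T3) and clearing `pending`: `sendNew()` can read a stale `pending` value after `recv()` has already been woken and has already drained an empty buffer.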
Code references (grpc-go/internal/xds/clients/xdsclient/ads_stream.go at e350804): T0 is line 494, T1 is line 512, T2 is line 187, T3 is line 768, T4 is line 292, T5 is line 321, T6 is line 486, and T7 is line 371.