Fix WSClient deadlock in the readRoutine after Stop() is called#777
Fix WSClient deadlock in the readRoutine after Stop() is called#777ebuchman merged 3 commits intotendermint:developfrom
Conversation
…write to ResultsCh
6abfa48 to
01be6fa
Compare
rpc/lib/client/ws_client.go
Outdated
| } | ||
| c.Logger.Info("got response", "resp", response.Result) | ||
| c.ResultsCh <- *response.Result | ||
| // Combine a non-blocking read on writeRoutineQuit with a non-blocking write on ResultsCh to avoid blocking |
There was a problem hiding this comment.
what about the c.ErrorsCh?
There was a problem hiding this comment.
Actually this reminds me of an issue with WSClient, which might make sense to do something about now. It really would be better if it just exposed a chan RPCResponse rather than filtering them into ResultsCh and ErrorsCh because:
- It forces callers to maintain and synchronise two goroutines to read a single logical stream of messages.
- It would be easy to accomplish the same as a caller by launching a goroutine that reads from a single
chan RPCResponse - (And by far the most important) currently callers have no access to the request/response correlation ID
RPCResponse.IDthat is set by the client and is meant to allow the client to map theResponses.
There's actually only 13 usages of ResultsCh in the Tendermint codebase. Maybe now would be a good time to fix this issue and just provide a ResponseCh. I think I could be persuaded to do the work to port it over in the Tendermint codebase, but are there other users we need to worry about a breaking change with?
There was a problem hiding this comment.
can't agree more! I would be in favor of doing that. @ebuchman any objections?
|
Thanks! See my comment about ErrorsCh. We'll need to add the same select there. Maybe add a func, which will wrap this. |
Codecov Report
@@ Coverage Diff @@
## develop #777 +/- ##
===========================================
- Coverage 57.74% 57.54% -0.21%
===========================================
Files 82 82
Lines 8440 8434 -6
===========================================
- Hits 4874 4853 -21
- Misses 3150 3160 +10
- Partials 416 421 +5 |
|
@melekes well it was quick enough to do so the second commit does that, if you guys want to go that way (which would suit me, since I'd like to get at ID) I converted
|
e4dc707 to
c308dbe
Compare
c308dbe to
f6adddb
Compare
|
I like the unified responses channel, but it feels wrong to have so many quit channels. Can this not be simplified somehow to just use c.Quit? |
|
Why yes, we can... At least my additional channel is unnecessary if we can depend on I had a look to see if we could do without the Last commit removes my extra channel. Let me know if you want a squash. |
…ting quit signals to close both readRoutine and writeRoutine
007b100 to
4cb02d0
Compare
In
WSClientwhenreadRoutinereads a message it blocks sending it toResultsCh. When you callStop()on theWSClientthis cause it to block indefinitely. This is becausereadRoutinenever callsDone()on the wait group inWSClientbecause it is blocking on sending toResultsChandStop()callsc.wg.Wait().This fixes it by introducing a back channel
writeRoutineQuitwhich works in a similar way to the existingreadRoutineQuit. In both cases the close idiom is used so that any subsequent reads will immediately return which should eliminate any issues if both routines are trying to close each other. We need another channel (rather than using 'readRoutineQuit' for both) because only one goroutine can be responsible for closing a channel without there being a race.