Skip to content

Fix WSClient deadlock in the readRoutine after Stop() is called#777

Merged
ebuchman merged 3 commits intotendermint:developfrom
silasdavis:fix-blocking-ws-client
Oct 25, 2017
Merged

Fix WSClient deadlock in the readRoutine after Stop() is called#777
ebuchman merged 3 commits intotendermint:developfrom
silasdavis:fix-blocking-ws-client

Conversation

@silasdavis
Copy link
Contributor

In WSClient when readRoutine reads a message it blocks sending it to ResultsCh. When you call Stop() on the WSClient this cause it to block indefinitely. This is because readRoutine never calls Done() on the wait group in WSClient because it is blocking on sending to ResultsCh and Stop() calls c.wg.Wait().

This fixes it by introducing a back channel writeRoutineQuit which works in a similar way to the existing readRoutineQuit. In both cases the close idiom is used so that any subsequent reads will immediately return which should eliminate any issues if both routines are trying to close each other. We need another channel (rather than using 'readRoutineQuit' for both) because only one goroutine can be responsible for closing a channel without there being a race.

@silasdavis silasdavis requested a review from ebuchman as a code owner October 24, 2017 12:29
@silasdavis silasdavis changed the title Fix WSClient deadlock in the readRoutine after Stop() Fix WSClient deadlock in the readRoutine after Stop() is called Oct 24, 2017
@silasdavis silasdavis force-pushed the fix-blocking-ws-client branch from 6abfa48 to 01be6fa Compare October 24, 2017 12:31
}
c.Logger.Info("got response", "resp", response.Result)
c.ResultsCh <- *response.Result
// Combine a non-blocking read on writeRoutineQuit with a non-blocking write on ResultsCh to avoid blocking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the c.ErrorsCh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this reminds me of an issue with WSClient, which might make sense to do something about now. It really would be better if it just exposed a chan RPCResponse rather than filtering them into ResultsCh and ErrorsCh because:

  1. It forces callers to maintain and synchronise two goroutines to read a single logical stream of messages.
  2. It would be easy to accomplish the same as a caller by launching a goroutine that reads from a single chan RPCResponse
  3. (And by far the most important) currently callers have no access to the request/response correlation ID RPCResponse.ID that is set by the client and is meant to allow the client to map the Responses.

There's actually only 13 usages of ResultsCh in the Tendermint codebase. Maybe now would be a good time to fix this issue and just provide a ResponseCh. I think I could be persuaded to do the work to port it over in the Tendermint codebase, but are there other users we need to worry about a breaking change with?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't agree more! I would be in favor of doing that. @ebuchman any objections?

@melekes
Copy link
Contributor

melekes commented Oct 24, 2017

Thanks! See my comment about ErrorsCh. We'll need to add the same select there. Maybe add a func, which will wrap this.

@codecov-io
Copy link

codecov-io commented Oct 24, 2017

Codecov Report

Merging #777 into develop will decrease coverage by 0.2%.
The diff coverage is 70%.

@@             Coverage Diff             @@
##           develop     #777      +/-   ##
===========================================
- Coverage    57.74%   57.54%   -0.21%     
===========================================
  Files           82       82              
  Lines         8440     8434       -6     
===========================================
- Hits          4874     4853      -21     
- Misses        3150     3160      +10     
- Partials       416      421       +5

@silasdavis
Copy link
Contributor Author

silasdavis commented Oct 24, 2017

@melekes well it was quick enough to do so the second commit does that, if you guys want to go that way (which would suit me, since I'd like to get at ID)

I converted ResultsCh into ResponsesCh. To surface the justification from an outdated diff:

Actually this reminds me of an issue with WSClient, which might make sense to do something about now. It really would be better if it just exposed a chan RPCResponse rather than filtering them into ResultsCh and ErrorsCh because:

  1. It forces callers to maintain and synchronise two goroutines to read a single logical stream of messages.
  2. It would be easy to accomplish the same as a caller by launching a goroutine that reads from a single chan RPCResponse
  3. (And by far the most important) currently callers have no access to the request/response correlation ID RPCResponse.ID that is set by the client and is meant to allow the client to map the Responses.

There's actually only 13 usages of ResultsCh in the Tendermint codebase. Maybe now would be a good time to fix this issue and just provide a ResponseCh. I think I could be persuaded to do the work to port it over in the Tendermint codebase, but are there other users we need to worry about a breaking change with?

@silasdavis silasdavis force-pushed the fix-blocking-ws-client branch from e4dc707 to c308dbe Compare October 24, 2017 16:43
@silasdavis silasdavis force-pushed the fix-blocking-ws-client branch from c308dbe to f6adddb Compare October 24, 2017 16:45
@ebuchman
Copy link
Contributor

I like the unified responses channel, but it feels wrong to have so many quit channels. Can this not be simplified somehow to just use c.Quit?

@silasdavis
Copy link
Contributor Author

silasdavis commented Oct 24, 2017

Why yes, we can...

At least my additional channel is unnecessary if we can depend on BaseService closing its Quit channel as it currently does (probably for this reason) so once it is closed it will emit unlimited quit signals, so unlike what I had in mind where a single signal would need to be distributed to readRoutine and writeRoutine, instead they can both be signalled to quit from Quit.

I had a look to see if we could do without the readRoutineQuit channel, but it is legitimately a different signal to Quit since it is really a 'reset' signal clearing the way for a reconnect, whereas Quit signals permanent irrevocable stopping of the service. We could probably piggy back on some other state to unambiguously detect the reconnect attempt in writeRoutine and use that to shut it down (like maybe the reconnectAfter channel), but I think it is plainer the way it is.

Last commit removes my extra channel. Let me know if you want a squash.

…ting quit signals to close both readRoutine and writeRoutine
@silasdavis silasdavis force-pushed the fix-blocking-ws-client branch from 007b100 to 4cb02d0 Compare October 25, 2017 09:19
@ebuchman ebuchman merged commit b2d5546 into tendermint:develop Oct 25, 2017
@silasdavis silasdavis deleted the fix-blocking-ws-client branch October 26, 2017 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants