blockchain/v2 concurrency issues

These are issues found for commit b43da179 on master.

# Issues/concerns on Routine

## Queue vs chan

Nothing wrong with using a priority queue or a queue to model a channel, but the fact that it grows without bound (as of the current implementation) is an issue for general usage, otherwise it's not addressing the problem of dealing with capacity, and memory usage may blow up in a non-reproducible way.  If what we're after is confidence of correctness, memory usage and bounding must be addressed.

We could modify this to have a capacity, but then either we must make send() blocking (but that appears to defeat the point of using a queue), or we need to have send return an error if the queue is full (e.g. trySend()).  My experience tells me that trySend should only be reserved for operations that are allowed to fail/be-ignored, generally something you wouldn't want to try again (except maybe a limited number of times before giving up). 

## PriorityQueue starves

By using a single priority queue, we create the potential problem of priority queue items starving lower priority items.  If there are two types of events/messages that need to be processed, it would be more correct to use two chans or two queues, but the problem with two queues is that you end up replicating logic that is better expressed using two channels.

```
select {
  case msg <- chan1:
  case msg <- chan2:
}
```
vs
```
for { // this is still broken, and not optimal.
  q := randomQueue(queue1, queue2)
  msg := q.Get(1)
  if msg != nil { break }
}
```

While a starving priority queue may be OK in some situations, maybe for the blockchain reactor, in general it's a difficult thing to get right.  For example, in the consensus reactor with a variety of messages including votes, proposals, hasVotes, blockParts etc, the programmer will probably want to set different priorities on them (reasonable since they really do have different urgencies), but this may cause unwanted starvation.  For example, with enough validators, the HasVote and Vote messages become a bottleneck, and this may cause other non-priority messages to starve, which may have unintended side-effects.  Like, a blockPart is a large thing and maybe should be lower priority than a vote, but blockParts would stop propagating if votes become a bottleneck.  Maybe this is OK, but it complicates things.

The way that Tendermint's Reactor + channels w/ priority design solves this, is to relatively throttle the bandwidth of different message types going into each channel (multiplexed in p2p/connection).  This lets the multiplexed channels of the network pipe itself serve as a prioritization mechanism.  For that, one needs to call conn.Send(channelID, msg).

## Routine should use BaseService

It's surprisingly difficult to get the starting/stopping right, is what was learned from BaseService.
Here are some of the problems that I see:

* Starting an already started service silently returns.  This masks what is programmer error.  An error should be returned instead, or it should panic.  BaseService returns an error.
* Same with stopping.  While it may seem convenient to be able to tear down a Routine/Service from any goroutine and not worry about it, this also ends up masking issues, because there are probably other functions to call in addition to Stop(), and it's easy to forget to do it everywhere.  This is why starting and stopping Routines/Services should only be done in one place, just as go enforces that closing a closed channel panics.

```
func() routine1() {
  ...
  myroutine.Stop()
  log(...)
  close(something)
}
func() routine2() {
  ...
  if err {
    myroutine.Stop()
    // oops forgot to close something.
  }
}
```

BaseService is meant to be extended, so maybe it could be called RoutineService, but I think a better name would be PriorityQueueService.

## Comment on comments

```
// Routines are a structure which model a finite state machine as serialized
// stream of events processed by a handle function. This Routine structure
// handles the concurrency and messaging guarantees. Events are sent via
// `send` are handled by the `handle` function to produce an iterator
// `next()`. Calling `close()` on a routine will conclude processing of all
// sent events and produce `final()` event representing the terminal state.
```

* Routine will eventually be abused by the user to provide a non-pure handler, so I don't think we should say that this models an FSM.  Otherwise how do we enforce it?  Do we really need to?  We could say that it is intended to.
* minor nit pick, Routine as a name conflicts with go-routines.  Also, a structure is not a (go)routine.
* what concurrency/messaging guarantee does it handle?
* `to produce an iterator next()`: it produces output events from handler.  just wording issues.  --> "input events are sent to the Routine via `send()` and output events are produced asynchronously by calling the `handle()` function in order.  the output events are consumed from the Routine.next() chan."
* there is no close() function.
* "terminal state" implies it's a return value from the FSM (represented by handle), but that's not what it is.

## Issues with final()

`final()` returns a chan which returns the error, but generally what we want is a channel which closes to signal many listeners to carry on after the Routine is stopped/terminated.  This is called .Quit() as a convention in Tendermint, and is implemented in BaseService.

Also, the final error is best provided by a function Routine.Error() which doesn't return a channel, but always returns the latest error.  Many things may want that value, but returning a channel makes it so that only the first caller receives it.

```
myroutine := newRoutine()
func () {
  ...
  <- myroutine.Quit() // wait until myroutine is closed/terminated
  // more cleanup
  log(myroutine.Error()) // <-- instead of `<-Routine.final()`, call .Error().
}()
func () {
  ...
  select {
    case foo:
    case bar:
    case baz:
    case myroutine.Quit():
      // other kinds of cleanup
      log(myroutine.Error()) // <-- can also read the error here.
  }
}()
```

## Naming convention for channels

Except for common convention names like ".Quit()" which always returns a channel, I find it helpful to name functions that return channels, with the "Ch" suffix.  This is/was used in a lot of places in Tendermint.  E.g. .final() should return a final value, whereas .finalCh() should return a channel that produces a final value.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

blockchain/v2 concurrency issues #4142

Issues/concerns on Routine

Queue vs chan

PriorityQueue starves

Routine should use BaseService

Comment on comments

Issues with final()

Naming convention for channels

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

blockchain/v2 concurrency issues #4142

Description

Issues/concerns on Routine

Queue vs chan

PriorityQueue starves

Routine should use BaseService

Comment on comments

Issues with final()

Naming convention for channels

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions