Peer.Send() failures are not checked, especially in consensus/reactor #174

@jaekwon

peer.Send() can fail on timeout and reports the failure through its return value. That return value is not checked, which may leave consensus in an unnecessary halt state.

// From a good node, the peer state of the hung node
// http://52.91.18.251:46657/dump_consensus_state

{
  "Height": 1,
  "Round": 0,
  "Step": 4,
  "StartTime": "2015-12-31T04:10:13.916Z",
  "Proposal": false,
  "ProposalBlockPartsHeader": {
    "total": 0,
    "hash": ""
  },
  "ProposalBlockParts": null,
  "ProposalPOLRound": -1,
  "ProposalPOL": {
    "bits": 4,
    "elems": [
      0
    ]
  },
  "Prevotes": {
    "bits": 4,
    "elems": [
      0
    ]
  },
  "Precommits": {
    "bits": 4,
    "elems": [
      0
    ]
  },
  "LastCommitRound": -1,
  "LastCommit": null,
  "CatchupCommitRound": 2,
  "CatchupCommit": {
    "bits": 4,
    "elems": [
      14
    ]
  }
}

and

// From the hung node, the consensus state:
RoundState{
  H:1 R:0 S:RoundStepPrevote
  StartTime:     2015-12-31 04:10:13.746229076 +0000 UTC
  CommitTime:    0001-01-01 00:00:00 +0000 UTC
  Validators:    ValidatorSet{
      Proposer: Validator{1EF06943F4BF19A210672D97C1AA9918D5544443 PubKeyEd25519{15C27E4A78AB260BC87903BCD3A84B88491387443B24A96B15647E3C1B430861} 0-0-0 VP:21000000 A:0}
      Validators:
        Validator{1EF06943F4BF19A210672D97C1AA9918D5544443 PubKeyEd25519{15C27E4A78AB260BC87903BCD3A84B88491387443B24A96B15647E3C1B430861} 0-0-0 VP:21000000 A:0}
        Validator{3AF5BF8915C109223E0A009F49470D12D0419E62 PubKeyEd25519{2D2885E36D7E8D9434032892917069855F1C24DC0B76D0FF043C5032407D3F68} 0-0-0 VP:21000000 A:0}
        Validator{956E95DEEBF1D80889677F86B56FEFECC1F62082 PubKeyEd25519{DC4D76559214D573B269C95EAB5CAB5F87FC549D320C1DD1C522C44BDA000AC3} 0-0-0 VP:21000000 A:0}
        Validator{DAFE04C5E432C47086C7D6ACA6BFCFC8556D4E3E PubKeyEd25519{C46E8B4CEF3C1F1E78614EE6B73C65313ACCF0ADB2D8FAFEC61FAC8427979BBD} 0-0-0 VP:21000000 A:0}
    }
  Proposal:      <nil>
  ProposalBlock: nil-PartSet nil-Block
  LockedRound:   0
  LockedBlock:   nil-PartSet nil-Block
  Votes:         HeightVoteSet{H:1 R:0~1
      VoteSet{H:1 R:0 T:1 +2/3:false BA{4:X___}}
      VoteSet{H:1 R:0 T:2 +2/3:false BA{4:____}}
      VoteSet{H:1 R:1 T:1 +2/3:false BA{4:____}}
      VoteSet{H:1 R:1 T:2 +2/3:false BA{4:____}}
      VoteSet{H:1 R:2 T:1 +2/3:false BA{4:____}}
      VoteSet{H:1 R:2 T:2 +2/3:false BA{4:__XX}}
    }
  LastCommit: nil-VoteSet
  LastValidators:    ValidatorSet{
      Proposer: nil-Validator
      Validators:

    }
}

It seems that the second validator's H:1 R:2 T:2 vote was dropped, was not received, or was invalid. Note that the CatchupCommit bit array element “14” is binary 1110, which reads 0111 when written least-significant bit first (little endian). So we should be seeing _XXX on the hung node, but we’re seeing __XX.
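The bit arithmetic above can be checked with a small sketch. It assumes the bit array stores validator index 0 in the least-significant bit of the element, as the "little endian" note suggests. `renderBits` and `missingIndexes` are hypothetical helpers written for this illustration, not part of Tendermint's BitArray API.

```go
package main

import (
	"fmt"
	"strings"
)

// renderBits writes the low `bits` bits of elem, least-significant bit first,
// using 'X' for a set bit and '_' for a clear bit, like the BA{...} dumps.
// Assumption: validator index i maps to bit i of the element.
func renderBits(elem uint64, bits int) string {
	var sb strings.Builder
	for i := 0; i < bits; i++ {
		if elem&(1<<uint(i)) != 0 {
			sb.WriteByte('X')
		} else {
			sb.WriteByte('_')
		}
	}
	return sb.String()
}

// missingIndexes lists the indexes set in expected but clear in observed.
func missingIndexes(expected, observed uint64, bits int) []int {
	var out []int
	for i := 0; i < bits; i++ {
		mask := uint64(1) << uint(i)
		if expected&mask != 0 && observed&mask == 0 {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	fmt.Println(renderBits(14, 4))          // what the good node believes: _XXX
	fmt.Println(renderBits(12, 4))          // what the hung node shows: __XX
	fmt.Println(missingIndexes(14, 12, 4)) // [1], i.e. the second validator
}
```

Index 1 is the only difference, which matches the second validator's precommit being the missing one.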

If the sending of that vote failed (timed out), the reactor would still mark the vote as having been sent to that peer, and so would not try to send it again. https://github.com/eris-ltd/eris-db/blob/master/Godeps/_workspace/src/github.com/tendermint/tendermint/consensus/reactor.go#L617
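One possible shape for a fix is to record the vote as sent only after Send reports success. The sketch below is not the reactor's actual code: `Sender`, `PeerState`, `sendVote`, and `flakyPeer` are hypothetical names, and it assumes Send signals success with a boolean. If Send returns an error value instead, the check is analogous.

```go
package main

import "fmt"

// Sender is a hypothetical stand-in for a peer connection.
// Assumption: Send returns false when the send fails or times out.
type Sender interface {
	Send(msg string) bool
}

// PeerState is a hypothetical, simplified record of the votes we believe
// the peer already has, keyed by validator index.
type PeerState struct {
	sent map[int]bool
}

// sendVote marks the vote as sent only if Send succeeded, so a timed-out
// send stays eligible for a later retry.
func sendVote(p Sender, ps *PeerState, idx int) bool {
	if !p.Send(fmt.Sprintf("vote:%d", idx)) {
		return false // leave ps.sent[idx] unset
	}
	ps.sent[idx] = true
	return true
}

// flakyPeer fails its first `failures` sends, imitating timeouts.
type flakyPeer struct{ failures int }

func (f *flakyPeer) Send(string) bool {
	if f.failures > 0 {
		f.failures--
		return false
	}
	return true
}

func main() {
	peer := &flakyPeer{failures: 1}
	ps := &PeerState{sent: map[int]bool{}}
	fmt.Println(sendVote(peer, ps, 1), ps.sent[1]) // first attempt times out
	fmt.Println(sendVote(peer, ps, 1), ps.sent[1]) // retry succeeds
}
```

With this ordering, a gossip loop that skips votes already marked as sent would pick the failed vote up again on its next pass.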

This problem would manifest in poor network conditions with sparse connections.

This problem exists in consensus/reactor, but there may be similar issues in other reactors.

Labels: T:bug (Type: Bug, Confirmed)