p2p.seeds: node exits if one of the seeds cannot be resolved to an ip addr #880

@sbellem

QUESTION

Tendermint version: 0.12.0-e236302
ABCI app: py-abci (https://pypi.python.org/pypi/abci)

Environment:
This setup runs under docker-compose with the Docker image provided on Docker Hub: https://github.com/tendermint/tendermint/blob/70d8afa6e952e24c573ece345560a5971bf2cc0e/DOCKER/Dockerfile

  • OS (e.g. from /etc/os-release):
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.6.2
PRETTY_NAME="Alpine Linux v3.6"

What happened:
When running a cluster, if one of the p2p.seeds of a node is not yet running, that node will exit on startup with an error such as the following:

tendermint-2_1  | ERROR: Failed to start node: Error in address tendermint-3:46656: lookup tendermint-3 on 127.0.0.11:53: no such host
pyabci_tendermint-2_1 exited with code 1

In that example, nodes 1 and 4 are running, and then node 2 is started.

The p2p.seeds config for node 2 (the node that exits) is:

seeds = "tendermint-1:46656,tendermint-3:46656,tendermint-4:46656"

What you expected to happen:
When a node starts, it shouldn't exit if one of the peers is not yet available.
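The expected behavior can be illustrated with a minimal sketch. This is not Tendermint's code (Tendermint is written in Go); it is a Python illustration only, and the helper names `split_seeds` and `resolve_available` are hypothetical. It assumes each seed entry has the form `host:port`, as in the config above, and it skips any host that does not resolve instead of treating the failure as fatal.

```python
import socket


def split_seeds(seeds):
    """Split a comma-separated seeds string into (host, port) pairs.

    Assumes every entry looks like "host:port".
    """
    pairs = []
    for entry in seeds.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def resolve_available(seeds, resolve=socket.gethostbyname):
    """Resolve each seed; collect failures instead of aborting.

    Returns (resolved, skipped). `resolve` is injectable so the
    function can be exercised without real DNS.
    """
    resolved, skipped = [], []
    for host, port in split_seeds(seeds):
        try:
            resolved.append((resolve(host), port))
        except OSError as err:  # socket.gaierror is a subclass of OSError
            skipped.append((host, port, str(err)))
    return resolved, skipped
```

With the seeds string from this report and only tendermint-1 and tendermint-4 resolvable, `resolved` would hold two entries and `skipped` would hold the tendermint-3 entry, so the caller could log the failure and continue starting up.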

How to reproduce it:
Start two nodes such that the second node to be started has one peer that is not running yet and another peer that is running.

The setup that was used to report this issue can be found under the branch https://github.com/sbellem/py-abci/tree/four-node-cluser-with-docker-compose

You can clone the repo, check out the branch four-node-cluser-with-docker-compose, and then run docker-compose:

$ docker-compose -f docker-compose.network.yml up -d tendermint-1
$ docker-compose -f docker-compose.network.yml up tendermint-2
Starting pyabci_abci-2_1 ... 
Starting pyabci_abci-2_1 ... done
Starting pyabci_tendermint-2_1 ... 
Starting pyabci_tendermint-2_1 ... done
Attaching to pyabci_tendermint-2_1
tendermint-2_1  | ERROR: Failed to start node: Error in address tendermint-3:46656: lookup tendermint-3 on 127.0.0.11:53: no such host
pyabci_tendermint-2_1 exited with code 1

With these steps, tendermint-2 should exit as shown above.

Anything else we need to know:
I don't understand why we currently need to start the first node without any peers defined in the p2p.seeds config parameter. If we do specify a peer, then that first node will exit because that peer is not running.

The problem described in this issue concerns a node being able to start when at least one peer is up, but I don't understand why even that is required.

This seems very fragile: one peer has to be available at all times. If that peer is specified as a seed and happens to be down when another node starts, that other node will fail to start.

Why not add some fault tolerance, so that a node keeps trying to reconnect until a reasonable timeout expires?
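A minimal sketch of that idea, in Python for illustration only: it is not how Tendermint behaves or is implemented. The function name `resolve_with_retry` and the `timeout` and `interval` defaults are assumptions. It keeps retrying the lookup of one seed host until it succeeds or the deadline passes, and returns `None` rather than exiting on failure.

```python
import socket
import time


def resolve_with_retry(host, timeout=30.0, interval=1.0,
                       resolve=socket.gethostbyname,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry resolving `host` until success or until `timeout` seconds pass.

    Returns the resolved address, or None once the deadline is reached.
    `resolve`, `clock`, and `sleep` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        try:
            return resolve(host)
        except OSError:  # includes socket.gaierror ("no such host")
            if clock() >= deadline:
                return None  # give up on this seed, but do not exit
            sleep(interval)
```

A caller could run this for each seed and proceed with whichever seeds resolved in time, so a seed that comes up a few seconds late would still be picked up.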

Labels: C:p2p (Component: P2P pkg), T:bug (Type Bug (Confirmed))