Closed
Description
The switch runs a goroutine called acceptRoutine that is responsible for accepting new peer connections.
This routine exits when the transport is closed, but it also exits if Accept returns an error. In that case the node keeps running normally but no longer accepts incoming connections. An error is logged, but there is no other indication that the node has stopped accepting connections, and the only way to get it accepting again is to restart the node.
An example of an error that triggers this:
E[10116-11-10|14:37:02.305] Accept on transport errored module=p2p err="accept tcp [::]:26656: accept4: too many open files" numPeers=50
See cosmos/cosmos-sdk#2787 for an issue about too many open files - we should probably have a corresponding issue in this repo.
Trying to connect to a peer whose acceptRoutine has exited looks like:
I[12116-11-12|14:19:45.468] Error reconnecting to peer. Trying again module=p2p tries=17 err="auth failure: secrect conn failed: read tcp 192.168.0.92:64943->13.56.78.160:26656: i/o timeout" addr=6a04069a2736f096404df2bbc31d7c2807c080c0@13.56.78.160:26656
We should fix this by either:
- killing the node if the acceptRoutine exits
- having a retry loop around the acceptRoutine to restart it some number of times before killing the node