p2p: acceptRoutine may exit and never restart without stopping the node #2823

@ebuchman

Description

The switch runs a goroutine called acceptRoutine that is responsible for accepting new peer connections.

This routine exits when the transport is closed, but also if we receive an error on Accept. In this case, the node still runs normally, but it will no longer accept incoming connections. It does log an error, but otherwise there is no indication that the node no longer accepts connections, nor is there a way to get it to start again, other than by restarting it.

An example of an error that causes the routine to exit:

E[10116-11-10|14:37:02.305] Accept on transport errored                  module=p2p err="accept tcp [::]:26656: accept4: too many open files" numPeers=50

See cosmos/cosmos-sdk#2787 for an issue about too many open files - we should probably have a corresponding issue in this repo.

From the other side, trying to connect to a peer whose acceptRoutine has exited looks like:

I[12116-11-12|14:19:45.468] Error reconnecting to peer. Trying again     module=p2p tries=17 err="auth failure: secrect conn failed: read tcp 192.168.0.92:64943->13.56.78.160:26656: i/o timeout" addr=6a04069a2736f096404df2bbc31d7c2807c080c0@13.56.78.160:26656

We should fix this by either:

  • killing the node if the acceptRoutine exits
  • having a retry loop around the acceptRoutine to restart it some number of times before killing the node

Labels: C:p2p (Component: P2P pkg), T:bug (Type: Bug (Confirmed))