Tendermint can hang forever if abci app dies #1890

@ethanfrey

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):

0.20 (verified with unit test on 0.22+develop)

ABCI app (name for built-in, URL for self-written if it's publicly available):

Environment:

  • OS (e.g. from /etc/os-release):
  • Install tools:
  • Others:

What happened:

Some Tendermint nodes got stuck: they still responded to RPC requests (status, etc.), but they did not add any more blocks, nor did they pass on any proposed transactions to other nodes. These were non-validating proxy nodes exposing RPC to the outside.

Digging through the log files, I found this on the tendermint fullnode logs:

(There were many such websocket messages, but this seems to be the first one with an error)

I[07-01|18:54:21.044] We need more addresses. Sending pexRequest to random peer module=p2p peer="Peer{MConn{10.24.2.65:46656} 5c6e058cafccbdc156557b4570a52bda9c23f7a0 out}
I[07-01|18:54:51.041] We need more addresses. Sending pexRequest to random peer module=p2p peer="Peer{MConn{10.24.10.10:46656} 778cfa655a5d70a2c356947f90b19bfab83b8c7c out}
E[07-01|18:54:51.948] Error closing connection                     module=rpc-server protocol=websocket remote=127.0.0.1:60496 err="close tcp 127.0.0.1:46657->127.0.0.1:60496: use of closed network connection"
E[07-01|18:54:55.743] Stopping abci.socketClient for error: EOF    module=abci-client connection=consensus
E[07-01|18:54:55.743] Stopping abci.socketClient for error: EOF    module=abci-client connection=mempool
E[07-01|18:54:55.743] Stopping abci.socketClient for error: EOF    module=abci-client connection=query
...
E[07-01|18:55:37.139] Error closing connection                     module=rpc-server protocol=websocket remote=127.0.0.1:44472 err="close tcp 127.0.0.1:46657->127.0.0.1:44472: use of closed network connection" 
E[07-01|18:55:41.348] Connection failed @ sendRoutine              module=p2p peer=10.24.5.59:46656 conn=MConn{10.24.5.59:46656} err="pong timeout"
I[07-01|18:55:41.348] Stopping MConnection                         module=p2p peer=10.24.5.59:46656 impl=MConn{10.24.5.59:46656}  

I found this on the abci/bov side:

D[07-01|18:54:40.139] Commit synced                                module=bov height=139302 hash=955C604B97AFF17B6F0B39AD6F9E09D11824628E
D[07-01|18:54:41.939] Commit synced                                module=bov height=139303 hash=955C604B97AFF17B6F0B39AD6F9E09D11824628E
I[07-01|18:55:02.444] Starting ABCI app                            module=bov bind=unix:///socks/app.sock
I[07-01|18:55:02.444] Starting ABCIServer                          module=abci-server impl=ABCIServer    

The Tendermint-ABCI connection broke, and thus Tendermint could not process any blocks. Normal behavior for Tendermint is to exit in such a condition (I tested this locally, using both tcp:// and unix:// connections). When both containers then restart, they reestablish the connection and continue.

Note that "Stopping abci.socketClient for error: EOF" (18:54:55.743) was logged roughly 14 seconds after the last block was processed by bov (18:54:41.939).

What you expected to happen:

We were running the ABCI app with a 32MB memory limit, which caused it to crash, but in all my experience Tendermint exits when it loses the connection to the ABCI app. This is also the documented behavior: https://tendermint.com/docs/running-in-production.html#what-happens-when-my-app-dies. Tendermint continuing to run with a broken ABCI connection is pathological.

How to reproduce it (as minimally and precisely as possible):

Kill the ABCI app while some messages are being processed. This is a bit tricky to trigger in production (I think it was due to the behavior of file sockets over mounted docker volumes), but I have a nice unit test for you all :)

Logs (you can paste a small part showing an error or link a pastebin, gist, etc. containing more of the log file):

Config (you can paste only the changes you've made):

/dump_consensus_state output for consensus bugs

Anything else we need to know:

PR coming up, along with an analysis of the problematic ABCI code. Looking at this code, it seems due for a refactor; it is not up to the level of the rest of this repo.

Labels: C:abci (Component: Application Blockchain Interface), T:bug (Type Bug (Confirmed))