Sync stuck on execution engine delayed restart

## Description

I had a Geth + Lighthouse combo running (n sync) on the Sepolia testnet. Stopped Geth for a couple hours, but left Lighthouse running (and printing errors). A few hours later I restarted Geth, but Lighthouse (or the combo) seems to be stuck in a bad state as neither progresses, and I assume both is waiting for the other?

On the Geth side, we have a clean startup:

```
$ geth --sepolia console
INFO [07-29|19:01:55.695] Starting Geth on Sepolia testnet... 
[...]
INFO [07-29|19:01:55.914] Loaded most recent local header          number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s
[...]
// And stuck here, occasional messages, logging the engine API calls, I only get the occasional transition config exchange call.
```

On the Lighthouse side, continuing from a previous stream of errors due to Geth being offline:

```
Jul 29 16:01:46.284 ERRO Error during execution engine upcheck   error: Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8551), path: "/", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) }), service: exec
Jul 29 16:01:54.001 INFO Searching for peers                     current_slot: 281409, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:01:57.308 INFO Execution engine online                 service: exec
Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated               forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec
Jul 29 16:02:06.001 INFO Searching for peers                     current_slot: 281410, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:18.000 INFO Searching for peers                     current_slot: 281411, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:30.000 INFO Searching for peers                     current_slot: 281412, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:42.001 INFO Searching for peers                     current_slot: 281413, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:54.001 INFO Searching for peers                     current_slot: 281414, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:06.001 INFO Searching for peers                     current_slot: 281415, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:18.001 INFO Searching for peers                     current_slot: 281416, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 3, service: slot_notifier
Jul 29 16:03:30.001 INFO Searching for peers
[...]
// Repeating these idefinitely, with progressing block numbers, with the occational

Jul 29 17:13:55.829 INFO Sync state updated                      new_state: Evaluating known peers, old_state: Stalled, service: sync
Jul 29 17:13:55.829 INFO Sync state updated                      new_state: Stalled, old_state: Evaluating known peers, service: sync
```

You can find the full stream of Geth and Lighthouse logs across various restarts at https://gist.github.com/karalabe/6a87ebc2c2f525e2690f13e74d9c311b

## Version

`stable`

## Present Behaviour

The only ever actual update sent to the execution engine seems to be based on this log:

```
Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated               forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec
```

This forkchoice update is a noop for Geth, since it's already on this exact block, as signaled by the startup logs:

```
INFO [07-29|19:01:55.914] Loaded most recent local header          number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block      number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s
```

No further NewPayload or ForkchoiceUpdates are sent from Lighthouse to Geth, thus it cannot do anything. Lighthouse seems to not want to do anything either, waiting for Geth to do something?

## Expected Behaviour

Lighthouse should detect that Geth and itself are in sync and should send the next block to Geth.

## Steps to resolve

A) I've left in in this state for a few hours, it did not resolve by itself.

B) I've restarted Geth after a number of hours, Lighthouse now rejects Geth as being too far behind (well duh, waiting for you dumdum)

```
Jul 29 17:23:19.103 CRIT Could not connect to a suitable eth1 node. Please ensure that you have an eth1 http server running locally on http://localhost:8545 or specify one or more (remote) endpoints using `--eth1-endpoints <COMMA-SEPARATED-SERVER-ADDRESSES>`. Also ensure that `eth` and `net` apis are enabled on the eth1 http server, warning: BLOCK PROPOSALS WILL FAIL WITHOUT VALID, SYNCED ETH1 CONNECTION, service: eth1_rpc
Jul 29 17:23:19.103 ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8551/, auth=true => EndpointError(FarBehind)", retry_millis: 60000, service: eth1_rpc
```

C) Restarting Lighthouse starts sync properly

```
Jul 29 17:25:58.323 INFO Sync state updated                      new_state: Syncing Finalized Chain, old_state: Stalled, service: sync
Jul 29 17:26:06.001 INFO Syncing                                 est_time: --, distance: 966 slots (3 hrs 13 mins), peers: 2, service: slot_notifier
```

i.e. Seems Lighthouse went into some weird state that is not natural, since a restart instantly started syncing without issues with the same Geth execution client that it connected to multiple times before and always rejected.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync stuck on execution engine delayed restart #3390

Description

Version

Present Behaviour

Expected Behaviour

Steps to resolve

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Sync stuck on execution engine delayed restart #3390

Description

Description

Version

Present Behaviour

Expected Behaviour

Steps to resolve

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions