Description
I had a Geth + Lighthouse combo running (n sync) on the Sepolia testnet. Stopped Geth for a couple hours, but left Lighthouse running (and printing errors). A few hours later I restarted Geth, but Lighthouse (or the combo) seems to be stuck in a bad state as neither progresses, and I assume both is waiting for the other?
On the Geth side, we have a clean startup:
$ geth --sepolia console
INFO [07-29|19:01:55.695] Starting Geth on Sepolia testnet...
[...]
INFO [07-29|19:01:55.914] Loaded most recent local header number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s
[...]
// And stuck here, occasional messages, logging the engine API calls, I only get the occasional transition config exchange call.
On the Lighthouse side, continuing from a previous stream of errors due to Geth being offline:
Jul 29 16:01:46.284 ERRO Error during execution engine upcheck error: Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8551), path: "/", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) }), service: exec
Jul 29 16:01:54.001 INFO Searching for peers current_slot: 281409, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:01:57.308 INFO Execution engine online service: exec
Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec
Jul 29 16:02:06.001 INFO Searching for peers current_slot: 281410, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:18.000 INFO Searching for peers current_slot: 281411, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:30.000 INFO Searching for peers current_slot: 281412, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:42.001 INFO Searching for peers current_slot: 281413, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:02:54.001 INFO Searching for peers current_slot: 281414, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:06.001 INFO Searching for peers current_slot: 281415, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 2, service: slot_notifier
Jul 29 16:03:18.001 INFO Searching for peers current_slot: 281416, head_slot: 280614, finalized_epoch: 8767, finalized_root: 0xd227…a385, peers: 3, service: slot_notifier
Jul 29 16:03:30.001 INFO Searching for peers
[...]
// Repeating these idefinitely, with progressing block numbers, with the occational
Jul 29 17:13:55.829 INFO Sync state updated new_state: Evaluating known peers, old_state: Stalled, service: sync
Jul 29 17:13:55.829 INFO Sync state updated new_state: Stalled, old_state: Evaluating known peers, service: sync
You can find the full stream of Geth and Lighthouse logs across various restarts at https://gist.github.com/karalabe/6a87ebc2c2f525e2690f13e74d9c311b
Version
stable
Present Behaviour
The only ever actual update sent to the execution engine seems to be based on this log:
Jul 29 16:01:57.309 INFO Issuing forkchoiceUpdated forkchoice_state: ForkChoiceState { head_block_hash: ExecutionBlockHash(0xc109a1d4196159380ce942733ba303ee0ba659a309db0ff9943bb547ee097c27), safe_block_hash: ExecutionBlockHash(0x54f3b86a04f14ef5319b9620bb84a54d54e00fcce81b8a23bdf2fd532405377d), finalized_block_hash: ExecutionBlockHash(0xe7c8594c9eef276dc86cb5d6fbaa336e488aa65b037af6ae08edcdcd11adfb33) }, service: exec
This forkchoice update is a noop for Geth, since it's already on this exact block, as signaled by the startup logs:
INFO [07-29|19:01:55.914] Loaded most recent local header number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local full block number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local fast block number=1,599,205 hash=c109a1..097c27 td=17,000,018,015,853,232 age=2h39m7s
INFO [07-29|19:01:55.914] Loaded most recent local finalized block number=1,599,146 hash=e7c859..adfb33 td=17,000,018,015,853,232 age=2h53m7s
No further NewPayload or ForkchoiceUpdates are sent from Lighthouse to Geth, thus it cannot do anything. Lighthouse seems to not want to do anything either, waiting for Geth to do something?
Expected Behaviour
Lighthouse should detect that Geth and itself are in sync and should send the next block to Geth.
Steps to resolve
A) I've left in in this state for a few hours, it did not resolve by itself.
B) I've restarted Geth after a number of hours, Lighthouse now rejects Geth as being too far behind (well duh, waiting for you dumdum)
Jul 29 17:23:19.103 CRIT Could not connect to a suitable eth1 node. Please ensure that you have an eth1 http server running locally on http://localhost:8545 or specify one or more (remote) endpoints using `--eth1-endpoints <COMMA-SEPARATED-SERVER-ADDRESSES>`. Also ensure that `eth` and `net` apis are enabled on the eth1 http server, warning: BLOCK PROPOSALS WILL FAIL WITHOUT VALID, SYNCED ETH1 CONNECTION, service: eth1_rpc
Jul 29 17:23:19.103 ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8551/, auth=true => EndpointError(FarBehind)", retry_millis: 60000, service: eth1_rpc
C) Restarting Lighthouse starts sync properly
Jul 29 17:25:58.323 INFO Sync state updated new_state: Syncing Finalized Chain, old_state: Stalled, service: sync
Jul 29 17:26:06.001 INFO Syncing est_time: --, distance: 966 slots (3 hrs 13 mins), peers: 2, service: slot_notifier
i.e. Seems Lighthouse went into some weird state that is not natural, since a restart instantly started syncing without issues with the same Geth execution client that it connected to multiple times before and always rejected.
Description
I had a Geth + Lighthouse combo running (n sync) on the Sepolia testnet. Stopped Geth for a couple hours, but left Lighthouse running (and printing errors). A few hours later I restarted Geth, but Lighthouse (or the combo) seems to be stuck in a bad state as neither progresses, and I assume both is waiting for the other?
On the Geth side, we have a clean startup:
On the Lighthouse side, continuing from a previous stream of errors due to Geth being offline:
You can find the full stream of Geth and Lighthouse logs across various restarts at https://gist.github.com/karalabe/6a87ebc2c2f525e2690f13e74d9c311b
Version
stablePresent Behaviour
The only ever actual update sent to the execution engine seems to be based on this log:
This forkchoice update is a noop for Geth, since it's already on this exact block, as signaled by the startup logs:
No further NewPayload or ForkchoiceUpdates are sent from Lighthouse to Geth, thus it cannot do anything. Lighthouse seems to not want to do anything either, waiting for Geth to do something?
Expected Behaviour
Lighthouse should detect that Geth and itself are in sync and should send the next block to Geth.
Steps to resolve
A) I've left in in this state for a few hours, it did not resolve by itself.
B) I've restarted Geth after a number of hours, Lighthouse now rejects Geth as being too far behind (well duh, waiting for you dumdum)
C) Restarting Lighthouse starts sync properly
i.e. Seems Lighthouse went into some weird state that is not natural, since a restart instantly started syncing without issues with the same Geth execution client that it connected to multiple times before and always rejected.