Improve beacon node failover in validator client [tracking issue] #3613

@michaelsproul

Description

I think the beacon node failover feature is in need of some love, particularly now that we are post-merge and it is complicated by the addition of the execution node.

There are several issues:

  • Beacon nodes continue to self-report their status as synced when the execution node goes offline. This means they will be used by validator clients to produce sub-optimal attestations.
  • Beacon nodes self-report their status as synced for eight epochs after their internal sync state switches to syncing. Again, this results in sub-optimal attestations.
  • Beacon nodes return nasty 500 internal server errors when the head is optimistic, e.g.
CRIT Error during attestation routine        slot: 4747474, committee_index: 11, error: "All endpoints failed http://localhost:5052/ => RequestFailed("Failed to produce attestation data: ServerMessage(ErrorMessage { code: 500, message: \"UNHANDLED_ERROR: HeadBlockNotFullyVerified { beacon_block_root: 0xbef1f36d129ca41de0a2da31962f5cb5025262f8813c6be2448524fc75be9947, execution_status: Optimistic(0x93219f971e74377cb48ba8050f1f193390d0313afd23c8cc4b1e8a119fb32fa1) }\", stacktraces: [] })")"

A VC with redundant beacon nodes probably handles this error most gracefully, as it should fail over to the next BN. (This needs double-checking: the log above is from a node without failover BNs.)
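The failover behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Lighthouse's actual implementation: the function name `first_success`, its signature, and the error representation are all assumptions made for the example.

```rust
/// Try `request` against each beacon node in order and return the first
/// successful response. If every node fails, return all of the collected
/// errors so the caller can log something like "All endpoints failed".
///
/// Hypothetical helper: the name and signature are illustrative only.
fn first_success<T, E>(
    nodes: &[&str],
    mut request: impl FnMut(&str) -> Result<T, E>,
) -> Result<T, Vec<(String, E)>> {
    let mut errors = Vec::new();
    for &node in nodes {
        match request(node) {
            // The first node that answers wins; later nodes are not contacted.
            Ok(value) => return Ok(value),
            // Record the failure (e.g. a 500 for an optimistic head) and
            // fall through to the next node.
            Err(e) => errors.push((node.to_string(), e)),
        }
    }
    Err(errors)
}
```

With a single configured BN, as in the log above, the loop has nothing to fall back to and the 500 error surfaces directly.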

Option 1: Fail Fast

One way to address all of these issues would be to add a flag to the beacon node like --fail-fast which makes it report its status as unsynced more readily.

The downside of this approach is that it adds configuration complexity: most beacon nodes used for redundancy should be configured with --fail-fast, but at least one beacon node per cluster should not be, in order to not cripple liveness in case of sync issues (which is what SYNC_TOLERANCE_EPOCHS is trying to guard against).

The --fail-fast solution may still not fail fast enough for some users: Lighthouse will usually consider itself synced for a few epochs (2?) without any blocks, during which time attestations will still miss.
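As a rough sketch of how a --fail-fast mode might change the self-reported sync status: the struct, the function, and the exact fail-fast rules below are hypothetical. The value of eight epochs is taken from the behaviour described earlier, and the assumption that fail-fast drops the tolerance to zero is only one possible choice.

```rust
/// Slots per epoch on mainnet.
const SLOTS_PER_EPOCH: u64 = 32;
/// Tolerance in normal mode (eight epochs, per the behaviour described above).
const SYNC_TOLERANCE_EPOCHS: u64 = 8;
/// Assumed tolerance in fail-fast mode (hypothetical choice).
const FAIL_FAST_TOLERANCE_EPOCHS: u64 = 0;

/// Hypothetical snapshot of the inputs to the node's sync status.
struct NodeState {
    head_slot: u64,
    current_slot: u64,
    el_online: bool,
    head_optimistic: bool,
}

/// Returns true if the node would report itself as synced.
fn reports_synced(state: &NodeState, fail_fast: bool) -> bool {
    // In fail-fast mode, an offline EL or an optimistic head is enough
    // to report "unsynced" immediately.
    if fail_fast && (!state.el_online || state.head_optimistic) {
        return false;
    }
    let tolerance_epochs = if fail_fast {
        FAIL_FAST_TOLERANCE_EPOCHS
    } else {
        SYNC_TOLERANCE_EPOCHS
    };
    // Distance between the wall-clock slot and the head, in slots.
    let distance = state.current_slot.saturating_sub(state.head_slot);
    distance <= tolerance_epochs * SLOTS_PER_EPOCH
}
```

The node without --fail-fast keeps the generous tolerance, which is what preserves liveness when every node in the cluster is struggling to sync.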

Option 2: Quality Control

Rather than changing the behaviour of the beacon node, we could change the behaviour of the validator client to make it smarter about which beacon nodes to try first. Instead of using a binary "synced" or "not synced" distinction, the VC could use other data to make its choice, e.g. sync distance, optimistic status, execution layer online/offline status.

The complication with this approach is that we need to work within the confines of the standard APIs provided by beacon nodes (which, for example, don't include EL status), and we need to avoid opening ourselves up to attacks in which an attacker tricks us into following their chain by proposing blocks closer to the current wall-clock slot than the canonical chain.
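One way the VC-side ordering could look is sketched below. Everything here is an assumption for illustration: the `BeaconNodeHealth` struct, the tier thresholds, and the priority order of the criteria. EL status is modelled as optional because the standard APIs may not expose it. Sync distance is bucketed into coarse tiers so that a node whose head is only a slot or two closer to the wall clock gains no advantage, which is one possible (not proven) way to blunt the attack mentioned above.

```rust
/// Hypothetical per-node health snapshot gathered by the VC.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BeaconNodeHealth {
    url: String,
    /// Slots between the node's head and the wall-clock slot.
    sync_distance: u64,
    /// True if the node's head has not been fully verified by its EL.
    is_optimistic: bool,
    /// `None` when the node's API does not report EL status.
    el_offline: Option<bool>,
}

/// Bucket the sync distance into coarse tiers (thresholds are illustrative).
fn distance_tier(sync_distance: u64) -> u8 {
    if sync_distance <= 4 {
        0
    } else if sync_distance <= 32 {
        1
    } else {
        2
    }
}

/// Sort key: lower is better. Unknown EL status is treated as online.
fn rank(node: &BeaconNodeHealth) -> (bool, bool, u8) {
    (
        node.el_offline.unwrap_or(false),
        node.is_optimistic,
        distance_tier(node.sync_distance),
    )
}

/// Order candidates best-first. The sort is stable, so nodes with equal
/// rank keep the order in which the user configured them.
fn order_candidates(mut nodes: Vec<BeaconNodeHealth>) -> Vec<BeaconNodeHealth> {
    nodes.sort_by_key(|n| rank(n));
    nodes
}
```

The VC would then try the nodes in the returned order, falling back to the next one on error, rather than filtering on a binary synced flag.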

I think I have a preference for option 2 currently.

Labels

RFC (Request for comment), enhancement (New feature or request), val-client (Relates to the validator client binary)
