Skip to content

op-{node|sync-tester}: EL Sync Correctness: EngineController Parallelism and EL Sync Tests #17564

@pcw109550

Description

@pcw109550

This issue claims that there is some synchronization issue in the op-node, especially in the EL sync code path.

Starting from #17565 first commit. The first commit simply prints debug.Stack() to nowhere:

if e.syncStatus == syncStatusFinishedELButNotFinalized {
fc.SafeBlockHash = envelope.ExecutionPayload.BlockHash
fc.FinalizedBlockHash = envelope.ExecutionPayload.BlockHash
e.SetUnsafeHead(ref) // ensure that the unsafe head stays ahead of safe/finalized labels.
e.emitter.Emit(ctx, UnsafeUpdateEvent{Ref: ref})
e.SetLocalSafeHead(ref)
e.SetSafeHead(ref)
e.onSafeUpdate(ctx, ref, ref)
// This must be placed before e.SetFinalizedHead
fmt.Fprintln(io.Discard, string(debug.Stack()))

if we run EL Sync acceptance tests using no parallelism,

cd op-acceptance-tests/tests/sync_tester/sync_tester_hfs_ext_granite
TAILSCALE_NETWORKING=true GOMAXPROCS=1 go test ./... -run ^TestSyncTesterHFS_Granite_ELSync$   -count=1 -v 

We get always success. However, if we enable parallelism:

TAILSCALE_NETWORKING=true GOMAXPROCS=2 go test ./... -run ^TestSyncTesterHFS_Granite_ELSync$   -count=1 -v 

We get consistent failure.

You may argue that this faulty behavior may be rooted from mis-implementation of sync-tester. However, if we re-run the test starting from the second commit:

if e.finalizedHead == (eth.L2BlockRef{}) {
for i := range 20 {
fmt.Printf("### %d initializeUnknowns finalizedRef %s\n", i, e.finalizedHead)
time.Sleep(time.Microsecond)
}

If the e.finalizedHead is correctly synchronized, if the if statement is true, e.finalizedHead will be empty until explictly updated. However if we re-run using GOMAXPROCS=2,

### 0 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 1 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 2 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 3 initializeUnknowns finalizedRef 0xa828ef159f137180cdf7eb9c21355cb36057ca8642b8a4ee5a0fe0f4aca37e7b:15837935
### 4 initializeUnknowns finalizedRef 0xa828ef159f137180cdf7eb9c21355cb36057ca8642b8a4ee5a0fe0f4aca37e7b:15837935
### 5 initializeUnknowns finalizedRef 
...

We can check that Engine Controller state is updated with race condition.

This cases the EL Sync to be flaky and unreliable, since after the race it fails to update the finalized head properly and be stucked.

Current mitigation for EL Sync tests are to explicitly disable parallelism, by using runtime.GOMAXPROCS(1).

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions