-
Notifications
You must be signed in to change notification settings - Fork 3.9k
op-{node|sync-tester}: EL Sync Correctness: EngineController Parallelism and EL Sync Tests #17564
Description
This issue claims that there is some synchronization issue in the op-node, especially in the EL sync code path.
Starting from #17565 first commit. The first commit simply prints debug.Stack() to nowhere:
optimism/op-node/rollup/engine/engine_controller.go
Lines 490 to 500 in 3c3944d
| if e.syncStatus == syncStatusFinishedELButNotFinalized { | |
| fc.SafeBlockHash = envelope.ExecutionPayload.BlockHash | |
| fc.FinalizedBlockHash = envelope.ExecutionPayload.BlockHash | |
| e.SetUnsafeHead(ref) // ensure that the unsafe head stays ahead of safe/finalized labels. | |
| e.emitter.Emit(ctx, UnsafeUpdateEvent{Ref: ref}) | |
| e.SetLocalSafeHead(ref) | |
| e.SetSafeHead(ref) | |
| e.onSafeUpdate(ctx, ref, ref) | |
| // This must be placed before e.SetFinalizedHead | |
| fmt.Fprintln(io.Discard, string(debug.Stack())) |
if we run EL Sync acceptance tests using no parallelism,
cd op-acceptance-tests/tests/sync_tester/sync_tester_hfs_ext_granite
TAILSCALE_NETWORKING=true GOMAXPROCS=1 go test ./... -run ^TestSyncTesterHFS_Granite_ELSync$ -count=1 -v We get always success. However, if we enable parallelism:
TAILSCALE_NETWORKING=true GOMAXPROCS=2 go test ./... -run ^TestSyncTesterHFS_Granite_ELSync$ -count=1 -v
We get consistent failure.
You may argue that this faulty behavior may be rooted from mis-implementation of sync-tester. However, if we re-run the test starting from the second commit:
optimism/op-node/rollup/engine/engine_controller.go
Lines 359 to 363 in 5378306
| if e.finalizedHead == (eth.L2BlockRef{}) { | |
| for i := range 20 { | |
| fmt.Printf("### %d initializeUnknowns finalizedRef %s\n", i, e.finalizedHead) | |
| time.Sleep(time.Microsecond) | |
| } |
If the e.finalizedHead is correctly synchronized, if the if statement is true, e.finalizedHead will be empty until explictly updated. However if we re-run using GOMAXPROCS=2,
### 0 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 1 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 2 initializeUnknowns finalizedRef 0x0000000000000000000000000000000000000000000000000000000000000000:0
### 3 initializeUnknowns finalizedRef 0xa828ef159f137180cdf7eb9c21355cb36057ca8642b8a4ee5a0fe0f4aca37e7b:15837935
### 4 initializeUnknowns finalizedRef 0xa828ef159f137180cdf7eb9c21355cb36057ca8642b8a4ee5a0fe0f4aca37e7b:15837935
### 5 initializeUnknowns finalizedRef
...
We can check that Engine Controller state is updated with race condition.
This cases the EL Sync to be flaky and unreliable, since after the race it fails to update the finalized head properly and be stucked.
Current mitigation for EL Sync tests are to explicitly disable parallelism, by using runtime.GOMAXPROCS(1).