node/app/event: use atomic.Pointer[handlerMap] for lock-free Publish#20584
Merged
Conversation
Replace the RWMutex+clone-on-publish model with copy-on-write on atomic.Pointer[handlerMap]. Publishers load the pointer atomically with no lock and no clone; writers (Subscribe/Unsubscribe) serialize on writerLock to compose the mutation, then clone the current root, mutate the clone, and Store the new root. Why: - Eliminates the RLock/Lock dance in every Publish call. Publish is the hot path; previously it took the lock to clone the entire map just to avoid holding the lock across handler execution. - Makes re-entrant Publish from inside handlers safe without any locking. Previously async handlers used RLock to read handler.bus, which conflicted with any concurrent Subscribe/Unsubscribe. - Once-handler cleanup after Publish uses a CAS loop so concurrent writers are not lost. Also: - eventHandler.bus is now atomic.Pointer[eventBus] so async callbacks can observe Unsubscribe (which calls bus.Store(nil)) without any lock. - Added newEventHandler helper so the atomic pointer is always initialized consistently at construction. Tested with `go test -race ./node/app/event/...` — clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
mh0lt
added a commit
that referenced
this pull request
Apr 15, 2026
… commit
The glamsterdam suite was tracking upstream
ethpandaops/ethereum-package@main unpinned, while other suites
(regular, pectra) pin to a specific version. Upstream commit 835dd9b
("feat: support gpu ere prover in zkboost", 2026-04-15) introduced an
undefined GpuConfig reference in zkboost_launcher.star:272, breaking
every Erigon PR that ran glamsterdam from that point on — regardless
of what the PR changed. Six unrelated PRs (#20471, #20472, #20526,
#20583, #20584, #20585) all failed identically.
Pin to e07503d16b (2026-04-13, the commit before the break) to stop
the bleeding. This was the last state under which our main successfully
ran glamsterdam. Revisit once upstream stabilises, or switch to a
tagged release that supports gloas_fork_epoch / fulu_fork_epoch which
our glamsterdam.io config requires.
2 tasks
AskAlexSharov
approved these changes
Apr 16, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refactors node/app/event’s eventBus concurrency model to make Publish and HasCallback lock-free by switching the handler tree to an atomic.Pointer with copy-on-write updates from Subscribe/Unsubscribe.
Changes:
- Replace
sync.RWMutex-protected handler map withatomic.Pointer[handlerMap]snapshots for lock-free reads. - Serialize writers with a
writerLock, cloning and swapping in updated handler maps. - Make
eventHandler.busanatomic.Pointer[eventBus]so async callbacks can observeUnsubscribewithout locks; addnewEventHandlerhelper.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| bus.writerLock.Lock() | ||
| defer bus.writerLock.Unlock() | ||
|
|
||
| fnType := reflect.TypeOf(fn) |
Comment on lines
+63
to
+65
| // apply the mutation, and CAS-swap the new map in. This eliminates the | ||
| // previous read/write mutex and the clone-on-publish copy, and makes | ||
| // re-entrant Publish from inside handlers safe without any locking. |
Comment on lines
418
to
419
| return fmt.Errorf("handler %v not subscrbed", fn) | ||
| } |
Comment on lines
+444
to
+451
| // Remove once-handlers in a CAS loop so concurrent writers are not lost. | ||
| if onceHandlers := snapshot.collectOnceHandlers(); len(onceHandlers) > 0 { | ||
| bus.lock.Lock() | ||
| for _, h := range onceHandlers { | ||
| bus.handlerMap.removeOnceHandler(h.callBack) | ||
| for { | ||
| current := bus.handlerMap.Load() | ||
| next := current.clone() | ||
| for _, h := range onceHandlers { | ||
| next.removeOnceHandler(h.callBack) | ||
| } |
Comment on lines
421
to
428
| if len(currentMap.handlers)+len(currentMap.nextArgMap) == 0 { | ||
| for argIndex := argCount - 2; argIndex >= 0; argIndex-- { | ||
| prevMap := prevMaps[argIndex] | ||
| delete(prevMap.nextArgMap, fnType.In(argIndex+1)) | ||
| if len(prevMap.handlers)+len(prevMap.nextArgMap) == 0 { | ||
| break | ||
| } | ||
| } |
Comment on lines
295
to
297
| if !(fnType.Kind() == reflect.Func) { | ||
| return fmt.Errorf("%s is not of type reflect.Func", reflect.TypeOf(fn).Kind()) | ||
| } |
mh0lt
added a commit
that referenced
this pull request
Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.
Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:
- activate → sends msgActivate, blocks on reply for synchronous
configure/initialize errors. Dependency activation is fire-and-forget
within the actor; completion arrives via stateChanged events.
- deactivate → sends msgDeactivate, returns immediately. Provider
deactivation and dependency cascade run in the actor.
- onComponentStateChanged → sends msgStateChanged. The event bus
callback no longer runs state logic directly; it routes through
the actor so state checks and transitions don't interleave with
activate/deactivate.
Key changes from previous design:
- Removed go func() from activate (lines 868/873) and deactivate
(line 1034) — these spawned background goroutines that caused the
deadlock. The actor goroutine replaces them.
- Removed errgroup.Wait() from activateDependencies and
deactivateDependencies — the actor must not block on cross-component
operations or its inbox fills up. Dependency cascade is now fully
event-driven: fire activate/deactivate on dependencies, let
ComponentStateChanged events signal completion.
- activate returns synchronous errors (configure/initialize failures)
via a reply channel. This preserves the existing API contract where
Activate() returns an error if setup fails.
Fuzz tests:
- TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
- TestFuzzLifecycleSerial: UNSKIPPED, passes
- TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
routed through actor — separate follow-up)
Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
mh0lt
added a commit
that referenced
this pull request
Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.
Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:
- activate → sends msgActivate, blocks on reply for synchronous
configure/initialize errors. Dependency activation is fire-and-forget
within the actor; completion arrives via stateChanged events.
- deactivate → sends msgDeactivate, returns immediately. Provider
deactivation and dependency cascade run in the actor.
- onComponentStateChanged → sends msgStateChanged. The event bus
callback no longer runs state logic directly; it routes through
the actor so state checks and transitions don't interleave with
activate/deactivate.
Key changes from previous design:
- Removed go func() from activate (lines 868/873) and deactivate
(line 1034) — these spawned background goroutines that caused the
deadlock. The actor goroutine replaces them.
- Removed errgroup.Wait() from activateDependencies and
deactivateDependencies — the actor must not block on cross-component
operations or its inbox fills up. Dependency cascade is now fully
event-driven: fire activate/deactivate on dependencies, let
ComponentStateChanged events signal completion.
- activate returns synchronous errors (configure/initialize failures)
via a reply channel. This preserves the existing API contract where
Activate() returns an error if setup fails.
Fuzz tests:
- TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
- TestFuzzLifecycleSerial: UNSKIPPED, passes
- TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
routed through actor — separate follow-up)
Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
mh0lt
added a commit
that referenced
this pull request
Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.
Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:
- activate → sends msgActivate, blocks on reply for synchronous
configure/initialize errors. Dependency activation is fire-and-forget
within the actor; completion arrives via stateChanged events.
- deactivate → sends msgDeactivate, returns immediately. Provider
deactivation and dependency cascade run in the actor.
- onComponentStateChanged → sends msgStateChanged. The event bus
callback no longer runs state logic directly; it routes through
the actor so state checks and transitions don't interleave with
activate/deactivate.
Key changes from previous design:
- Removed go func() from activate (lines 868/873) and deactivate
(line 1034) — these spawned background goroutines that caused the
deadlock. The actor goroutine replaces them.
- Removed errgroup.Wait() from activateDependencies and
deactivateDependencies — the actor must not block on cross-component
operations or its inbox fills up. Dependency cascade is now fully
event-driven: fire activate/deactivate on dependencies, let
ComponentStateChanged events signal completion.
- activate returns synchronous errors (configure/initialize failures)
via a reply channel. This preserves the existing API contract where
Activate() returns an error if setup fails.
Fuzz tests:
- TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
- TestFuzzLifecycleSerial: UNSKIPPED, passes
- TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
routed through actor — separate follow-up)
Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Apr 16, 2026
## Summary Follow-up PR addressing seven unresolved review comments from #20450 (componentization framework). All fixes are focused and localized. ### Fixes included - **`event/eventbus.go`**: `HasCallback` now uses `RLock`/`RUnlock` instead of `Lock`/`Unlock` — it only reads `handlerMap` and a full write lock blocks concurrent `Publish`/`Subscribe` unnecessarily. - **`component/componentdomain.go`**: Guard nil dereference of `opts.execPoolSize` in the root-domain fallback path. Also return `false` from the `WithDependentDomain` option callback when the passed `ComponentDomain` is not a `*componentDomain`, instead of panicking on the type assertion. - **`event/managedeventbus.go`**: Reject non-pointer inputs to `Register` and `UnregisterAll` with a descriptive error instead of panicking inside `reflect.Value.Pointer()`. - **`event/servicebus.go`**: Add `objectPointer` helper that validates `reflect.Value` kind. `Register`/`Unregister`/`UnregisterAll`/`Registrations` all use it. `Registrations` now returns a copy of the internal slice so callers cannot mutate shared state without the lock. - **`util/compare.go`**: `Equal` guards against `==` on uncomparable types (slices, maps, funcs). Falls back to `reflect.DeepEqual` when either operand has a non-comparable type, preventing runtime panics. - **`typeid.go`**: Initialize `LocalTypeDomain` to a real named domain at package init instead of leaving it as a nil `TODO`. Previously `TypeIdOf` / `newIdFor` would build IDs with a nil domain. ### Larger refactors split out Two review suggestions warranted their own PRs because they touch many call sites and benefit from independent review: - **#20584**: `Publish` → `atomic.Pointer[handlerMap]` — lock-free `Publish`/`HasCallback`, copy-on-write writers. Eliminates the `RWMutex` and clone-on-publish entirely. - **#20585**: `relations` → `[]*component` — removes interface-header allocation and `unsafe.Pointer` indirection in relations storage. ## Test plan - [x] `go build ./...` - [x] `go test ./node/app/...` - [ ] CI passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up from #20583 — addresses the deferred "Publish → atomic.Pointer" suggestion from the #20450 review.
Replaces the
sync.RWMutex+ clone-on-publish model with copy-on-write onatomic.Pointer[handlerMap]:Publish,HasCallback) load the pointer atomically with no lock and no clone — the loaded map is used as an immutable snapshot.Subscribe/Unsubscribe) serialize on a newwriterLockto compose the mutation, clone the current root, mutate the clone, andStorethe new root.Publishuses aCompareAndSwaploop so concurrent writers are not lost.Why
RLock/Lockdance in everyPublishcall.Publishis the hot path — previously it took the lock to clone the entire map just to avoid holding the lock across handler execution.Publishfrom inside handlers safe without any locking. Previously async handlers usedRLockto readhandler.bus, which conflicted with any concurrentSubscribe/Unsubscribe.Incidental
eventHandler.busis nowatomic.Pointer[eventBus]so async callbacks can observeUnsubscribe(which callsbus.Store(nil)) without any lock. Added anewEventHandlerhelper so the atomic pointer is always initialized consistently at construction.Test plan
go test ./node/app/event/...go test -race ./node/app/event/...— clean🤖 Generated with Claude Code