Skip to content

node/app/event: use atomic.Pointer[handlerMap] for lock-free Publish#20584

Merged
AskAlexSharov merged 2 commits into
mainfrom
feat/eventbus-atomic-handlermap
Apr 16, 2026
Merged

node/app/event: use atomic.Pointer[handlerMap] for lock-free Publish#20584
AskAlexSharov merged 2 commits into
mainfrom
feat/eventbus-atomic-handlermap

Conversation

@mh0lt

@mh0lt mh0lt commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up from #20583 — addresses the deferred "Publish → atomic.Pointer" suggestion from the #20450 review.

Replaces the sync.RWMutex + clone-on-publish model with copy-on-write on atomic.Pointer[handlerMap]:

  • Publishers (Publish, HasCallback) load the pointer atomically with no lock and no clone — the loaded map is used as an immutable snapshot.
  • Writers (Subscribe/Unsubscribe) serialize on a new writerLock to compose the mutation, clone the current root, mutate the clone, and Store the new root.
  • Once-handler cleanup after Publish uses a CompareAndSwap loop so concurrent writers are not lost.

Why

  • Eliminates the RLock/Lock dance in every Publish call. Publish is the hot path — previously it took the lock to clone the entire map just to avoid holding the lock across handler execution.
  • Makes re-entrant Publish from inside handlers safe without any locking. Previously async handlers used RLock to read handler.bus, which conflicted with any concurrent Subscribe/Unsubscribe.

Incidental

eventHandler.bus is now atomic.Pointer[eventBus] so async callbacks can observe Unsubscribe (which calls bus.Store(nil)) without any lock. Added a newEventHandler helper so the atomic pointer is always initialized consistently at construction.

Test plan

  • go test ./node/app/event/...
  • go test -race ./node/app/event/... — clean
  • CI passes

🤖 Generated with Claude Code

Replace the RWMutex+clone-on-publish model with copy-on-write on
atomic.Pointer[handlerMap]. Publishers load the pointer atomically
with no lock and no clone; writers (Subscribe/Unsubscribe) serialize
on writerLock to compose the mutation, then clone the current root,
mutate the clone, and Store the new root.

Why:
- Eliminates the RLock/Lock dance in every Publish call. Publish is
  the hot path; previously it took the lock to clone the entire map
  just to avoid holding the lock across handler execution.
- Makes re-entrant Publish from inside handlers safe without any
  locking. Previously async handlers used RLock to read handler.bus,
  which conflicted with any concurrent Subscribe/Unsubscribe.
- Once-handler cleanup after Publish uses a CAS loop so concurrent
  writers are not lost.

Also:
- eventHandler.bus is now atomic.Pointer[eventBus] so async callbacks
  can observe Unsubscribe (which calls bus.Store(nil)) without any
  lock.
- Added newEventHandler helper so the atomic pointer is always
  initialized consistently at construction.

Tested with `go test -race ./node/app/event/...` — clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mh0lt added a commit that referenced this pull request Apr 15, 2026
… commit

The glamsterdam suite was tracking upstream
ethpandaops/ethereum-package@main unpinned, while other suites
(regular, pectra) pin to a specific version. Upstream commit 835dd9b
("feat: support gpu ere prover in zkboost", 2026-04-15) introduced an
undefined GpuConfig reference in zkboost_launcher.star:272, breaking
every Erigon PR that ran glamsterdam from that point on — regardless
of what the PR changed. Six unrelated PRs (#20471, #20472, #20526,
#20583, #20584, #20585) all failed identically.

Pin to e07503d16b (2026-04-13, the commit before the break) to stop
the bleeding. This was the last state under which our main successfully
ran glamsterdam. Revisit once upstream stabilises, or switch to a
tagged release that supports gloas_fork_epoch / fulu_fork_epoch which
our glamsterdam.io config requires.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors node/app/event’s eventBus concurrency model to make Publish and HasCallback lock-free by switching the handler tree to an atomic.Pointer with copy-on-write updates from Subscribe/Unsubscribe.

Changes:

  • Replace sync.RWMutex-protected handler map with atomic.Pointer[handlerMap] snapshots for lock-free reads.
  • Serialize writers with a writerLock, cloning and swapping in updated handler maps.
  • Make eventHandler.bus an atomic.Pointer[eventBus] so async callbacks can observe Unsubscribe without locks; add newEventHandler helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bus.writerLock.Lock()
defer bus.writerLock.Unlock()

fnType := reflect.TypeOf(fn)
Comment on lines +63 to +65
// apply the mutation, and CAS-swap the new map in. This eliminates the
// previous read/write mutex and the clone-on-publish copy, and makes
// re-entrant Publish from inside handlers safe without any locking.
Comment on lines 418 to 419
return fmt.Errorf("handler %v not subscrbed", fn)
}
Comment on lines +444 to +451
// Remove once-handlers in a CAS loop so concurrent writers are not lost.
if onceHandlers := snapshot.collectOnceHandlers(); len(onceHandlers) > 0 {
bus.lock.Lock()
for _, h := range onceHandlers {
bus.handlerMap.removeOnceHandler(h.callBack)
for {
current := bus.handlerMap.Load()
next := current.clone()
for _, h := range onceHandlers {
next.removeOnceHandler(h.callBack)
}
Comment on lines 421 to 428
if len(currentMap.handlers)+len(currentMap.nextArgMap) == 0 {
for argIndex := argCount - 2; argIndex >= 0; argIndex-- {
prevMap := prevMaps[argIndex]
delete(prevMap.nextArgMap, fnType.In(argIndex+1))
if len(prevMap.handlers)+len(prevMap.nextArgMap) == 0 {
break
}
}
Comment on lines 295 to 297
if !(fnType.Kind() == reflect.Func) {
return fmt.Errorf("%s is not of type reflect.Func", reflect.TypeOf(fn).Kind())
}
@AskAlexSharov AskAlexSharov added this pull request to the merge queue Apr 16, 2026
Merged via the queue into main with commit 3d1c72b Apr 16, 2026
36 checks passed
@AskAlexSharov AskAlexSharov deleted the feat/eventbus-atomic-handlermap branch April 16, 2026 08:27
mh0lt added a commit that referenced this pull request Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.

Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:

  - activate → sends msgActivate, blocks on reply for synchronous
    configure/initialize errors. Dependency activation is fire-and-forget
    within the actor; completion arrives via stateChanged events.
  - deactivate → sends msgDeactivate, returns immediately. Provider
    deactivation and dependency cascade run in the actor.
  - onComponentStateChanged → sends msgStateChanged. The event bus
    callback no longer runs state logic directly; it routes through
    the actor so state checks and transitions don't interleave with
    activate/deactivate.

Key changes from previous design:
  - Removed go func() from activate (lines 868/873) and deactivate
    (line 1034) — these spawned background goroutines that caused the
    deadlock. The actor goroutine replaces them.
  - Removed errgroup.Wait() from activateDependencies and
    deactivateDependencies — the actor must not block on cross-component
    operations or its inbox fills up. Dependency cascade is now fully
    event-driven: fire activate/deactivate on dependencies, let
    ComponentStateChanged events signal completion.
  - activate returns synchronous errors (configure/initialize failures)
    via a reply channel. This preserves the existing API contract where
    Activate() returns an error if setup fails.

Fuzz tests:
  - TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
  - TestFuzzLifecycleSerial: UNSKIPPED, passes
  - TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
    routed through actor — separate follow-up)

Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
mh0lt added a commit that referenced this pull request Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.

Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:

  - activate → sends msgActivate, blocks on reply for synchronous
    configure/initialize errors. Dependency activation is fire-and-forget
    within the actor; completion arrives via stateChanged events.
  - deactivate → sends msgDeactivate, returns immediately. Provider
    deactivation and dependency cascade run in the actor.
  - onComponentStateChanged → sends msgStateChanged. The event bus
    callback no longer runs state logic directly; it routes through
    the actor so state checks and transitions don't interleave with
    activate/deactivate.

Key changes from previous design:
  - Removed go func() from activate (lines 868/873) and deactivate
    (line 1034) — these spawned background goroutines that caused the
    deadlock. The actor goroutine replaces them.
  - Removed errgroup.Wait() from activateDependencies and
    deactivateDependencies — the actor must not block on cross-component
    operations or its inbox fills up. Dependency cascade is now fully
    event-driven: fire activate/deactivate on dependencies, let
    ComponentStateChanged events signal completion.
  - activate returns synchronous errors (configure/initialize failures)
    via a reply channel. This preserves the existing API contract where
    Activate() returns an error if setup fails.

Fuzz tests:
  - TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
  - TestFuzzLifecycleSerial: UNSKIPPED, passes
  - TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
    routed through actor — separate follow-up)

Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
mh0lt added a commit that referenced this pull request Apr 16, 2026
Fixes the deadlock where concurrent activate/deactivate operations caused
lock-ordering violations between background goroutines fighting for the
component's RWMutex.

Design: each component has an inbox channel processed by a single actor
goroutine (runActor). All state-mutating lifecycle operations are
serialized through the inbox:

  - activate → sends msgActivate, blocks on reply for synchronous
    configure/initialize errors. Dependency activation is fire-and-forget
    within the actor; completion arrives via stateChanged events.
  - deactivate → sends msgDeactivate, returns immediately. Provider
    deactivation and dependency cascade run in the actor.
  - onComponentStateChanged → sends msgStateChanged. The event bus
    callback no longer runs state logic directly; it routes through
    the actor so state checks and transitions don't interleave with
    activate/deactivate.

Key changes from previous design:
  - Removed go func() from activate (lines 868/873) and deactivate
    (line 1034) — these spawned background goroutines that caused the
    deadlock. The actor goroutine replaces them.
  - Removed errgroup.Wait() from activateDependencies and
    deactivateDependencies — the actor must not block on cross-component
    operations or its inbox fills up. Dependency cascade is now fully
    event-driven: fire activate/deactivate on dependencies, let
    ComponentStateChanged events signal completion.
  - activate returns synchronous errors (configure/initialize failures)
    via a reply channel. This preserves the existing API contract where
    Activate() returns an error if setup fails.

Fuzz tests:
  - TestFuzzLifecycleConcurrent: UNSKIPPED, passes (no deadlock)
  - TestFuzzLifecycleSerial: UNSKIPPED, passes
  - TestFuzzAddRemoveConcurrent: still skipped (AddDependency not yet
    routed through actor — separate follow-up)

Remaining: event bus has a data race in handlerMap.publish when called
from multiple actor goroutines simultaneously — that's #20584's scope,
not this commit.
github-merge-queue Bot pushed a commit that referenced this pull request Apr 16, 2026
## Summary

Follow-up PR addressing seven unresolved review comments from #20450
(componentization framework). All fixes are focused and localized.

### Fixes included

- **`event/eventbus.go`**: `HasCallback` now uses `RLock`/`RUnlock`
instead of `Lock`/`Unlock` — it only reads `handlerMap` and a full write
lock blocks concurrent `Publish`/`Subscribe` unnecessarily.

- **`component/componentdomain.go`**: Guard nil dereference of
`opts.execPoolSize` in the root-domain fallback path. Also return
`false` from the `WithDependentDomain` option callback when the passed
`ComponentDomain` is not a `*componentDomain`, instead of panicking on
the type assertion.

- **`event/managedeventbus.go`**: Reject non-pointer inputs to
`Register` and `UnregisterAll` with a descriptive error instead of
panicking inside `reflect.Value.Pointer()`.

- **`event/servicebus.go`**: Add `objectPointer` helper that validates
`reflect.Value` kind.
`Register`/`Unregister`/`UnregisterAll`/`Registrations` all use it.
`Registrations` now returns a copy of the internal slice so callers
cannot mutate shared state without the lock.

- **`util/compare.go`**: `Equal` guards against `==` on uncomparable
types (slices, maps, funcs). Falls back to `reflect.DeepEqual` when
either operand has a non-comparable type, preventing runtime panics.

- **`typeid.go`**: Initialize `LocalTypeDomain` to a real named domain
at package init instead of leaving it as a nil `TODO`. Previously
`TypeIdOf` / `newIdFor` would build IDs with a nil domain.

### Larger refactors split out

Two review suggestions warranted their own PRs because they touch many
call sites and benefit from independent review:

- **#20584**: `Publish` → `atomic.Pointer[handlerMap]` — lock-free
`Publish`/`HasCallback`, copy-on-write writers. Eliminates the `RWMutex`
and clone-on-publish entirely.
- **#20585**: `relations` → `[]*component` — removes interface-header
allocation and `unsafe.Pointer` indirection in relations storage.

## Test plan

- [x] `go build ./...`
- [x] `go test ./node/app/...`
- [ ] CI passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants