Fix navigation data race that crashes pages on ARM64 #212

Merged: Aaronontheweb merged 4 commits into dev from fix/navigation-thread-race on May 18, 2026
@Aaronontheweb (Owner)

Summary

ViewModel- and page-initiated navigation called NavigateTo directly, running
NavigateToInternal on whatever thread the caller was on — commonly an async
continuation on the thread pool. That mutated _currentPage concurrently with the
render loop and published a not-yet-bound page, which the render loop could
render (BuildLayout()) before OnBound() had run.

x86-64's TSO memory model masks the unsafe publication, so the code works by
accident on x86-64 Linux/Windows machines. ARM64's weak memory model exposes it
as a NullReferenceException in the page's BuildLayout(), which is the root cause of
netclaw-dev/netclaw#1069
(netclaw init crashes on Apple Silicon when the wizard hands off to the chat page).
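
To make the failure mode above concrete, here is a minimal sketch that replays the bad interleaving on a single thread so it is deterministic. SketchPage and its _rows field are hypothetical stand-ins, not Termina's real page type; the only thing taken from the description is that BuildLayout() can run before OnBound() and then throws a NullReferenceException.

```csharp
#nullable enable
using System;

var page = new SketchPage();

// Step 1: navigation publishes the page before it has been bound.
SketchPage currentPage = page;

// Step 2: the render loop sees the published page and renders it.
try
{
    currentPage.BuildLayout();
}
catch (NullReferenceException)
{
    Console.WriteLine("BuildLayout ran before OnBound");
}

// Step 3: binding completes, too late for the render above.
page.OnBound();
Console.WriteLine($"rows after binding: {currentPage.BuildLayout()}");

// Hypothetical page: its layout depends on state that OnBound() sets up.
sealed class SketchPage
{
    private string[]? _rows;

    public void OnBound() => _rows = new[] { "header", "body" };

    // Throws NullReferenceException if called before OnBound().
    public int BuildLayout() => _rows!.Length;
}
```

In the real bug the two steps run on different threads, so whether step 2 lands between steps 1 and 3 (or observes their writes out of order) depends on timing and on the CPU's memory model.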

Fix

  • Route VM/page navigation through the event channel via a new RequestNavigation
    helper, which posts a NavigationRequested event. NavigateToInternal — and
    every read/write of _currentPage / _currentViewModel / _layoutRoot — now
    runs only on the render-loop thread. This mirrors the NavigationRequested
    plumbing already used for input-driven navigation, so the race is eliminated
    structurally (no locks, no volatile, no memory-model reasoning).
  • Public NavigateTo is unchanged and still used for the initial startup
    navigation (single-threaded, before the render loop exists).
  • Add macos-latest (Apple Silicon) to the Test matrix for ARM64 baseline coverage.
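
The routing described in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: SketchApp, DrainEvents, the string route, and the use of System.Threading.Channels are all hypothetical; only the names RequestNavigation, NavigationRequested, NavigateToInternal, and _currentPage come from the description above.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

var app = new SketchApp();

// A "ViewModel" requests navigation from a thread-pool thread.
await Task.Run(() => app.RequestNavigation("/chat"));

// The request is only queued; no navigation state has changed yet.
Console.WriteLine($"before drain: {app.CurrentPage}");

// The loop thread (here, the main flow) drains events and navigates.
app.DrainEvents();
Console.WriteLine($"after drain: {app.CurrentPage}");

// Hypothetical event type carrying the navigation target.
record NavigationRequested(string Route);

sealed class SketchApp
{
    private readonly Channel<NavigationRequested> _events =
        Channel.CreateUnbounded<NavigationRequested>();

    // Written only by NavigateToInternal, i.e. only by the draining thread.
    private string _currentPage = "/wizard";
    public string CurrentPage => _currentPage;

    // Callable from any thread: it posts a message and mutates nothing.
    public void RequestNavigation(string route) =>
        _events.Writer.TryWrite(new NavigationRequested(route));

    // Called once per loop iteration by the single loop thread.
    public void DrainEvents()
    {
        while (_events.Reader.TryRead(out var evt))
            NavigateToInternal(evt.Route);
    }

    private void NavigateToInternal(string route) => _currentPage = route;
}
```

Because the only cross-thread operation is the channel write, the channel itself provides the synchronization, and the navigation state needs no locks or volatile fields.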

Behavior change

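VM/page navigation is now processed on the next event-loop iteration rather than
synchronously. Every caller is fire-and-forget (none relies on the navigation
having completed when the call returns), so the deferred model fits how
navigation is already used.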

Tests

  • New NavigationThreadSafetyTests — 2 deterministic tests asserting VM- and
    page-initiated navigation is posted to the event channel and not run
    synchronously on the caller's thread.
  • Full suite: 1004 passed, 0 failed.
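
The shape of such a deterministic test might look like the sketch below. It is not the contents of NavigationThreadSafetyTests: the queue, the local functions, and the Check helper are hypothetical stand-ins, and a real test would presumably use the project's test framework and Termina's own types.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-ins for the event channel and navigation internals.
var events = new ConcurrentQueue<string>();
int? navigatedOnThread = null;

void RequestNavigation(string route) => events.Enqueue(route);
void NavigateToInternal(string route) =>
    navigatedOnThread = Environment.CurrentManagedThreadId;

static void Check(bool condition, string message)
{
    if (!condition) throw new InvalidOperationException(message);
}

// Act: a "ViewModel" requests navigation from its own thread.
var caller = new Thread(() => RequestNavigation("/chat"));
caller.Start();
caller.Join();

// Assert: the request was posted and nothing navigated on the caller's thread.
Check(events.Count == 1, "navigation should be queued as an event");
Check(navigatedOnThread is null, "navigation must not run synchronously");

// The loop thread (this one) drains the queue and performs the navigation.
while (events.TryDequeue(out var next))
    NavigateToInternal(next);

Check(navigatedOnThread == Environment.CurrentManagedThreadId,
    "navigation should run on the draining thread");
Console.WriteLine("ok");
```

A test of this shape checks the routing contract only. It cannot observe a reordering or a timing-dependent interleaving, which is why the follow-up below asks for a stress test.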

Follow-up

#211 — add a navigation concurrency stress test (deterministic tests cannot catch
race regressions; that needs a stress test run on the Apple Silicon leg).

Test plan

  • dotnet test — 1004 passed
  • Termina PR CI green on the new macos-latest leg
  • After release, bump Termina in netclaw Directory.Packages.props and confirm the Native Smoke (macOS) init-wizard leg passes

ViewModel- and page-initiated navigation called NavigateTo directly,
running NavigateToInternal on whatever thread the caller was on (often
an async continuation on the thread pool). That mutated _currentPage
concurrently with the render loop and published a not-yet-bound page,
which the render loop could render before OnBound() had run. x86-64's
TSO memory model masks the unsafe publication; ARM64's weak memory
model exposes it as a NullReferenceException in the page's
BuildLayout() (root cause of netclaw-dev/netclaw#1069 — netclaw init
crash on Apple Silicon).

Route VM/page navigation through the event channel via RequestNavigation
so NavigateToInternal — and every read/write of _currentPage — runs only
on the render-loop thread. This mirrors the existing NavigationRequested
plumbing already used for input-driven navigation.

Also add macos-latest (Apple Silicon) to the Test matrix for ARM64
baseline coverage.
@Aaronontheweb added the bug label on May 18, 2026

@Aaronontheweb (Owner, Author) left a comment:

The design makes sense: make navigation part of the event model we use for bubbling everything else.

Comment thread .github/workflows/pr_validation.yml Outdated

netclaw pins macos-26 (Apple Silicon) across its workflows, including
the Native Smoke (macOS) leg that surfaced the navigation race. Pin the
same image here instead of the floating macos-latest for reproducible
ARM64 coverage aligned with netclaw.

The AOT job's osx-arm64 leg still used the floating macos-latest. Pin it
to macos-26 like the Test matrix for consistent ARM64 coverage.
@Aaronontheweb enabled auto-merge (squash) on May 18, 2026 at 16:36
@Aaronontheweb merged commit 070f3a6 into dev on May 18, 2026
8 checks passed
@Aaronontheweb deleted the fix/navigation-thread-race branch on May 18, 2026 at 16:39
@Aaronontheweb mentioned this pull request on May 18, 2026