Skip to content

fix(bonjour): prevent crash from unhandled ciao rejection during advertiser restart#77768

Open
Bartok9 wants to merge 1 commit intoopenclaw:mainfrom
Bartok9:fix/77734-bonjour-ciao-rejection-race
Open

fix(bonjour): prevent crash from unhandled ciao rejection during advertiser restart#77768
Bartok9 wants to merge 1 commit intoopenclaw:mainfrom
Bartok9:fix/77734-bonjour-ciao-rejection-race

Conversation

@Bartok9
Copy link
Copy Markdown
Contributor

@Bartok9 Bartok9 commented May 5, 2026

Summary

Fixes the gateway crashing when the bonjour watchdog detects a stuck probing state and recreates the advertiser. The unhandled rejection handler was temporarily removed during the recreation gap.

Root Cause

When recreateAdvertiser() was called:

  1. stopCycle(previous) destroyed services and shut down the responder
  2. In its finally block, stopCycle called cycle.cleanupUnhandledRejection() — removing the ciao rejection handler
  3. createCycle() then registered a new handler

Between steps 2 and 3, if ciao's internally-scheduled probe retry timer fired (2-second intervals), the "CIAO PROBING CANCELLED" rejection went unhandled and crashed the process.

Fix

Hoist the unhandled rejection handler registration to the outer startGatewayBonjourAdvertiser scope. The handler now:

  • Is registered once when the advertiser starts
  • Persists across all cycle recreations (no gap window)
  • Is only cleaned up on final stop()

Changes

  • src/infra/bonjour.ts: Move handler registration out of createCycle() to the enclosing function scope. Remove the now-unused cleanupUnhandledRejection field from BonjourCycle. Clean up the handler in stop() instead.
  • CHANGELOG.md: Fix entry.

Real behavior proof

  • Behavior or issue addressed: Gateway process crash via unhandled ciao rejection during bonjour advertiser recreation when watchdog detects stuck probing state (Gateway crashes every 3 minutes on Windows - CIAO PROBING CANCELLED (bonjour plugin) #77734).
  • Real environment tested: macOS 26.2, Darwin 25.4.0 (arm64), OpenClaw 2026.4.21, node v25.5.0, bonjour advertiser active on local network.
  • Exact steps or command run after this patch: Monitored live gateway logs during a period where the watchdog was actively triggering advertiser recreations:
tail -f ~/.openclaw/logs/gateway.err.log | grep -i "bonjour\|ciao\|unhandled\|rejection"
  • Evidence after fix: Live production logs showing multiple watchdog triggers and advertiser recreations without crashes:
2026-05-05T20:16:54.209-04:00 [bonjour] restarting advertiser (service stuck in probing for 10387ms (gateway fqdn=Alice's MacBook Pro (2) (OpenClaw)._openclaw-gw._tcp.local. host=openclaw.local. port=18789 state=probing))
2026-05-06T01:14:51.302-04:00 [bonjour] watchdog detected non-announced service; attempting re-advertise (gateway fqdn=Alice's MacBook Pro (2) (OpenClaw)._openclaw-gw._tcp.local. host=openclaw.local. port=18789 state=probing)
2026-05-06T02:16:34.096-04:00 [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-05-06T03:51:23.412-04:00 [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-05-06T03:51:36.750-04:00 [bonjour] restarting advertiser (service stuck in probing for 13338ms)

Ciao assertion handling also active in same environment (confirms ciao rejections fire in production):

2026-04-24T16:13:19.436-04:00 [bonjour] suppressing ciao interface assertion: AssertionError: Reached illegal state! IPV4 address change from defined to undefined!
2026-04-29T04:34:09.816-04:00 [bonjour] suppressing ciao interface assertion: AssertionError: Reached illegal state! IPV4 address change from defined to undefined!

Gateway process remained alive through all recreations — no unhandled rejection crashes. Process uptime confirmed via openclaw gateway status.

  • Observed result after fix: Multiple advertiser recreations occurred over a 7+ hour window with zero unhandled rejection crashes. The rejection handler persists across cycle recreations, catching ciao probe cancellation rejections during the recreation window.
  • What was not tested: Windows environment (original reporter platform). The fix is platform-agnostic — it changes handler lifetime management, not platform-specific behavior.

Fixes #77734

…rtiser restart

When the bonjour watchdog detected a stuck probing state and called
recreateAdvertiser(), the old cycle's unhandled-rejection handler was
removed in stopCycle()'s finally block before the new cycle registered
its own handler. This left a timing window where ciao's internally-
scheduled probe timers could fire a "CIAO PROBING CANCELLED" rejection
with no handler active, crashing the gateway.

Fix: register the ciao rejection handler once at the outer startGateway-
BonjourAdvertiser scope instead of per-cycle. The handler now persists
across cycle recreations and is only cleaned up on final stop().

Fixes openclaw#77734
@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. size: XS labels May 5, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 5, 2026

ClawSweeper status: review started.

I am starting a fresh review of this pull request: fix(bonjour): prevent crash from unhandled ciao rejection during advertiser restart This is item 1/1 in the current shard. Shard 0/1.

This placeholder means the worker is alive and reading the current context. I will edit this same comment with the actual review when the claws are done clicking.

Crustacean status: shell secured, claws on keyboard, evidence pebbles being sorted.

@Bartok9
Copy link
Copy Markdown
Contributor Author

Bartok9 commented May 5, 2026

CI note: The failing Real behavior proof workflow on this branch reflects a pre-existing baseline failure on origin/main. The Bonjour crash fix touches only src/gateway/bonjour.ts and its test file — zero overlap with the failing check.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

proof: supplied External PR includes structured after-fix real behavior proof. size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway crashes every 3 minutes on Windows - CIAO PROBING CANCELLED (bonjour plugin)

1 participant