Skip to content

Fix “No Auto-Rejoin After Connection Error” and Add Recommended Resilience/Observability Improvements #259

@from2001

Description

@from2001

Issue Reference: ICON-070 disconnect investigation (2024-12-28)
Priority: High
Target: Prevent permanent auto-reconnect disablement after a NetMQ receive-thread exception and improve diagnostics.

  1. Objective
    1. Ensure that any connection error (including NetMQ receive-thread exceptions) triggers:
      • deterministic network teardown (StopNetworking()), and
      • deterministic auto-reconnect scheduling (_reconnectAt update),
      so the client can re-connect and re-join after transient Wi‑Fi/AP events.
    2. Implement the recommended safety pattern:
      • Do not execute teardown on the receive thread.
      • Hand off error handling to the main thread.
    3. Improve observability:
      • Log detailed exception context in ConnectionManager.
      • Persist the last exception for later inspection by NetSyncManager.
      • Keep verbose stack traces gated behind a debug flag.

  1. Scope

2.1 Files / Components
• Runtime/NetSyncManager.cs
• Fix HandleConnectionError() logic (previous early-return bug)
• Add main-thread handoff, re-entrancy protection, and deterministic cleanup/scheduling
• Runtime/Internal Scripts/ConnectionManager.cs
• Add exception detail logging
• Store last exception metadata
• Add a method to clear/reset connection error state for reconnect

2.2 Out of Scope
• Server-side timeout changes
• Wi‑Fi/AP configuration changes
• Test plan and release plan (explicitly excluded)

  1. Current Root Cause (must be eliminated)

In NetSyncManager.HandleConnectionError(), an early return exists:
• if (_connectionManager.IsConnectionError) { return; }

However, ConnectionManager sets _connectionError = true before raising OnConnectionError.
Therefore, when NetSyncManager receives the callback, IsConnectionError is already true and the handler always exits, causing:
• _reconnectAt not updated → no reconnection scheduling
• StopNetworking() not executed → _receiveThread remains alive
• future Connect() calls are rejected due to _receiveThread != null check
→ auto-reconnect becomes permanently disabled on that device

This early-return gate must be removed and replaced with proper re-entrancy control.

  1. Implementation Requirements

4.1 Functional Requirements
• R1: On any connection error, StopNetworking() MUST execute (idempotently) and MUST result in _receiveThread being cleared (null) so subsequent Connect() is not blocked.
• R2: On any connection error, _reconnectAt MUST be set/updated so reconnect attempts continue.
• R3: Connection error handling MUST be safe under multiple rapid error events (re-entrancy safe, no deadlocks).
• R4: Connection error handling MUST NOT call teardown logic directly from the receive thread; it MUST be executed on the main thread.
• R5: ConnectionManager MUST log exception details and retain “last exception” metadata accessible from NetSyncManager.
• R6: Verbose logging (stack trace, extended context) MUST be gated behind a debug flag.

4.2 Non-Functional / Safety Requirements
• Avoid deadlocks (e.g., joining the receive thread from itself).
• Avoid Unity API calls on background threads.
• Logging overhead must remain bounded (single-line summary by default; verbose only when enabled).

  1. Detailed Implementation Steps

5.1 NetSyncManager.cs Changes

5.1.1 Remove the faulty early-return gate

Action: In HandleConnectionError() remove:
• if (_connectionManager.IsConnectionError) { return; }

Reason: It prevents required cleanup/scheduling every time.

5.1.2 Add main-thread handoff for connection errors (recommended)

Action: Ensure the receive-thread callback does not perform teardown. Instead, it should only mark a pending flag that the main thread processes.

Add fields (names may be adjusted to match code style):
• private int _pendingConnectionError;
• use Interlocked to set/clear as atomic (0/1)
• private Exception _pendingConnectionException; (optional but recommended)
• private long _pendingConnectionErrorAtUnixMs; (optional but recommended)
• private bool _isHandlingConnectionError;
• re-entrancy guard for main-thread handler

Callback behavior (OnConnectionError target):
• Do not call StopNetworking() directly.
• Do:
• Interlocked.Exchange(ref _pendingConnectionError, 1);
• Copy exception context if available (see §5.2.2 / §5.2.3):
• _pendingConnectionException = _connectionManager.LastException;
• _pendingConnectionErrorAtUnixMs = _connectionManager.LastExceptionAtUnixMs;

Main-thread polling location:
• In Update() (preferred) or the existing per-frame / periodic method already used to drive reconnection logic:
• Call ProcessPendingConnectionErrorOnMainThread() before HandleReconnection() (or before any connect attempts).
• Rationale: cleanup must occur before a new Connect() attempt.

5.1.3 Implement ProcessPendingConnectionErrorOnMainThread()

Create a method with the following responsibilities:
1. Consume the pending flag atomically
• if (Interlocked.Exchange(ref _pendingConnectionError, 0) == 0) return;
2. Re-entrancy guard
• If _isHandlingConnectionError is already true, do not run teardown again.
• Set _isHandlingConnectionError = true at start; ensure it is reset in a finally block.
3. Schedule reconnection deterministically
• Always set/update _reconnectAt.
• Use the same time base/type already used by existing _reconnectAt / HandleReconnection() logic.
• Example rule:
• _reconnectAt = now + ReconnectDelaySeconds;
• This MUST happen regardless of prior IsConnectionError state.
4. Execute StopNetworking()
• Call StopNetworking() unconditionally (must be idempotent).
• This must terminate the receive loop, close sockets, and clear _receiveThread.
• If StopNetworking currently can be invoked from any thread, keep it main-thread only for this path.
5. Reset ConnectionManager error state before reconnect (recommended)
• Call a new ConnectionManager.ClearConnectionError() method (see §5.2.4), but do so after stopping and before reconnect attempts.
• This prevents stale “error” state from interfering with subsequent operations or diagnostics.
6. Logging
• Log a single-line summary:
• time
• reconnect schedule
• exception type/message if available
• Example (format guidance in §6):
• [NetSyncManager] ConnectionError detected. Scheduling reconnect. now=... reconnectAt=... exType=... exMsg=...

Important: If the application is quitting or NetSync has been intentionally shut down (if such flags exist), do not schedule reconnection. Respect existing “intentional stop” / “quit” signals.

5.1.4 Make StopNetworking() idempotent and ensure it clears _receiveThread

Action: Confirm or modify StopNetworking() to satisfy:
• Safe to call multiple times.
• Always clears _receiveThread reference after shutdown completes (or marks it as stopped such that Connect will not be blocked).
• Does not deadlock if called while receive thread is exiting.

Recommended safety rule (must implement if relevant):
• If StopNetworking joins the receive thread, ensure it never attempts to Join() from the receive thread itself:
• if (Thread.CurrentThread == _receiveThread) { skip join; }

Outcome required: After StopNetworking completes, a subsequent Connect() must not be blocked by if (_receiveThread != null) return;.

5.1.5 Ensure reconnect attempts run after cleanup

Action: In HandleReconnection() (or equivalent), ensure the order is:
1. process pending connection errors (cleanup + schedule)
2. if time ≥ _reconnectAt, attempt Connect()
3. log attempt start/success/failure (summary)

If Connect() can be called repeatedly per frame, add a simple guard if none exists (e.g., _isConnecting), to prevent spamming connection attempts while the server is down.

5.2 ConnectionManager.cs Changes

5.2.1 Add “last exception” retention (recommended)

Add properties:
• public Exception LastException { get; private set; }
• public long LastExceptionAtUnixMs { get; private set; }

Set these when catching any NetMQ / socket / receive loop exception that triggers OnConnectionError.

Timestamp source: DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() (or equivalent consistent time source).

5.2.2 Add detailed exception logging (required)

In the exception catch path (where _connectionError = true is set):

Log at least:
• exception type
• exception message
• endpoint info (IP/port or URI, if available)
• thread id (if available)
• timestamp

Default behavior: single-line summary log.

5.2.3 Add verbose stack trace behind a debug flag (recommended)

Implement a compile-time flag (recommended) to control verbose logging:
• NETSYNC_DEBUG_CONNECTION

Behavior:
• If verbose enabled:
• output stack trace (and any extra internal state helpful for diagnosis)
• If not enabled:
• output only the single-line summary

Implementation approach example:
• #if NETSYNC_DEBUG_CONNECTION to include stack trace logging
• or a private const boolean computed from compile symbols

Do not emit large multi-line logs by default.

5.2.4 Add ClearConnectionError() (recommended)

Add a method:
• public void ClearConnectionError()

Responsibilities:
• _connectionError = false;
• optionally clear LastException and timestamp (or keep last exception; choose one behavior and document it in code comments)
• Must be thread-safe enough for being called from main thread while receive thread is not running (StopNetworking executed first).

Recommended sequence: NetSyncManager calls StopNetworking() first, then calls ClearConnectionError().

  1. Logging Conventions (Implementation Guidance)

6.1 NetSyncManager Logs (summary by default)
• [NetSyncManager] ConnectionError detected. scheduling reconnect. now=<...> reconnectAt=<...> exType=<...> exMsg=<...>
• [NetSyncManager] StopNetworking start. reason=ConnectionError
• [NetSyncManager] StopNetworking completed. receiveThreadCleared=<true/false>
• [NetSyncManager] Reconnect attempt start. attempt= endpoint=<...>
• [NetSyncManager] Reconnect success. endpoint=<...> (and join/room info if available)

6.2 ConnectionManager Logs
• [ConnectionManager] NetMQ exception caught. type=<...> msg=<...> endpoint=<...> thread=<...> at=<...>
• If NETSYNC_DEBUG_CONNECTION:
• Append stack trace (multi-line acceptable only in verbose mode)

  1. Definition of Done (No Test Plan; Implementation-Level Criteria Only)
    • The early-return gate in NetSyncManager.HandleConnectionError() is removed and replaced by correct logic.
    • Connection errors trigger main-thread handling that always:
    • schedules reconnect (_reconnectAt updated), and
    • calls StopNetworking() (cleanup), ensuring _receiveThread does not remain non-null indefinitely.
    • ConnectionManager logs exception details and stores the last exception metadata.
    • Verbose stack traces are emitted only when the debug flag is enabled.
    • Re-entrancy is controlled such that repeated error notifications do not cause deadlocks, duplicate teardown, or uncontrolled reconnect spam.

Metadata

Metadata

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions