-
Notifications
You must be signed in to change notification settings - Fork 9
Description
Issue Reference: ICON-070 disconnect investigation (2024-12-28)
Priority: High
Target: Prevent permanent auto-reconnect disablement after a NetMQ receive-thread exception and improve diagnostics.
⸻
- Objective
- Ensure that any connection error (including NetMQ receive-thread exceptions) triggers:
• deterministic network teardown (StopNetworking()), and
• deterministic auto-reconnect scheduling (_reconnectAt update),
so the client can re-connect and re-join after transient Wi‑Fi/AP events. - Implement the recommended safety pattern:
• Do not execute teardown on the receive thread.
• Hand off error handling to the main thread. - Improve observability:
• Log detailed exception context in ConnectionManager.
• Persist the last exception for later inspection by NetSyncManager.
• Keep verbose stack traces gated behind a debug flag.
- Ensure that any connection error (including NetMQ receive-thread exceptions) triggers:
⸻
- Scope
2.1 Files / Components
• Runtime/NetSyncManager.cs
• Fix HandleConnectionError() logic (previous early-return bug)
• Add main-thread handoff, re-entrancy protection, and deterministic cleanup/scheduling
• Runtime/Internal Scripts/ConnectionManager.cs
• Add exception detail logging
• Store last exception metadata
• Add a method to clear/reset connection error state for reconnect
2.2 Out of Scope
• Server-side timeout changes
• Wi‑Fi/AP configuration changes
• Test plan and release plan (explicitly excluded)
⸻
- Current Root Cause (must be eliminated)
In NetSyncManager.HandleConnectionError(), an early return exists:
• if (_connectionManager.IsConnectionError) { return; }
However, ConnectionManager sets _connectionError = true before raising OnConnectionError.
Therefore, when NetSyncManager receives the callback, IsConnectionError is already true and the handler always exits, causing:
• _reconnectAt not updated → no reconnection scheduling
• StopNetworking() not executed → _receiveThread remains alive
• future Connect() calls are rejected due to _receiveThread != null check
→ auto-reconnect becomes permanently disabled on that device
This early-return gate must be removed and replaced with proper re-entrancy control.
⸻
- Implementation Requirements
4.1 Functional Requirements
• R1: On any connection error, StopNetworking() MUST execute (idempotently) and MUST result in _receiveThread being cleared (null) so subsequent Connect() is not blocked.
• R2: On any connection error, _reconnectAt MUST be set/updated so reconnect attempts continue.
• R3: Connection error handling MUST be safe under multiple rapid error events (re-entrancy safe, no deadlocks).
• R4: Connection error handling MUST NOT call teardown logic directly from the receive thread; it MUST be executed on the main thread.
• R5: ConnectionManager MUST log exception details and retain “last exception” metadata accessible from NetSyncManager.
• R6: Verbose logging (stack trace, extended context) MUST be gated behind a debug flag.
4.2 Non-Functional / Safety Requirements
• Avoid deadlocks (e.g., joining the receive thread from itself).
• Avoid Unity API calls on background threads.
• Logging overhead must remain bounded (single-line summary by default; verbose only when enabled).
⸻
- Detailed Implementation Steps
5.1 NetSyncManager.cs Changes
5.1.1 Remove the faulty early-return gate
Action: In HandleConnectionError() remove:
• if (_connectionManager.IsConnectionError) { return; }
Reason: It prevents required cleanup/scheduling every time.
⸻
5.1.2 Add main-thread handoff for connection errors (recommended)
Action: Ensure the receive-thread callback does not perform teardown. Instead, it should only mark a pending flag that the main thread processes.
Add fields (names may be adjusted to match code style):
• private int _pendingConnectionError;
• use Interlocked to set/clear as atomic (0/1)
• private Exception _pendingConnectionException; (optional but recommended)
• private long _pendingConnectionErrorAtUnixMs; (optional but recommended)
• private bool _isHandlingConnectionError;
• re-entrancy guard for main-thread handler
Callback behavior (OnConnectionError target):
• Do not call StopNetworking() directly.
• Do:
• Interlocked.Exchange(ref _pendingConnectionError, 1);
• Copy exception context if available (see §5.2.2 / §5.2.3):
• _pendingConnectionException = _connectionManager.LastException;
• _pendingConnectionErrorAtUnixMs = _connectionManager.LastExceptionAtUnixMs;
Main-thread polling location:
• In Update() (preferred) or the existing per-frame / periodic method already used to drive reconnection logic:
• Call ProcessPendingConnectionErrorOnMainThread() before HandleReconnection() (or before any connect attempts).
• Rationale: cleanup must occur before a new Connect() attempt.
⸻
5.1.3 Implement ProcessPendingConnectionErrorOnMainThread()
Create a method with the following responsibilities:
1. Consume the pending flag atomically
• if (Interlocked.Exchange(ref _pendingConnectionError, 0) == 0) return;
2. Re-entrancy guard
• If _isHandlingConnectionError is already true, do not run teardown again.
• Set _isHandlingConnectionError = true at start; ensure it is reset in a finally block.
3. Schedule reconnection deterministically
• Always set/update _reconnectAt.
• Use the same time base/type already used by existing _reconnectAt / HandleReconnection() logic.
• Example rule:
• _reconnectAt = now + ReconnectDelaySeconds;
• This MUST happen regardless of prior IsConnectionError state.
4. Execute StopNetworking()
• Call StopNetworking() unconditionally (must be idempotent).
• This must terminate the receive loop, close sockets, and clear _receiveThread.
• If StopNetworking currently can be invoked from any thread, keep it main-thread only for this path.
5. Reset ConnectionManager error state before reconnect (recommended)
• Call a new ConnectionManager.ClearConnectionError() method (see §5.2.4), but do so after stopping and before reconnect attempts.
• This prevents stale “error” state from interfering with subsequent operations or diagnostics.
6. Logging
• Log a single-line summary:
• time
• reconnect schedule
• exception type/message if available
• Example (format guidance in §6):
• [NetSyncManager] ConnectionError detected. Scheduling reconnect. now=... reconnectAt=... exType=... exMsg=...
Important: If the application is quitting or NetSync has been intentionally shut down (if such flags exist), do not schedule reconnection. Respect existing “intentional stop” / “quit” signals.
⸻
5.1.4 Make StopNetworking() idempotent and ensure it clears _receiveThread
Action: Confirm or modify StopNetworking() to satisfy:
• Safe to call multiple times.
• Always clears _receiveThread reference after shutdown completes (or marks it as stopped such that Connect will not be blocked).
• Does not deadlock if called while receive thread is exiting.
Recommended safety rule (must implement if relevant):
• If StopNetworking joins the receive thread, ensure it never attempts to Join() from the receive thread itself:
• if (Thread.CurrentThread == _receiveThread) { skip join; }
Outcome required: After StopNetworking completes, a subsequent Connect() must not be blocked by if (_receiveThread != null) return;.
⸻
5.1.5 Ensure reconnect attempts run after cleanup
Action: In HandleReconnection() (or equivalent), ensure the order is:
1. process pending connection errors (cleanup + schedule)
2. if time ≥ _reconnectAt, attempt Connect()
3. log attempt start/success/failure (summary)
If Connect() can be called repeatedly per frame, add a simple guard if none exists (e.g., _isConnecting), to prevent spamming connection attempts while the server is down.
⸻
5.2 ConnectionManager.cs Changes
5.2.1 Add “last exception” retention (recommended)
Add properties:
• public Exception LastException { get; private set; }
• public long LastExceptionAtUnixMs { get; private set; }
Set these when catching any NetMQ / socket / receive loop exception that triggers OnConnectionError.
Timestamp source: DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() (or equivalent consistent time source).
⸻
5.2.2 Add detailed exception logging (required)
In the exception catch path (where _connectionError = true is set):
Log at least:
• exception type
• exception message
• endpoint info (IP/port or URI, if available)
• thread id (if available)
• timestamp
Default behavior: single-line summary log.
⸻
5.2.3 Add verbose stack trace behind a debug flag (recommended)
Implement a compile-time flag (recommended) to control verbose logging:
• NETSYNC_DEBUG_CONNECTION
Behavior:
• If verbose enabled:
• output stack trace (and any extra internal state helpful for diagnosis)
• If not enabled:
• output only the single-line summary
Implementation approach example:
• #if NETSYNC_DEBUG_CONNECTION to include stack trace logging
• or a private const boolean computed from compile symbols
Do not emit large multi-line logs by default.
⸻
5.2.4 Add ClearConnectionError() (recommended)
Add a method:
• public void ClearConnectionError()
Responsibilities:
• _connectionError = false;
• optionally clear LastException and timestamp (or keep last exception; choose one behavior and document it in code comments)
• Must be thread-safe enough for being called from main thread while receive thread is not running (StopNetworking executed first).
Recommended sequence: NetSyncManager calls StopNetworking() first, then calls ClearConnectionError().
⸻
- Logging Conventions (Implementation Guidance)
6.1 NetSyncManager Logs (summary by default)
• [NetSyncManager] ConnectionError detected. scheduling reconnect. now=<...> reconnectAt=<...> exType=<...> exMsg=<...>
• [NetSyncManager] StopNetworking start. reason=ConnectionError
• [NetSyncManager] StopNetworking completed. receiveThreadCleared=<true/false>
• [NetSyncManager] Reconnect attempt start. attempt= endpoint=<...>
• [NetSyncManager] Reconnect success. endpoint=<...> (and join/room info if available)
6.2 ConnectionManager Logs
• [ConnectionManager] NetMQ exception caught. type=<...> msg=<...> endpoint=<...> thread=<...> at=<...>
• If NETSYNC_DEBUG_CONNECTION:
• Append stack trace (multi-line acceptable only in verbose mode)
⸻
- Definition of Done (No Test Plan; Implementation-Level Criteria Only)
• The early-return gate in NetSyncManager.HandleConnectionError() is removed and replaced by correct logic.
• Connection errors trigger main-thread handling that always:
• schedules reconnect (_reconnectAt updated), and
• calls StopNetworking() (cleanup), ensuring _receiveThread does not remain non-null indefinitely.
• ConnectionManager logs exception details and stores the last exception metadata.
• Verbose stack traces are emitted only when the debug flag is enabled.
• Re-entrancy is controlled such that repeated error notifications do not cause deadlocks, duplicate teardown, or uncontrolled reconnect spam.