Fix auto-reconnect failure after receive-thread exceptions#260
Fix auto-reconnect failure after receive-thread exceptions#260
Conversation
Co-authored-by: from2001 <387880+from2001@users.noreply.github.com>
Co-authored-by: from2001 <387880+from2001@users.noreply.github.com>
Co-authored-by: from2001 <387880+from2001@users.noreply.github.com>
|
@claude Review this PR. |
|
Claude finished @from2001's task in 1m 29s —— View job PR Review: Fix auto-reconnect failure after receive-thread exceptionsTodo List
SummaryThis PR successfully addresses the auto-reconnect failure bug by implementing a main-thread handoff pattern for connection error handling. The implementation is well-designed with proper thread safety considerations and detailed logging. ✅ Strengths1. Correct Root Cause FixThe early-return gate ( 2. Strong Thread Safety Design
3. Excellent Observability
4. Proper Unity Patterns
5. Deadlock PreventionThe self-thread check in if (Thread.CurrentThread != _receiveThread)
{
WaitThreadExit(_receiveThread, 1000);
}🔍 Issues Found1. Missing Avatar Cleanup in Main-Thread Handler
|
|
@copilot Claude found some issues. Fix them if it’s reasonable. |
…improve docs Co-authored-by: from2001 <387880+from2001@users.noreply.github.com>
I've addressed the reasonable issues from Claude's review in commit 0838060: Fixed (High Priority):
Fixed (Medium Priority):
Fixed (Low Priority):
Not Changed:
|
|
@claude Review the latest code. |
|
Claude finished @from2001's task in 1m 33s —— View job Code Review: Latest ChangesTodo List
SummaryThe latest commit (0838060) successfully addresses all three high/medium priority issues from the previous review. The implementation is now production-ready with proper avatar cleanup, exception reference management, and clear documentation. ✅ Verified Fixes1. Avatar Cleanup Added (Issue #1 - High Priority) ✅Location: NetSyncManager.cs:969 Fixed: The missing // Cleanup avatars and presence
_avatarManager?.CleanupRemoteAvatars();
_humanPresenceManager?.CleanupAll();This now matches the cleanup pattern in Verification: ✅ Confirmed at lines 969-970 2. Exception Memory Leak Fixed (Issue #2 - Medium Priority) ✅Location: ConnectionManager.cs:259 Fixed: public void ClearConnectionError()
{
_connectionError = false;
// Clear exception reference to prevent memory retention
// (NetSyncManager has already copied it to _pendingConnectionException)
_lastException = null;
// Keep timestamp for basic diagnostics
}Rationale: The comment correctly explains that NetSyncManager already copied the exception to Verification: ✅ Confirmed at lines 256-261 3. Documentation Improved (Issue #4 - Low Priority) ✅Location: NetSyncManager.cs:982-983 Fixed: The comment for /// <summary>
/// Handles connection errors from both receive thread (via OnConnectionError callback)
/// and main thread (via send failures). Thread-safe: uses atomic handoff to main thread.
/// </summary>This clearly documents that the method can be called from either thread and explains the thread-safety mechanism. Verification: ✅ Confirmed at lines 981-984 🔍 Additional Code ReviewSince this is a fresh review of the complete implementation, I've checked for other potential issues beyond the previous review items: Thread Safety Analysis ✅1. Atomic Operations - Correct
2. Volatile Fields - Correct
3. Memory Ordering - Excellent _lastExceptionAtUnixMs = timestamp;
_lastException = ex_local;NetSyncManager reads in the same order (lines 995-996): var timestamp = _connectionManager.LastExceptionAtUnixMs;
var exception = _connectionManager.LastException;This ensures proper happens-before relationships in the memory model. 4. Re-entrancy Protection - Correct Unity-Specific Patterns ✅1. Main Thread Safety
2. Null Check Pattern
Error Handling Flow ✅Complete Error Path:
This flow is deterministic, thread-safe, and deadlock-free. ✅ Logging and Observability ✅1. Exception Details Logged
2. Stack Traces Gated 3. Reconnection Context Logged DebugLog($"ConnectionError detected. Scheduling reconnect. " +
$"now={Time.time:F2} reconnectAt={_reconnectAt:F2} " +
$"exType={exType} exMsg={exMsg}");📋 Minor Observations (Not Issues)1. Exception Handling Pattern Choice (Intentional Design)As noted in the previous review, Current approach benefits:
Trade-off: Main-thread errors go through an extra handoff step, but this is negligible overhead and keeps the design simple. Verdict: This is a valid design choice, not a bug. The unified approach is safer and easier to maintain. ✅ 2. Room Switching IntegrationThe error handler properly resets room switching state:
This ensures connection errors during room switching don't leave the system in an inconsistent state. ✅ 3. Discovery IntegrationThe error handler doesn't explicitly reset discovery state, but this is correct because:
Separation of concerns is maintained. ✅ 🎯 Final VerdictStatus: APPROVED ✅ All issues from the previous review have been properly addressed:
Implementation Quality
Recommended Next Steps1. Testing (As Previously Recommended)
2. Integration Testing
3. Documentation (Optional) 🚀 ConclusionThis PR successfully implements the main-thread handoff pattern as specified in the requirements. The implementation is production-ready with proper thread safety, comprehensive error handling, and excellent observability. The fixes in commit 0838060 address all outstanding concerns from the previous review. Great work on implementing a robust, thread-safe solution! 🎉 |
_avatarManager?.CleanupRemoteAvatars()call in error handler (Issue v0.0.2 #1)_lastExceptioninClearConnectionError()to prevent memory leak (Issue Add Claude Code GitHub Workflow #2)HandleConnectionErrorcomment to clarify both thread contexts (Issue Update LLM CLI GitHub Actions #4)Original prompt
This section details on the original issue you should resolve
<issue_title>Fix “No Auto-Rejoin After Connection Error” and Add Recommended Resilience/Observability Improvements</issue_title>
<issue_description>Issue Reference: ICON-070 disconnect investigation (2024-12-28)
Priority: High
Target: Prevent permanent auto-reconnect disablement after a NetMQ receive-thread exception and improve diagnostics.
⸻
• deterministic network teardown (StopNetworking()), and
• deterministic auto-reconnect scheduling (_reconnectAt update),
so the client can re-connect and re-join after transient Wi‑Fi/AP events.
• Do not execute teardown on the receive thread.
• Hand off error handling to the main thread.
• Log detailed exception context in ConnectionManager.
• Persist the last exception for later inspection by NetSyncManager.
• Keep verbose stack traces gated behind a debug flag.
⸻
2.1 Files / Components
• Runtime/NetSyncManager.cs
• Fix HandleConnectionError() logic (previous early-return bug)
• Add main-thread handoff, re-entrancy protection, and deterministic cleanup/scheduling
• Runtime/Internal Scripts/ConnectionManager.cs
• Add exception detail logging
• Store last exception metadata
• Add a method to clear/reset connection error state for reconnect
2.2 Out of Scope
• Server-side timeout changes
• Wi‑Fi/AP configuration changes
• Test plan and release plan (explicitly excluded)
⸻
In NetSyncManager.HandleConnectionError(), an early return exists:
• if (_connectionManager.IsConnectionError) { return; }
However, ConnectionManager sets _connectionError = true before raising OnConnectionError.
Therefore, when NetSyncManager receives the callback, IsConnectionError is already true and the handler always exits, causing:
• _reconnectAt not updated → no reconnection scheduling
• StopNetworking() not executed → _receiveThread remains alive
• future Connect() calls are rejected due to _receiveThread != null check
→ auto-reconnect becomes permanently disabled on that device
This early-return gate must be removed and replaced with proper re-entrancy control.
⸻
4.1 Functional Requirements
• R1: On any connection error, StopNetworking() MUST execute (idempotently) and MUST result in _receiveThread being cleared (null) so subsequent Connect() is not blocked.
• R2: On any connection error, _reconnectAt MUST be set/updated so reconnect attempts continue.
• R3: Connection error handling MUST be safe under multiple rapid error events (re-entrancy safe, no deadlocks).
• R4: Connection error handling MUST NOT call teardown logic directly from the receive thread; it MUST be executed on the main thread.
• R5: ConnectionManager MUST log exception details and retain “last exception” metadata accessible from NetSyncManager.
• R6: Verbose logging (stack trace, extended context) MUST be gated behind a debug flag.
4.2 Non-Functional / Safety Requirements
• Avoid deadlocks (e.g., joining the receive thread from itself).
• Avoid Unity API calls on background threads.
• Logging overhead must remain bounded (single-line summary by default; verbose only when enabled).
⸻
5.1 NetSyncManager.cs Changes
5.1.1 Remove the faulty early-return gate
Action: In HandleConnectionError() remove:
• if (_connectionManager.IsConnectionError) { return; }
Reason: It prevents required cleanup/scheduling every time.
⸻
5.1.2 Add main-thread handoff for connection errors (recommended)
Action: Ensure the receive-thread callback does not perform teardown. Instead, it should only mark a pending flag that the main thread processes.
Add fields (names may be adjusted to match code style):
• private int _pendingConnectionError;
• use Interlocked to set/clear as atomic (0/1)
• private Exception _pendingConnectionException; (optional but recommended)
• private long _pendingConnectionErrorAtUnixMs; (optional but recommended)
• private bool _isHandlingConnectionError;
• re-entrancy guard for main-thread handler
Callback behavior (OnConnectionError target):
• Do not call StopNetworking() directly.
• Do:
• Interlocked.Exchange(ref _pendingConnectionError, 1);
• Copy exception context if available (see §5.2.2 / §5.2.3):
• _pendingConnectionException = _connectionManager.LastException;
• _pendingConnectionErrorAtUnixMs = _connectionManager.LastExceptionAtUnixMs;
Main-thread polling location:
• In Update() (preferred) or the existing per-frame / periodic method already used to drive reconnection logic:
• Call ProcessPendingConnectionErrorOnMainThread() before HandleReconnection() (or before any connect attempts).
• Rationale: cleanup must occur before a new Connect() attempt.
⸻
5.1.3 Implement ProcessPendingConnec...
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.