-
Notifications
You must be signed in to change notification settings - Fork 36
Error/closing stability #79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error/closing stability #79
Conversation
When calling _tcp_close_api, guarantee that no events for that client remain in the queue, even if the pcb was shut down. This fixes a race where an error callback is processed by the LwIP thread, invalidating the pcb*, after the async thread has dequeued an event that will lead to a close (such as a fin or final ack) but before close() is called.
When close is invoked, make sure to interlock with the LwIP task and clear the queue; checks of pcb are not safe as it is owned by the LwIP task. Extend this checking to include abort(), as well.
When tcp_error is called, the pcb has already been freed by LwIP. Remove the attempt to write through the invalidated pointer.
If close is being run when the pcb is nulled, and we find an event on the queue, run dispose. This catches cases where fin or error is pending.
|
Could this be the cause of the error seen recently here ? |
mathieucarbou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these big fixes 👍🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses critical stability issues in AsyncTCP's error handling and connection teardown logic. The changes fix memory corruption bugs and race conditions that could lead to crashes when handling connection errors and closures.
Key Changes:
- Prevented memory corruption by avoiding callback reset on already-freed PCBs in error handlers
- Fixed race condition between async client destruction and queued FIN/ERROR events
- Enhanced event queue cleanup to ensure no stale events remain for destroyed clients
| removed_event_chain = t->next; | ||
| _free_event(t); | ||
| } | ||
| return count; |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable 'count' is being returned but is never declared or initialized in the function. This will cause a compilation error.
| *msg->pcb = nullptr; // PCB is now the property of LwIP | ||
| } else { | ||
| // Ensure there is not an error event queued for this client | ||
| if (_remove_events_for_client(msg->close)) { |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function _remove_events_for_client now returns a size_t (count), but this condition will be true for any non-zero count. Consider being explicit about what you're checking for, such as 'if (_remove_events_for_client(msg->close) > 0)' for clarity.
| if (_remove_events_for_client(msg->close)) { | |
| if (_remove_events_for_client(msg->close) > 0) { |
| msg->err = ERR_OK; | ||
| } else { | ||
| // Ensure there is not an error event queued for this client | ||
| _remove_events_for_client(msg->close); |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The return value of _remove_events_for_client is ignored here, unlike in the close API where it's used to determine if dispose should be run. Consider documenting why the return value is not needed in the abort case or handle it consistently.
| _remove_events_for_client(msg->close); | |
| if (_remove_events_for_client(msg->close)) { | |
| msg->err = ERR_OK; // dispose needs to be run | |
| } |
|
@mathieucarbou I think we discussed this exact case a while back. Thanks for fixing it @willmmiles |
worth checking.... |
|
@willmmiles FYI a memory leak was detected by @gluca in this PR. See: ESP32Async/ESPAsyncWebServer#253 (reply in thread) Caused by Firefox triggering an SSL handshake on a non-SSL connection in background, which causes the request to be aborted. Maybe the disconnect cb is not called for this use case anymore ? |
Calling `AsyncClient::abort()` will now enqueue an error event, as it did prior to #79. This fixes memory leaks caused by missing dispose callbacks.
I found a couple of logic errors in the error event handling:
_reset_tcp_callbacks()must not be called in thetcp_err()callback, as the pcb has already beenfree()d by LwIP. This was writing zeros to memory potentially already recycled on the other core.AsyncClienton async task - say, after completing a send -- which callsAsyncClient::close()._pcbchecks pass;_tcp_close_api()is queued on LwIP task.AsyncClient::_pcbis nulled out, and the event is queued._tcp_close_api().*msg->pcbis null, so no action occurs. Event remains queued.AsyncClient::~AsyncClient()proceeds, oblivious to queued event.free()ing the client.)