Skip to content

Conversation

@willmmiles
Copy link

I found a couple of logic errors in the error event handling:

  • _reset_tcp_callbacks() must not be called in the tcp_err() callback, as the pcb has already been free()d by LwIP. This was writing zeros to memory potentially already recycled on the other core.
  • There was a potential race between a client closing or aborting a connection, and a FIN or ERROR event:
    1. Client decides to destruct an AsyncClient on async task - say, after completing a send -- which calls AsyncClient::close(). _pcb checks pass; _tcp_close_api() is queued on LwIP task.
    2. LwIP task process FIN or ERROR callback for that client. AsyncClient::_pcb is nulled out, and the event is queued.
    3. LwIP task services _tcp_close_api(). *msg->pcb is null, so no action occurs. Event remains queued.
    4. AsyncClient::~AsyncClient() proceeds, oblivious to queued event.
    5. FIN/ERROR event is processed for now-destroyed AsyncClient. Kaboom! (Common pattern was that old client memory is still valid, so the dispose callback got run again, double-free()ing the client.)

When calling _tcp_close_api, guarantee that no events for that client
remain in the queue, even if the pcb was shut down.  This fixes a race
where an error callback is processed by the LwIP thread, invalidating
the pcb*, after the async thread has dequeued an event that will lead
to a close (such as a fin or final ack) but before close() is called.
When close is invoked, make sure to interlock with the LwIP task and
clear the queue; checks of pcb are not safe as it is owned by the LwIP
task.   Extend this checking to include abort(), as well.
When tcp_error is called, the pcb has already been freed by LwIP.
Remove the attempt to write through the invalidated pointer.
If close is being run when the pcb is nulled, and we find an event on
the queue, run dispose.  This catches cases where fin or error is
pending.
@mathieucarbou
Copy link
Member

Could this be the cause of the error seen recently here ?
emsesp/EMS-ESP32#2581 (comment)

Copy link
Member

@mathieucarbou mathieucarbou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these big fixes 👍🙏

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses critical stability issues in AsyncTCP's error handling and connection teardown logic. The changes fix memory corruption bugs and race conditions that could lead to crashes when handling connection errors and closures.

Key Changes:

  • Prevented memory corruption by avoiding callback reset on already-freed PCBs in error handlers
  • Fixed race condition between async client destruction and queued FIN/ERROR events
  • Enhanced event queue cleanup to ensure no stale events remain for destroyed clients

removed_event_chain = t->next;
_free_event(t);
}
return count;
Copy link

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable 'count' is being returned but is never declared or initialized in the function. This will cause a compilation error.

Copilot uses AI. Check for mistakes.
*msg->pcb = nullptr; // PCB is now the property of LwIP
} else {
// Ensure there is not an error event queued for this client
if (_remove_events_for_client(msg->close)) {
Copy link

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function _remove_events_for_client now returns a size_t (count), but this condition will be true for any non-zero count. Consider being explicit about what you're checking for, such as 'if (_remove_events_for_client(msg->close) > 0)' for clarity.

Suggested change
if (_remove_events_for_client(msg->close)) {
if (_remove_events_for_client(msg->close) > 0) {

Copilot uses AI. Check for mistakes.
msg->err = ERR_OK;
} else {
// Ensure there is not an error event queued for this client
_remove_events_for_client(msg->close);
Copy link

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The return value of _remove_events_for_client is ignored here, unlike in the close API where it's used to determine if dispose should be run. Consider documenting why the return value is not needed in the abort case or handle it consistently.

Suggested change
_remove_events_for_client(msg->close);
if (_remove_events_for_client(msg->close)) {
msg->err = ERR_OK; // dispose needs to be run
}

Copilot uses AI. Check for mistakes.
@me-no-dev
Copy link
Member

@mathieucarbou I think we discussed this exact case a while back. Thanks for fixing it @willmmiles

@mathieucarbou mathieucarbou merged commit 22bae29 into ESP32Async:main Jul 30, 2025
28 checks passed
@proddy
Copy link

proddy commented Jul 30, 2025

Could this be the cause of the error seen recently here ? emsesp/EMS-ESP32#2581 (comment)

worth checking....

@mathieucarbou
Copy link
Member

@willmmiles FYI a memory leak was detected by @gluca in this PR.

See: ESP32Async/ESPAsyncWebServer#253 (reply in thread)

Caused by Firefox triggering an SSL handshake on a non-SSL connection in background, which causes the request to be aborted. Maybe the disconnect cb is not called for this use case anymore ?

mathieucarbou pushed a commit that referenced this pull request Aug 2, 2025
Calling `AsyncClient::abort()` will now enqueue an error event, as it
did prior to #79.  This fixes memory leaks caused by missing dispose
callbacks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants