Error/closing stability #79

willmmiles · 2025-07-30T03:20:34Z

I found a couple of logic errors in the error event handling:

_reset_tcp_callbacks() must not be called in the tcp_err() callback, as the pcb has already been free()d by LwIP. This was writing zeros to memory potentially already recycled on the other core.
There was a potential race between a client closing or aborting a connection, and a FIN or ERROR event:
1. Client decides to destruct an AsyncClient on async task - say, after completing a send -- which calls AsyncClient::close(). _pcb checks pass; _tcp_close_api() is queued on LwIP task.
2. LwIP task process FIN or ERROR callback for that client. AsyncClient::_pcb is nulled out, and the event is queued.
3. LwIP task services _tcp_close_api(). *msg->pcb is null, so no action occurs. Event remains queued.
4. AsyncClient::~AsyncClient() proceeds, oblivious to queued event.
5. FIN/ERROR event is processed for now-destroyed AsyncClient. Kaboom! (Common pattern was that old client memory is still valid, so the dispose callback got run again, double-free()ing the client.)

When calling _tcp_close_api, guarantee that no events for that client remain in the queue, even if the pcb was shut down. This fixes a race where an error callback is processed by the LwIP thread, invalidating the pcb*, after the async thread has dequeued an event that will lead to a close (such as a fin or final ack) but before close() is called.

When close is invoked, make sure to interlock with the LwIP task and clear the queue; checks of pcb are not safe as it is owned by the LwIP task. Extend this checking to include abort(), as well.

When tcp_error is called, the pcb has already been freed by LwIP. Remove the attempt to write through the invalidated pointer.

If close is being run when the pcb is nulled, and we find an event on the queue, run dispose. This catches cases where fin or error is pending.

mathieucarbou · 2025-07-30T05:48:44Z

Could this be the cause of the error seen recently here ?
emsesp/EMS-ESP32#2581 (comment)

mathieucarbou

Thank you for these big fixes 👍🙏

Copilot

Pull Request Overview

This PR addresses critical stability issues in AsyncTCP's error handling and connection teardown logic. The changes fix memory corruption bugs and race conditions that could lead to crashes when handling connection errors and closures.

Key Changes:

Prevented memory corruption by avoiding callback reset on already-freed PCBs in error handlers
Fixed race condition between async client destruction and queued FIN/ERROR events
Enhanced event queue cleanup to ensure no stale events remain for destroyed clients

Copilot · 2025-07-30T05:51:20Z

src/AsyncTCP.cpp

    removed_event_chain = t->next;
    _free_event(t);
  }
+  return count;


The variable 'count' is being returned but is never declared or initialized in the function. This will cause a compilation error.

Copilot · 2025-07-30T05:51:20Z

src/AsyncTCP.cpp

    *msg->pcb = nullptr;  // PCB is now the property of LwIP
+  } else {
+    // Ensure there is not an error event queued for this client
+    if (_remove_events_for_client(msg->close)) {


The function _remove_events_for_client now returns a size_t (count), but this condition will be true for any non-zero count. Consider being explicit about what you're checking for, such as 'if (_remove_events_for_client(msg->close) > 0)' for clarity.

Suggested change

if (_remove_events_for_client(msg->close)) {

if (_remove_events_for_client(msg->close) > 0) {

Copilot · 2025-07-30T05:51:21Z

src/AsyncTCP.cpp

+    msg->err = ERR_OK;
+  } else {
+    // Ensure there is not an error event queued for this client
+    _remove_events_for_client(msg->close);


[nitpick] The return value of _remove_events_for_client is ignored here, unlike in the close API where it's used to determine if dispose should be run. Consider documenting why the return value is not needed in the abort case or handle it consistently.

Suggested change

_remove_events_for_client(msg->close);

if (_remove_events_for_client(msg->close)) {

msg->err = ERR_OK; // dispose needs to be run

}

me-no-dev · 2025-07-30T07:40:14Z

@mathieucarbou I think we discussed this exact case a while back. Thanks for fixing it @willmmiles

proddy · 2025-07-30T08:45:18Z

Could this be the cause of the error seen recently here ? emsesp/EMS-ESP32#2581 (comment)

worth checking....

mathieucarbou · 2025-08-01T20:44:02Z

@willmmiles FYI a memory leak was detected by @gluca in this PR.

See: ESP32Async/ESPAsyncWebServer#253 (reply in thread)

Caused by Firefox triggering an SSL handshake on a non-SSL connection in background, which causes the request to be aborted. Maybe the disconnect cb is not called for this use case anymore ?

Calling `AsyncClient::abort()` will now enqueue an error event, as it did prior to #79. This fixes memory leaks caused by missing dispose callbacks.

willmmiles added 4 commits July 27, 2025 16:01

Additional close semantic fixes

f2ec361

When close is invoked, make sure to interlock with the LwIP task and clear the queue; checks of pcb are not safe as it is owned by the LwIP task. Extend this checking to include abort(), as well.

Fix use-after-free in tcp_error

97c6772

When tcp_error is called, the pcb has already been freed by LwIP. Remove the attempt to write through the invalidated pointer.

Ensure dispose is run in close if needed

f736219

If close is being run when the pcb is nulled, and we find an event on the queue, run dispose. This catches cases where fin or error is pending.

mathieucarbou approved these changes Jul 30, 2025

View reviewed changes

mathieucarbou assigned willmmiles Jul 30, 2025

mathieucarbou requested review from Copilot and me-no-dev July 30, 2025 05:50

mathieucarbou added the Status: Pending Merge label Jul 30, 2025

Copilot AI reviewed Jul 30, 2025

View reviewed changes

me-no-dev approved these changes Jul 30, 2025

View reviewed changes

mathieucarbou merged commit 22bae29 into ESP32Async:main Jul 30, 2025
28 checks passed

proddy mentioned this pull request Jul 30, 2025

Crashes with EMS-ESP-3_7_3-dev_12-ESP32-16MB+.bin emsesp/EMS-ESP32#2581

Closed

mathieucarbou mentioned this pull request Aug 1, 2025

discard callback must be called on abort() #81

Closed

willmmiles mentioned this pull request Aug 2, 2025

Restore abort dispose semantics #82

Merged

mathieucarbou pushed a commit that referenced this pull request Aug 2, 2025

Restore abort dispose semantics

6563e2e

Calling `AsyncClient::abort()` will now enqueue an error event, as it did prior to #79. This fixes memory leaks caused by missing dispose callbacks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Error/closing stability #79

Error/closing stability #79

Uh oh!

willmmiles commented Jul 30, 2025

Uh oh!

mathieucarbou commented Jul 30, 2025

Uh oh!

mathieucarbou left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jul 30, 2025

Uh oh!

Copilot AI Jul 30, 2025

Uh oh!

Copilot AI Jul 30, 2025

Uh oh!

me-no-dev commented Jul 30, 2025

Uh oh!

Uh oh!

proddy commented Jul 30, 2025

Uh oh!

mathieucarbou commented Aug 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	if (_remove_events_for_client(msg->close)) {
	if (_remove_events_for_client(msg->close) > 0) {

Error/closing stability #79

Error/closing stability #79

Uh oh!

Conversation

willmmiles commented Jul 30, 2025

Uh oh!

mathieucarbou commented Jul 30, 2025

Uh oh!

mathieucarbou left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Key Changes:

Uh oh!

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

me-no-dev commented Jul 30, 2025

Uh oh!

Uh oh!

proddy commented Jul 30, 2025

Uh oh!

mathieucarbou commented Aug 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants