Skip to content

[c-ares] fix spin loop bug when c-ares gives up on a socket that still has data left in its read buffer#34185

Merged
apolcyn merged 19 commits intogrpc:masterfrom
apolcyn:fix_spin_loop
Aug 30, 2023
Merged

[c-ares] fix spin loop bug when c-ares gives up on a socket that still has data left in its read buffer#34185
apolcyn merged 19 commits intogrpc:masterfrom
apolcyn:fix_spin_loop

Conversation

@apolcyn
Copy link
Copy Markdown
Contributor

@apolcyn apolcyn commented Aug 28, 2023

If we get a readable event on an fd and both the following happens:

  • c-ares does not read all bytes off the fd

  • c-ares removes the fd from the set ARES_GETSOCK_READABLE

... then we have a busy loop here, where we'd keep asking c-ares to process an fd that it no longer cares about.

This is indirectly related to a change in this code one month ago: #33942 - before that change, c-ares would close the socket when it called handle_error and so IsFdStillReadableLocked would start returning false, causing us to get away with this loop. Now, because IsFdStillReadableLocked will keep returning true (because of our overridden close API), we'll loop forever.

The test illustrates one concrete example of how this bug can be hit.

Note that the EE version of this code already gets this right.

Related: internal issue b/297538255

@apolcyn apolcyn requested a review from markdroth as a code owner August 28, 2023 18:15
@apolcyn apolcyn requested a review from yijiem August 28, 2023 18:15
@apolcyn apolcyn added lang/core release notes: yes Indicates if PR needs to be in release notes labels Aug 28, 2023
@apolcyn apolcyn changed the title [c-ares] fix hypothetical spin loop bug [c-ares] fix spin loop bug when c-ares gives up on a socket that still has data left in its read buffer Aug 29, 2023
Copy link
Copy Markdown
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a hack! Thanks Alex for adding the test to reproduce the issue!

// 4) Because the first two bytes were zero, c-ares attempts to malloc a
// zero-length buffer:
// https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/src/lib/ares_process.c#L428.
// 5) Because malloc(0) returns NULL, c-ares invokes handle_error and stops
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nit: maybe say c-ares' default_malloc(0) returns NULL instead? Since it seems like some systems may return a valid pointer on malloc(0): https://github.com/c-ares/c-ares/blob/main/src/lib/ares_library_init.c#L38-L42

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done

@apolcyn apolcyn merged commit a35f282 into grpc:master Aug 30, 2023
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Aug 30, 2023
apolcyn added a commit that referenced this pull request Sep 18, 2023
Now that we have #33942 (and another
follow-up [fix](#34185)), I think the
issue from #25289 is likely fixed
@ti-chi-bot ti-chi-bot bot mentioned this pull request Jul 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bloat/none imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants