fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#30858
Merged
Conversation
The helper used to call ``socket.shutdown(SHUT_RDWR)`` followed by ``socket.close()`` to drop CLOSE-WAIT entries immediately. On its own ``shutdown()`` is safe from any thread — it only sends FIN and breaks pending ``recv``/``send`` — but ``close()`` releases the FD integer to the kernel. When the helper runs on a stranger thread (the interrupt loop, the stale-call detector) the FD release races the owning httpx worker thread that still has the same integer cached inside the SSL BIO. The kernel then recycles that integer to the next ``open()`` call — in production, kanban dispatcher's ``kanban.db`` — and the worker's delayed TLS flush writes a 24-byte TLS application-data record on top of the SQLite header. Restrict the helper to ``shutdown(SHUT_RDWR)`` only. The owning httpx worker's own unwind will close the underlying socket via the same Python ``socket.socket`` object, which atomically swaps ``_fd`` to -1 before issuing ``close(2)`` — no FD-aliasing window. The log field ``tcp_force_closed=N`` is kept (now counts shutdowns) so existing dashboards / log parsers keep working.
…upt (#29507) Layer-2 defense for the FD-recycling race: even with ``force_close_tcp_sockets`` reduced to shutdown-only, the followup ``client.close()`` in ``_close_openai_client`` still walks the httpx pool and closes sockets — and if called from a stranger thread (the interrupt-check loop, the stale-call detector) it has the same FD-recycling exposure that wrote a TLS record on top of ``kanban.db``. Stamp the request_client_holder with the owning thread's ident at ``_set_request_client`` time. In ``_close_request_client_once``: * Owning thread (the worker's ``finally``) → pop + ``client.close()`` via ``_close_request_openai_client``, exactly as before. * Stranger thread → ``_abort_request_openai_client`` (new): only ``shutdown(SHUT_RDWR)`` the pool sockets and log a deferred-close marker. The holder stays populated so the worker's eventual ``finally`` performs the real close from its own thread context, where the FD release races nothing. Applied symmetrically to both the non-streaming ``interruptible_api_call`` and the streaming variant — both routinely get hit by stranger-thread interrupts. The log field ``tcp_force_closed=N`` keeps its existing shape; the new abort path adds ``deferred_close=stranger_thread`` so production triage can distinguish the two close kinds.
…#29507) Ten regressions across both prongs of the #29507 fix, organised so each test names exactly which way the bug could come back: Prong 1 — ``force_close_tcp_sockets``: * ``shutdown_only_no_close`` is the smoking-gun assertion. If a future refactor adds back ``sock.close()`` to this helper, the FD-recycling race that wrote TLS bytes on top of ``kanban.db`` is back, and this trips. * ``uses_shut_rdwr`` pins that both halves are shut down (a half-close wouldn't unblock a worker stuck in ``recv``). * ``swallows_oserror_on_shutdown`` covers the already-shutdown case. * ``handles_multiple_pool_entries`` walks all pool connections. Prong 2 — thread-aware ``_close_request_client_once``: * ``stranger_thread_aborts_only_no_close`` simulates the asyncio_0 → Thread-1616 interrupt path: stranger drives abort, holder stays populated for the worker's eventual finally. * ``owner_thread_pops_and_full_close`` is the worker-thread path: pops + full close. * ``stranger_then_owner_close_sequence_runs_full_close_exactly_once`` replays the reporter's exact timeline at object level: abort runs once, full close runs once, holder ends empty. Agent surface: * ``_abort_request_openai_client_does_not_call_client_close`` pins that the new entrypoint shuts sockets and emits the ``deferred_close=stranger_thread`` marker but never calls ``client.close()``. * ``_abort_request_openai_client_null_client_is_noop`` defensive. End-to-end: * ``fd_recycle_window_closed_by_shutdown_only`` reproduces the race at object level — runs the abort path from a stranger thread and asserts that no ``close()`` ever fires, so the kernel can never recycle the FD under the owner's still-active reference.
4 tasks
teknium1
added a commit
that referenced
this pull request
May 23, 2026
Gpapas
pushed a commit
to Gpapas/hermes-agent
that referenced
this pull request
May 23, 2026
exosyphon
pushed a commit
to exosyphon/hermes-agent
that referenced
this pull request
May 24, 2026
sahilm-ti
pushed a commit
to sahilm-ti/hermes-agent
that referenced
this pull request
May 25, 2026
sahilm-ti
pushed a commit
to sahilm-ti/hermes-agent
that referenced
this pull request
May 25, 2026
mathias3
pushed a commit
to mathias3/hermes-agent
that referenced
this pull request
May 28, 2026
Bryce-huang
pushed a commit
to wbkunlun/hermes-agent
that referenced
this pull request
May 29, 2026
mosaiq-systems
pushed a commit
to mosaiq-systems/hermes-agent
that referenced
this pull request
May 29, 2026
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
Seven74AI
pushed a commit
to Seven74AI/hermes-agent
that referenced
this pull request
Jun 13, 2026
alt-glitch
pushed a commit
that referenced
this pull request
Jun 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Salvages #29544 (@xxxigm) onto current main. Original branch was 74 commits behind.
Summary
Closes the cross-thread FD-recycling race in interrupt/stale-call paths that let TLS application-data bytes overwrite the kanban.db SQLite header. Without this fix, an interrupted OpenAI/httpx request closed from a stranger thread released its TLS socket FD; the kernel could reuse that integer for the next
open()(kanban.db), and the worker's still-live SSL BIO then wrote a delayed TLS record into the wrong file.Changes (3 commits, contributor authorship preserved)
agent/agent_runtime_helpers.py:force_close_tcp_socketsis shut-down-only now — never callssock.close(). The owning httpx worker thread releases the FD via Python's atomic_fd→-1-then-os.closepath.agent/chat_completion_helpers.py: per-request client holder stamps owner thread ident;_close_request_client_oncedispatches owning-thread → fullclient.close(), stranger-thread → new_abort_request_openai_client(sockets only, leave client in holder for the worker'sfinallyto close). Symmetric across non-streaming and streaming variants.run_agent.py:_abort_request_openai_clientmethod emitsOpenAI client aborted (..., tcp_force_closed=N, deferred_close=stranger_thread)log line (existingtcp_force_closed=Nfield shape preserved for log parsers).tests/run_agent/test_tls_fd_recycle_corruption.py: 10 new regressions pinning both prongs.tests/run_agent/test_create_openai_client_reuse.py: existing FD-descent test updated to assertclose_calls == 0.Validation
tests/run_agent/test_tls_fd_recycle_corruption.py+test_create_openai_client_reuse.py: 13/13 pass.test_streaming.py test_stream_interrupt_retry.py test_primary_runtime_restore.py test_kanban_db.py: 231 pass; 3 failures are localanthropicSDK ImportErrors unrelated to this PR (CI has the package via[all]).Closes #29507.
Infographic