Skip to content

fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#30858

Merged
teknium1 merged 3 commits into
mainfrom
hermes/hermes-49e3731c
May 23, 2026
Merged

fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#30858
teknium1 merged 3 commits into
mainfrom
hermes/hermes-49e3731c

Conversation

@teknium1

@teknium1 teknium1 commented May 23, 2026

Copy link
Copy Markdown
Contributor

Salvages #29544 (@xxxigm) onto current main. Original branch was 74 commits behind.

Summary

Closes the cross-thread FD-recycling race in interrupt/stale-call paths that let TLS application-data bytes overwrite the kanban.db SQLite header. Without this fix, an interrupted OpenAI/httpx request closed from a stranger thread released its TLS socket FD; the kernel could reuse that integer for the next open() (kanban.db), and the worker's still-live SSL BIO then wrote a delayed TLS record into the wrong file.

Changes (3 commits, contributor authorship preserved)

  • agent/agent_runtime_helpers.py: force_close_tcp_sockets is shut-down-only now — never calls sock.close(). The owning httpx worker thread releases the FD via Python's atomic _fd→-1-then-os.close path.
  • agent/chat_completion_helpers.py: per-request client holder stamps owner thread ident; _close_request_client_once dispatches owning-thread → full client.close(), stranger-thread → new _abort_request_openai_client (sockets only, leave client in holder for the worker's finally to close). Symmetric across non-streaming and streaming variants.
  • run_agent.py: _abort_request_openai_client method emits OpenAI client aborted (..., tcp_force_closed=N, deferred_close=stranger_thread) log line (existing tcp_force_closed=N field shape preserved for log parsers).
  • tests/run_agent/test_tls_fd_recycle_corruption.py: 10 new regressions pinning both prongs.
  • tests/run_agent/test_create_openai_client_reuse.py: existing FD-descent test updated to assert close_calls == 0.

Validation

  • tests/run_agent/test_tls_fd_recycle_corruption.py + test_create_openai_client_reuse.py: 13/13 pass.
  • test_streaming.py test_stream_interrupt_retry.py test_primary_runtime_restore.py test_kanban_db.py: 231 pass; 3 failures are local anthropic SDK ImportErrors unrelated to this PR (CI has the package via [all]).

Closes #29507.

Infographic

kanban.db corruption defense

xxxigm added 3 commits May 23, 2026 02:23
The helper used to call ``socket.shutdown(SHUT_RDWR)`` followed by
``socket.close()`` to drop CLOSE-WAIT entries immediately. On its own
``shutdown()`` is safe from any thread — it only sends FIN and breaks
pending ``recv``/``send`` — but ``close()`` releases the FD integer to
the kernel. When the helper runs on a stranger thread (the interrupt
loop, the stale-call detector) the FD release races the owning httpx
worker thread that still has the same integer cached inside the SSL
BIO. The kernel then recycles that integer to the next ``open()`` call
— in production, kanban dispatcher's ``kanban.db`` — and the worker's
delayed TLS flush writes a 24-byte TLS application-data record on top
of the SQLite header.

Restrict the helper to ``shutdown(SHUT_RDWR)`` only. The owning httpx
worker's own unwind will close the underlying socket via the same
Python ``socket.socket`` object, which atomically swaps ``_fd`` to -1
before issuing ``close(2)`` — no FD-aliasing window.

The log field ``tcp_force_closed=N`` is kept (now counts shutdowns) so
existing dashboards / log parsers keep working.
…upt (#29507)

Layer-2 defense for the FD-recycling race: even with
``force_close_tcp_sockets`` reduced to shutdown-only, the followup
``client.close()`` in ``_close_openai_client`` still walks the httpx
pool and closes sockets — and if called from a stranger thread (the
interrupt-check loop, the stale-call detector) it has the same
FD-recycling exposure that wrote a TLS record on top of ``kanban.db``.

Stamp the request_client_holder with the owning thread's ident at
``_set_request_client`` time. In ``_close_request_client_once``:

* Owning thread (the worker's ``finally``) → pop + ``client.close()``
  via ``_close_request_openai_client``, exactly as before.
* Stranger thread → ``_abort_request_openai_client`` (new): only
  ``shutdown(SHUT_RDWR)`` the pool sockets and log a deferred-close
  marker. The holder stays populated so the worker's eventual
  ``finally`` performs the real close from its own thread context,
  where the FD release races nothing.

Applied symmetrically to both the non-streaming
``interruptible_api_call`` and the streaming variant — both routinely
get hit by stranger-thread interrupts.

The log field ``tcp_force_closed=N`` keeps its existing shape; the new
abort path adds ``deferred_close=stranger_thread`` so production
triage can distinguish the two close kinds.
…#29507)

Ten regressions across both prongs of the #29507 fix, organised so each
test names exactly which way the bug could come back:

Prong 1 — ``force_close_tcp_sockets``:
* ``shutdown_only_no_close`` is the smoking-gun assertion. If a future
  refactor adds back ``sock.close()`` to this helper, the FD-recycling
  race that wrote TLS bytes on top of ``kanban.db`` is back, and this
  trips.
* ``uses_shut_rdwr`` pins that both halves are shut down (a half-close
  wouldn't unblock a worker stuck in ``recv``).
* ``swallows_oserror_on_shutdown`` covers the already-shutdown case.
* ``handles_multiple_pool_entries`` walks all pool connections.

Prong 2 — thread-aware ``_close_request_client_once``:
* ``stranger_thread_aborts_only_no_close`` simulates the asyncio_0 →
  Thread-1616 interrupt path: stranger drives abort, holder stays
  populated for the worker's eventual finally.
* ``owner_thread_pops_and_full_close`` is the worker-thread path: pops
  + full close.
* ``stranger_then_owner_close_sequence_runs_full_close_exactly_once``
  replays the reporter's exact timeline at object level: abort runs
  once, full close runs once, holder ends empty.

Agent surface:
* ``_abort_request_openai_client_does_not_call_client_close`` pins
  that the new entrypoint shuts sockets and emits the
  ``deferred_close=stranger_thread`` marker but never calls
  ``client.close()``.
* ``_abort_request_openai_client_null_client_is_noop`` defensive.

End-to-end:
* ``fd_recycle_window_closed_by_shutdown_only`` reproduces the race
  at object level — runs the abort path from a stranger thread and
  asserts that no ``close()`` ever fires, so the kernel can never
  recycle the FD under the owner's still-active reference.
@teknium1 teknium1 merged commit 5b6f0b6 into main May 23, 2026
22 of 24 checks passed
@teknium1 teknium1 deleted the hermes/hermes-49e3731c branch May 23, 2026 09:31
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround labels May 23, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interrupted OpenAI/httpx request thread survives across turns and writes TLS record bytes to unrelated file descriptors on delayed close

3 participants