Skip to content

fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#29544

Closed
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/29507-tls-fd-recycle-corruption
Closed

fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#29544
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/29507-tls-fd-recycle-corruption

Conversation

@xxxigm

@xxxigm xxxigm commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #29507. P1 — TLS application-data bytes were being written on top of unrelated file descriptors (in production: kanban.db's SQLite header), causing silent persistent-state corruption and data loss on recovery.

Root cause

The reporter's forensic is conclusive — the 24 bytes that overwrote bytes 5..28 of kanban.db are exactly the shape of a TLS 1.2 application-data record (5-byte header 17 03 03 00 13 + 19-byte encrypted payload). The only producer of that byte pattern in the process was the OpenAI/httpx TLS stream from the interrupted Codex request.

Mechanism:

  1. Worker thread (Thread-1616, the _call thread inside interruptible_api_call) creates an OpenAI client, gets a TCP socket with FD X, hands it down through httpx → httpcore → ssl → OpenSSL BIO. The BIO caches X as a raw integer.
  2. The user interrupts. The interrupt-check loop runs on asyncio_0 (a stranger thread w.r.t. the worker) and calls _close_request_client_once(\"interrupt_abort\"). That walks force_close_tcp_sockets (shutdown + close) and then client.close() (which again closes pool sockets).
  3. socket.close() releases the integer FD to the kernel.
  4. The next open(2) in the same process — the kanban dispatcher opening kanban.db — gets FD X.
  5. The worker eventually unblocks (its blocked recv returns), httpx flushes one last TLS application-data record via the still-live BIO, which calls write(X, …). The bytes land in kanban.db because the kernel reassigned X.

This is FD aliasing across threads: the SSL layer's BIO caches the FD as an integer, so once a stranger thread releases that integer to the kernel, anything else opening a new FD wins the race against the worker's pending TLS flush.

Fix (two prongs, both required)

Prong 1 — force_close_tcp_sockets is shutdown-only

shutdown(SHUT_RDWR) from any thread is FD-safe — it sends FIN, breaks pending recv/send, but does NOT release the FD integer. The owning httpx worker's natural unwind closes the underlying socket.socket via the same Python object, which atomically swaps _fd-1 before issuing close(2) — so the FD release happens from exactly one thread that nobody else races.

Prong 2 — _close_request_client_once is thread-aware

Even with prong 1, the followup client.close() in _close_openai_client walks the httpx pool and closes sockets from whatever thread invoked it. We stamp the per-request client holder with the owning thread ident at _set_request_client time and:

  • Owning thread (the worker's finally) → pop holder + full client.close() as before.
  • Stranger thread (interrupt loop, stale detector) → new _abort_request_openai_client: shutdown sockets only, log deferred_close=stranger_thread, do not touch the holder. The worker's finally later sees the client still there and closes it from its own thread context — where FD release races nothing.

Applied symmetrically to both the non-streaming interruptible_api_call and the streaming variant.

Prong 3 (defense-in-depth) — already exists

hermes_cli.kanban_db._validate_sqlite_header already detects this exact TLS-shaped corruption at open time and refuses to operate on the file (that's the kanban dispatcher: ... is not a valid SQLite database line in the reporter's log). Kept intact — it's the reactive layer; this PR is the proactive layer that stops the bytes from being written in the first place.

Why three commits?

  • Commit 1 — Prong 1: the minimal surgical change in force_close_tcp_sockets (drop sock.close(), shutdown only) with a long docstring naming the failure mode by issue number.
  • Commit 2 — Prong 2: owner-thread tracking in both interruptible_api_call variants, plus the new _abort_request_openai_client agent method that does shutdown-only with a distinguishable log line.
  • Commit 3 — Tests: 10 regressions in a dedicated tests/run_agent/test_tls_fd_recycle_corruption.py, plus the existing test_force_close_tcp_sockets_descends_httpcore_1_connection_wrapper updated to assert close_calls == 0 (was 1).

Test plan

  • pytest tests/run_agent/test_tls_fd_recycle_corruption.py → 10 new tests pass.
  • pytest tests/run_agent/test_streaming.py tests/run_agent/test_stream_interrupt_retry.py tests/run_agent/test_create_openai_client_reuse.py tests/run_agent/test_primary_runtime_restore.py tests/hermes_cli/test_kanban_db.py → 247 existing tests still pass.
  • Linter clean.
  • Reviewer: in production, after deploying, look for the new OpenAI client aborted (... deferred_close=stranger_thread) log lines on /stop-interrupt scenarios — the absence of subsequent kanban dispatcher: ... is not a valid SQLite database errors confirms the race is closed.

Observability changes (heads-up for dashboards)

The interrupt-abort path now emits OpenAI client aborted (..., tcp_force_closed=N, deferred_close=stranger_thread) instead of OpenAI client closed (..., tcp_force_closed=N). The tcp_force_closed=N field shape is preserved so existing parsers keep working; the new deferred_close=stranger_thread marker is purely additive.

xxxigm added 3 commits May 21, 2026 07:20
…esearch#29507)

The helper used to call ``socket.shutdown(SHUT_RDWR)`` followed by
``socket.close()`` to drop CLOSE-WAIT entries immediately. On its own
``shutdown()`` is safe from any thread — it only sends FIN and breaks
pending ``recv``/``send`` — but ``close()`` releases the FD integer to
the kernel. When the helper runs on a stranger thread (the interrupt
loop, the stale-call detector) the FD release races the owning httpx
worker thread that still has the same integer cached inside the SSL
BIO. The kernel then recycles that integer to the next ``open()`` call
— in production, kanban dispatcher's ``kanban.db`` — and the worker's
delayed TLS flush writes a 24-byte TLS application-data record on top
of the SQLite header.

Restrict the helper to ``shutdown(SHUT_RDWR)`` only. The owning httpx
worker's own unwind will close the underlying socket via the same
Python ``socket.socket`` object, which atomically swaps ``_fd`` to -1
before issuing ``close(2)`` — no FD-aliasing window.

The log field ``tcp_force_closed=N`` is kept (now counts shutdowns) so
existing dashboards / log parsers keep working.
…upt (NousResearch#29507)

Layer-2 defense for the FD-recycling race: even with
``force_close_tcp_sockets`` reduced to shutdown-only, the followup
``client.close()`` in ``_close_openai_client`` still walks the httpx
pool and closes sockets — and if called from a stranger thread (the
interrupt-check loop, the stale-call detector) it has the same
FD-recycling exposure that wrote a TLS record on top of ``kanban.db``.

Stamp the request_client_holder with the owning thread's ident at
``_set_request_client`` time. In ``_close_request_client_once``:

* Owning thread (the worker's ``finally``) → pop + ``client.close()``
  via ``_close_request_openai_client``, exactly as before.
* Stranger thread → ``_abort_request_openai_client`` (new): only
  ``shutdown(SHUT_RDWR)`` the pool sockets and log a deferred-close
  marker. The holder stays populated so the worker's eventual
  ``finally`` performs the real close from its own thread context,
  where the FD release races nothing.

Applied symmetrically to both the non-streaming
``interruptible_api_call`` and the streaming variant — both routinely
get hit by stranger-thread interrupts.

The log field ``tcp_force_closed=N`` keeps its existing shape; the new
abort path adds ``deferred_close=stranger_thread`` so production
triage can distinguish the two close kinds.
…NousResearch#29507)

Ten regressions across both prongs of the NousResearch#29507 fix, organised so each
test names exactly which way the bug could come back:

Prong 1 — ``force_close_tcp_sockets``:
* ``shutdown_only_no_close`` is the smoking-gun assertion. If a future
  refactor adds back ``sock.close()`` to this helper, the FD-recycling
  race that wrote TLS bytes on top of ``kanban.db`` is back, and this
  trips.
* ``uses_shut_rdwr`` pins that both halves are shut down (a half-close
  wouldn't unblock a worker stuck in ``recv``).
* ``swallows_oserror_on_shutdown`` covers the already-shutdown case.
* ``handles_multiple_pool_entries`` walks all pool connections.

Prong 2 — thread-aware ``_close_request_client_once``:
* ``stranger_thread_aborts_only_no_close`` simulates the asyncio_0 →
  Thread-1616 interrupt path: stranger drives abort, holder stays
  populated for the worker's eventual finally.
* ``owner_thread_pops_and_full_close`` is the worker-thread path: pops
  + full close.
* ``stranger_then_owner_close_sequence_runs_full_close_exactly_once``
  replays the reporter's exact timeline at object level: abort runs
  once, full close runs once, holder ends empty.

Agent surface:
* ``_abort_request_openai_client_does_not_call_client_close`` pins
  that the new entrypoint shuts sockets and emits the
  ``deferred_close=stranger_thread`` marker but never calls
  ``client.close()``.
* ``_abort_request_openai_client_null_client_is_noop`` defensive.

End-to-end:
* ``fd_recycle_window_closed_by_shutdown_only`` reproduces the race
  at object level — runs the abort path from a stranger thread and
  asserts that no ``close()`` ever fires, so the kernel can never
  recycle the FD under the owner's still-active reference.
@teknium1

Copy link
Copy Markdown
Contributor

Salvaged onto current main as #30858 and merged (commit 5b6f0b6). Your 3 commits were cherry-picked with your authorship preserved. Thanks for the deep forensic and the two-prong fix!

@teknium1 teknium1 closed this May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Interrupted OpenAI/httpx request thread survives across turns and writes TLS record bytes to unrelated file descriptors on delayed close

3 participants