Skip to content

fix(run_agent): prevent _create_openai_client from mutating caller's kwargs#11056

Merged
kshitijk4poor merged 1 commit into
mainfrom
fix/kwargs-mutation-guard
Apr 16, 2026
Merged

fix(run_agent): prevent _create_openai_client from mutating caller's kwargs#11056
kshitijk4poor merged 1 commit into
mainfrom
fix/kwargs-mutation-guard

Conversation

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Cherry-picked from #10978 by @taeuk178.

Shallow-copy client_kwargs at the top of _create_openai_client() to prevent in-place mutation from leaking back into self._client_kwargs. Defensive fix that locks the contract for future httpx/transport work.

Test Results

tests/run_agent/test_create_openai_client_kwargs_isolation.py: 1 passed

…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from #10978 by @taeuk178.
@swnb

swnb commented Apr 16, 2026

Copy link
Copy Markdown

I independently hit the same root cause while debugging gateway instability on macOS, and I wanted to add a few datapoints that may help reviewers validate the fix.

What I observed locally:

  • Long-running hermes gateway run --replace processes using openai-codex would start producing repeated APIConnectionError / APITimeoutError, even on tiny prompts (2-7 messages, ~5k-10k tokens).
  • In the same environment, a fresh-shell hermes chat -q ... often still succeeded.
  • Replaying the exact request-dump body from a failed gateway turn as a fresh streaming POST to https://chatgpt.com/backend-api/codex/responses returned HTTP 200, which strongly pointed away from payload content and toward daemon/client state.
  • lsof on the gateway daemons showed stale TCP state (CLOSE_WAIT) in the long-lived process.

That matched the exact issue fixed here: _create_openai_client() was mutating caller-owned kwargs, so gateway-held self._client_kwargs could retain a concrete http_client, and later request clients were not actually fresh.

Additional note: on my side, preserving OpenAI SDK timeout / connection-limit defaults alongside the keepalive client was also helpful for keeping gateway behavior aligned with fresh-shell behavior.

For traceability, I documented the full reproduction and validation notes in #11070, and closed my duplicate PR #11072 in favor of this one.

@kshitijk4poor kshitijk4poor merged commit 896e7b0 into main Apr 16, 2026
6 of 7 checks passed
@kshitijk4poor kshitijk4poor deleted the fix/kwargs-mutation-guard branch April 16, 2026 14:45
@kshitijk4poor

Copy link
Copy Markdown
Collaborator Author

I independently hit the same root cause while debugging gateway instability on macOS, and I wanted to add a few datapoints that may help reviewers validate the fix.

What I observed locally:

  • Long-running hermes gateway run --replace processes using openai-codex would start producing repeated APIConnectionError / APITimeoutError, even on tiny prompts (2-7 messages, ~5k-10k tokens).
  • In the same environment, a fresh-shell hermes chat -q ... often still succeeded.
  • Replaying the exact request-dump body from a failed gateway turn as a fresh streaming POST to https://chatgpt.com/backend-api/codex/responses returned HTTP 200, which strongly pointed away from payload content and toward daemon/client state.
  • lsof on the gateway daemons showed stale TCP state (CLOSE_WAIT) in the long-lived process.

That matched the exact issue fixed here: _create_openai_client() was mutating caller-owned kwargs, so gateway-held self._client_kwargs could retain a concrete http_client, and later request clients were not actually fresh.

Additional note: on my side, preserving OpenAI SDK timeout / connection-limit defaults alongside the keepalive client was also helpful for keeping gateway behavior aligned with fresh-shell behavior.

For traceability, I documented the full reproduction and validation notes in #11070, and closed my duplicate PR #11072 in favor of this one.

Thanks for the findings. Could you please verify if this PR fixes your issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants