fix(sdk-py): make tools_agent fake model stateless#7930

Merged
Nick Hollon (nick-hollon-lc) merged 1 commit into main from nh/fix-tools-agent-flake
May 28, 2026

Conversation

@nick-hollon-lc Nick Hollon (nick-hollon-lc) commented May 28, 2026

Contributor

Problem

`test_tools.py::test_tools_async` (sdk-py integration suite) flakes with `AssertionError: expected at least one tool call handle` (`assert []`), while `test_tools_sync` passes. The failure was observed on #7927's CI, but the test predates that PR (it was added in #7833) and is unchanged on that branch, so this is a pre-existing flake.

Root cause

`integration/graph/tools_agent.py` drives `create_agent` with a module-global `FakeMessagesListChatModel`. That model keeps an instance counter `i` that persists across runs and cycles `0 -> 1 -> 0`. The graph only works if every run starts at an even index, where the first model call issues the `search` tool call. On the licensed server (multiple queue workers / graph warmup), a run can start mid-cycle at an odd index. The model then replies "done." first, `create_agent` ends after one model call, and zero `tools`-channel events reach the wire. `thread.tool_calls` correctly yields nothing and the assertion fails.

This is why only the async test flaked: it runs before the sync test, absorbs the odd-parity run (one model call), and flips the shared cursor back to even, so the sync test then passes.
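The parity dependence can be reproduced without the server. What follows is a minimal sketch in plain Python, not the real `FakeMessagesListChatModel` or `create_agent`: `CyclingFakeModel` and `run_agent` are hypothetical stand-ins that only mimic the persistent cursor and the stop-when-no-tool-call loop described above.

```python
from dataclasses import dataclass


@dataclass
class CyclingFakeModel:
    """Hypothetical stand-in: replays a fixed response list with a persistent cursor."""

    responses: tuple = ("tool_call:search", "done.")
    i: int = 0  # persists across runs because the instance is module-global

    def invoke(self) -> str:
        reply = self.responses[self.i]
        self.i = (self.i + 1) % len(self.responses)  # cycles 0 -> 1 -> 0
        return reply


def run_agent(model: CyclingFakeModel) -> int:
    """Hypothetical agent loop: call the model until it stops requesting tools.

    Returns the number of tool calls emitted during the run.
    """
    tool_calls = 0
    while model.invoke().startswith("tool_call:"):
        tool_calls += 1  # a real agent would run the tool and append its result here
    return tool_calls


shared = CyclingFakeModel()

# Even start: tool call first, then "done." (two model calls, cursor back at 0).
even_start = run_agent(shared)

# Assumption for illustration: a warmup or another worker consumes one response.
shared.invoke()

# Odd start: "done." comes first, so the run ends after a single model call.
odd_start = run_agent(shared)  # plays the role of test_tools_async

# That single call flipped the cursor back to even, so the next run is healthy.
next_run = run_agent(shared)  # plays the role of test_tools_sync

print(even_start, odd_start, next_run)
```

In this sketch the odd-parity run is the only one that emits no tool call, and it repairs the cursor for whichever run comes next, which matches the observed pattern of the earlier test failing and the later one passing.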

Evidence

  • Reproduced the exact graph outside Docker: at even `i` the run emits 1 tool call + 1 `ToolMessage`; at odd `i` it emits 0 of each. Verified that the state-based replacement emits exactly one tool call across repeated runs with `i` flipped each time.
  • The failing CI run's server logs show the `tools_agent` run succeeded (`run_exec_ms=15`) with no tool-channel events. The server emitted none because the graph made no tool call, so this is not an SDK delivery bug.

Fix

Derive the reply from conversation state rather than from a cycling response list: issue the `search` tool call until a `ToolMessage` is present, then return a terminating `AIMessage`. This is order-independent, so every run emits exactly one tool call regardless of prior invocation count. The model still subclasses `FakeMessagesListChatModel` (not `GenericFakeChatModel`) to preserve the `_stream` behavior that keeps `tool_calls` intact.
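The state-derived rule can be sketched roughly as follows. This is an illustration in plain Python rather than the code merged in this PR: `Message`, `stateless_reply`, and `run_agent` are hypothetical stand-ins for the LangChain message types, the overridden model method, and the `create_agent` loop, and the empty `args` dict is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for a chat message; role is "human", "ai" or "tool"."""

    role: str
    content: str = ""
    tool_calls: list = field(default_factory=list)


def stateless_reply(messages: list[Message]) -> Message:
    """Choose the reply from the conversation alone; no counter is kept."""
    if any(m.role == "tool" for m in messages):
        # A tool result is already present, so terminate the run.
        return Message(role="ai", content="done.")
    # No tool result yet: request the search tool (args are a placeholder).
    return Message(role="ai", tool_calls=[{"name": "search", "args": {}}])


def run_agent(prompt: str) -> int:
    """Hypothetical agent loop; returns how many tool calls the run emitted."""
    messages = [Message(role="human", content=prompt)]
    emitted = 0
    while True:
        reply = stateless_reply(messages)
        messages.append(reply)
        if not reply.tool_calls:
            return emitted
        emitted += len(reply.tool_calls)
        # Stand-in for executing the tool and appending its ToolMessage.
        messages.append(Message(role="tool", content="result"))


# Each run starts from a fresh conversation, so prior runs cannot affect it.
counts = [run_agent("hi") for _ in range(5)]
print(counts)
```

Because the decision depends only on the messages passed in, interleaved runs on different workers or an extra warmup invocation cannot shift which reply a run sees first.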

Test plan

  • sdk-py integration test green (the flaky check)
  • Local: ruff format / ruff check clean; graph compiles.

… flake

`tools_agent.py` drove `create_agent` with a module-global
`FakeMessagesListChatModel` whose response cursor `i` persists and cycles
`0 -> 1 -> 0` across runs. The test relies on every run starting at an even
index so the first model call issues the `search` tool call. On the licensed
server (multiple queue workers / graph warmup) a run can start mid-cycle at an
odd index, so the model replies "done." first and emits no tool call. The
`tools` channel then produces zero events and `test_tools_async` fails with
"expected at least one tool call handle".

This is order-dependent, which is why only `test_tools_async` flaked:
it runs before `test_tools_sync`, absorbs the odd-parity run (one model call),
and resets the shared cursor back to even so the sync test passes.

Derive the reply from conversation state instead: issue the `search` tool call
until a `ToolMessage` is present, then a terminating `AIMessage`. This is
order-independent, so every run emits exactly one tool call regardless of how
many times the model was previously invoked.
@mdrxy Mason Daugherty (mdrxy) changed the title from "fix(sdk-py): make tools_agent fake model stateless to fix integration flake" to "fix(sdk-py): make tools_agent fake model stateless" May 28, 2026
@nick-hollon-lc Nick Hollon (nick-hollon-lc) merged commit ea4aa79 into main May 28, 2026
132 of 134 checks passed
@nick-hollon-lc Nick Hollon (nick-hollon-lc) deleted the nh/fix-tools-agent-flake branch May 28, 2026 20:42