fix(sdk-py): make tools_agent fake model stateless #7930
Merged
… flake `tools_agent.py` drove `create_agent` with a module-global `FakeMessagesListChatModel` whose response cursor `i` persists and cycles `0 -> 1 -> 0` across runs. The test relies on every run starting at an even index so the first model call issues the `search` tool call. On the licensed server (multiple queue workers / graph warmup) a run can start mid-cycle at an odd index, so the model replies "done." first and emits no tool call. The `tools` channel then produces zero events and `test_tools_async` fails with "expected at least one tool call handle". This is order-dependent, which is why only `test_tools_async` flaked: it runs before `test_tools_sync`, absorbs the odd-parity run (one model call), and resets the shared cursor back to even so the sync test passes. Derive the reply from conversation state instead: issue the `search` tool call until a `ToolMessage` is present, then a terminating `AIMessage`. This is order-independent, so every run emits exactly one tool call regardless of how many times the model was previously invoked.
Mason Daugherty (mdrxy)
approved these changes
May 28, 2026
Problem
`test_tools.py::test_tools_async` (sdk-py integration suite) flakes with `AssertionError: expected at least one tool call handle` (`assert []`), while `test_tools_sync` passes. Observed on #7927's CI, but the test predates that PR (added in #7833) and is unchanged on its branch, so this is a pre-existing flake.
Root cause
`integration/graph/tools_agent.py` drives `create_agent` with a module-global `FakeMessagesListChatModel`. That model keeps an instance counter `i` that persists and cycles `0 -> 1 -> 0` across runs. The graph only works if every run starts at an even index (first model call issues the `search` tool call). On the licensed server (multiple queue workers / graph warmup), a run can start mid-cycle at an odd index: the model then replies `"done."` first, `create_agent` ends after one model call, and zero `tools`-channel events reach the wire. `thread.tool_calls` correctly yields nothing and the assertion fails.
This is why only the async test flaked: it runs before the sync test, absorbs the odd-parity run (one model call), and flips the shared cursor back to even, so the sync test then passes.
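The parity dependence described above can be reproduced without a server. The following is a minimal sketch using plain-Python stand-ins, not LangChain's classes: `CyclingFakeModel` and `run_agent` are hypothetical names, and replies are simplified to strings. It only assumes what the description states, namely a shared cursor `i` that wraps over a two-item response list.

```python
from dataclasses import dataclass


@dataclass
class CyclingFakeModel:
    """Hypothetical stand-in: replays a fixed list, keeping cursor `i` across calls."""

    responses: list[str]
    i: int = 0

    def invoke(self) -> str:
        reply = self.responses[self.i]
        # Advance the cursor and wrap, so state leaks from one run into the next.
        self.i = (self.i + 1) % len(self.responses)
        return reply


def run_agent(model: CyclingFakeModel) -> list[str]:
    """One simplified agent run: call the model until it stops asking for a tool."""
    replies = []
    while True:
        reply = model.invoke()
        replies.append(reply)
        if reply != "tool_call:search":
            return replies


# Module-global instance, shared by every run (the problematic pattern).
SHARED = CyclingFakeModel(responses=["tool_call:search", "done."])

if __name__ == "__main__":
    print(run_agent(SHARED))  # starts at even index: tool call, then "done."
    SHARED.i = 1              # simulate a run that starts mid-cycle
    print(run_agent(SHARED))  # starts at odd index: "done." only, no tool call
```

A run that starts at the odd index makes a single model call and leaves the cursor back at 0, which matches the observation that the next test then passes.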
Evidence
At even `i` the run emits 1 tool call + 1 `ToolMessage`; at odd `i` it emits 0 of each. Verified the state-based replacement emits exactly one tool call across repeated runs with `i` flipped each time.
The `tools_agent` run succeeded (`run_exec_ms=15`) with no tool-channel events: the server emitted none because the graph made no tool call. Not an SDK delivery bug.
Fix
Derive the reply from conversation state rather than a cycling response list: issue the `search` tool call until a `ToolMessage` is present, then a terminating `AIMessage`. Order-independent, so every run emits exactly one tool call regardless of prior invocation count. Still subclasses `FakeMessagesListChatModel` (not `GenericFakeChatModel`) to preserve the `_stream` behavior that keeps `tool_calls` intact.
Test plan
`sdk-py integration test` green (the flaky check)
`ruff format` / `ruff check` clean; graph compiles.
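For reference, the state-based reply described under Fix could be sketched as follows. This is an illustration with plain-Python stand-ins rather than the merged code: `Message`, `stateless_reply`, and `run_agent` are hypothetical names, the tool arguments are made up, and the real model subclasses `FakeMessagesListChatModel` instead of being a free function.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for a chat message; `kind` is "human", "ai", or "tool"."""

    kind: str
    content: str = ""
    tool_calls: list = field(default_factory=list)


def stateless_reply(messages: list[Message]) -> Message:
    """Choose the reply from the conversation so far, not from a call counter."""
    if any(m.kind == "tool" for m in messages):
        # A tool result is already present: terminate the run.
        return Message(kind="ai", content="done.")
    # No tool result yet: issue the `search` tool call (arguments are illustrative).
    return Message(kind="ai", tool_calls=[{"name": "search", "args": {"query": "x"}}])


def run_agent(user_text: str) -> list[Message]:
    """Simplified agent loop: call the model, execute tool calls, repeat."""
    messages = [Message(kind="human", content=user_text)]
    while True:
        reply = stateless_reply(messages)
        messages.append(reply)
        if not reply.tool_calls:
            return messages
        for call in reply.tool_calls:
            messages.append(Message(kind="tool", content=f"result for {call['name']}"))


if __name__ == "__main__":
    for _ in range(3):
        history = run_agent("hi")
        print([m.kind for m in history])
```

Because the decision depends only on the messages passed in, repeated runs in any order each produce one tool call followed by the terminating reply.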