Problem
In long-running agentic workflows (~30+ minutes), the Safe Outputs MCP HTTP session expires, causing all subsequent create_pull_request (and other safeoutputs-routed) tool calls to fail with:
MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found
The Copilot agent then enters a futile retry loop — sleeping and retrying with escalating delays — wasting ~15 minutes of compute before the workflow times out.
Reproduction
Workflow run: https://github.com/dsyme/fv-squad/actions/runs/23607702532/job/68754992942
Timeline:
17:10 — Agent starts, begins Lean 4 formal verification work (installing toolchain, merging branches, running lake build)
17:41 — Agent commits locally, first create_pull_request call → fails with "session not found"
17:42–17:56 — Agent retries create_pull_request 12+ times, interspersed with sleep 15, sleep 30, sleep 60, sleep 90, sleep 120, sleep 180 delays. Also tries push_repo_memory (fails), noop (fails), missing_tool (fails). None recover.
17:56 — Workflow eventually times out
The agent successfully completed all its substantive work (editing Lean proofs, building, committing) but could not create a PR because the MCP session was dead.
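The retry loop in the timeline could be cut short if the caller treated "session not found" as non-transient. The following is a minimal sketch of such a check, not the Copilot agent's actual retry logic; `should_retry`, `NON_TRANSIENT_MARKERS`, and the `max_attempts` default are hypothetical names and values chosen for illustration.

```python
# Hypothetical list of error fragments that retrying on the same session cannot fix.
NON_TRANSIENT_MARKERS = ("session not found",)


def should_retry(error_message: str, attempt: int, max_attempts: int = 3) -> bool:
    """Return True only for errors that a plain retry might resolve."""
    lowered = error_message.lower()
    if any(marker in lowered for marker in NON_TRANSIENT_MARKERS):
        # The session is gone; sleeping and retrying cannot bring it back.
        return False
    return attempt < max_attempts


observed = (
    "MCP server 'safeoutputs': Error: Streamable HTTP error: "
    "Error POSTing to endpoint: session not found"
)
retry_observed = should_retry(observed, attempt=1)
```

With a check like this, the first failure at 17:41 would stop the loop immediately instead of running through a dozen sleeps.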
Root Cause Hypothesis
The Safe Outputs MCP server's Streamable HTTP transport appears to have a session timeout. The session was established at ~17:10 and the first tool call failure occurred at ~17:42, roughly 32 minutes into the run. Two mechanisms would fit this: a fixed TTL (e.g., 30 minutes), which would expire the session regardless of activity, or an inactivity timeout, which could fire during long lake build compilations where no MCP calls are made for 3-5 minutes at a time. The inactivity explanation only holds if the timeout is shorter than those gaps.
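To make the two hypotheses concrete, here is a small sketch that models both expiry policies. It is an illustration only: the call times in `calls` are invented (one call every 5 minutes), not taken from the run's logs, and `session_alive` is a hypothetical helper, not part of any MCP library.

```python
def session_alive(now, activity, fixed_ttl=None, idle_timeout=None):
    """Times are minutes since the session was established (t = 0)."""
    # Fixed TTL: the session dies at a set age, no matter how busy it was.
    if fixed_ttl is not None and now > fixed_ttl:
        return False
    # Idle timeout: the session dies if any gap between calls exceeds the limit.
    if idle_timeout is not None:
        last = 0
        for t in sorted(a for a in activity if a <= now) + [now]:
            if t - last > idle_timeout:
                return False
            last = t
    return True


# Hypothetical activity: one MCP call every 5 minutes, then a call at minute 31.
calls = [5, 10, 15, 20, 25, 30]
fixed_30 = session_alive(31, calls, fixed_ttl=30)
idle_10 = session_alive(31, calls, idle_timeout=10)
idle_4 = session_alive(31, calls, idle_timeout=4)
```

Under these assumed call times, a 30-minute fixed TTL kills the session and a 10-minute idle timeout does not; an idle timeout only explains the failure if it is shorter than the longest gap between calls. Comparing the actual MCP call timestamps against the failure time should distinguish the two cases.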
Impact
- Wasted compute: ~15 minutes of retry loops per affected run
- Lost work: The agent completed its task (code changes committed locally) but could not deliver the output (PR creation)
- User confusion: The workflow appears to hang with no actionable error
Suggested Fix
- Session keepalive/refresh: Implement periodic heartbeats or automatic session refresh so long-running agents don't lose their session
- Session reconnect: If a session expires, allow the client to establish a new session transparently rather than returning a fatal "session not found" error
- Graceful error: If reconnection is not possible, return a clear error indicating the session expired (with a suggested action) rather than a generic HTTP error that the agent interprets as transient
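The reconnect suggestion can be sketched as follows. This is a minimal illustration under stated assumptions, not the Safe Outputs client or server implementation: `FakeServer`, `ReconnectingClient`, and `SessionNotFound` are hypothetical stand-ins, and the real transport would surface the expired session as an HTTP error rather than a Python exception. As far as I understand the MCP Streamable HTTP transport specification, a client that receives HTTP 404 for a request carrying a session ID is expected to start a new session with a fresh initialize request, which is the behavior modeled here.

```python
import itertools
from dataclasses import dataclass, field


class SessionNotFound(Exception):
    """Raised by the (hypothetical) transport when the server has dropped the session."""


@dataclass
class FakeServer:
    """Stand-in for the MCP server; sessions can be dropped to simulate expiry."""
    sessions: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=itertools.count)

    def initialize(self) -> str:
        session_id = f"s{next(self._ids)}"
        self.sessions.add(session_id)
        return session_id

    def call_tool(self, session_id: str, name: str, args: dict) -> dict:
        if session_id not in self.sessions:
            raise SessionNotFound("session not found")
        return {"tool": name, "session": session_id, "ok": True}


class ReconnectingClient:
    """Re-initializes once when the session is gone, then replays the call."""

    def __init__(self, server: FakeServer):
        self.server = server
        self.session_id = server.initialize()
        self.reconnects = 0

    def call_tool(self, name: str, args: dict) -> dict:
        try:
            return self.server.call_tool(self.session_id, name, args)
        except SessionNotFound:
            # Retrying with the same session ID cannot succeed; open a new session.
            self.session_id = self.server.initialize()
            self.reconnects += 1
            return self.server.call_tool(self.session_id, name, args)


server = FakeServer()
client = ReconnectingClient(server)
server.sessions.clear()  # simulate the session expiring during a long build
result = client.call_tool("create_pull_request", {"title": "example"})
```

Replaying the call is safe in this sketch because the rejected request never reached a tool handler. A real implementation would also need to decide whether any per-session state on the server has to be rebuilt after re-initializing.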
Environment
- Workflow: dsyme/fv-squad (Lean 4 formal verification)
- Agent runtime: ~47 minutes total
- Session lifetime before failure: ~32 minutes
- AWF firewall: Not involved — all 137 network requests were allowed, 0 blocked