Safe Outputs MCP session expires mid-run, causing agent retry loops #2596

@lpcox

Problem

In long-running agentic workflows (roughly 30 minutes or more), the Safe Outputs MCP HTTP session expires. Every subsequent create_pull_request call, and every other tool call routed through safeoutputs, then fails with:

MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found

The Copilot agent then enters a futile retry loop, sleeping and retrying with escalating delays, and wastes ~15 minutes of compute before the workflow times out.

Reproduction

Workflow run: https://github.com/dsyme/fv-squad/actions/runs/23607702532/job/68754992942

Timeline:

  • 17:10 — Agent starts, begins Lean 4 formal verification work (installing toolchain, merging branches, running lake build)
  • 17:41 — Agent commits locally, first create_pull_request call → fails with "session not found"
  • 17:42–17:56 — Agent retries create_pull_request 12+ times, interspersed with sleep 15, sleep 30, sleep 60, sleep 90, sleep 120, sleep 180 delays. Also tries push_repo_memory (fails), noop (fails), missing_tool (fails). None recover.
  • 17:56 — Workflow eventually times out

The agent successfully completed all its substantive work (editing Lean proofs, building, committing) but could not create a PR because the MCP session was dead.

Root Cause Hypothesis

The Safe Outputs MCP server's Streamable HTTP transport appears to have a session timeout. The session was established at ~17:10 and the first tool call failure occurred at ~17:42, roughly 32 minutes into the run. Two mechanisms would fit this: a fixed TTL (e.g., 30 minutes), which would expire the session regardless of activity, or an inactivity timeout, which could fire during long lake build compilations where no MCP calls are made for 3-5 minutes at a time.
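The two hypotheses predict different failure points, which can be checked against the MCP call timestamps in the run logs. Below is a minimal sketch of that check. The function name `first_failure`, the call times, and the candidate timeout values are all hypothetical and chosen only for illustration; they are not taken from the actual run or from the Safe Outputs server.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def first_failure(
    start: datetime,
    calls: Iterable[datetime],
    ttl: Optional[timedelta] = None,
    idle_timeout: Optional[timedelta] = None,
) -> Optional[datetime]:
    """Return the first call that would hit an expired session, or None.

    ttl:          session dies this long after `start`, regardless of activity.
    idle_timeout: session dies if no call arrives for this long.
    """
    last_activity = start
    for t in sorted(calls):
        if ttl is not None and t - start >= ttl:
            return t
        if idle_timeout is not None and t - last_activity >= idle_timeout:
            return t
        last_activity = t
    return None


# Hypothetical call times (the date is arbitrary); gaps of 2-5 minutes mimic
# the idle periods during long builds described above.
day = datetime(2000, 1, 1)
start = day.replace(hour=17, minute=10)
calls = [day.replace(hour=17, minute=m) for m in (12, 17, 20, 25, 29, 34, 38, 41)]

fixed_ttl = first_failure(start, calls, ttl=timedelta(minutes=30))
short_idle = first_failure(start, calls, idle_timeout=timedelta(minutes=5))
long_idle = first_failure(start, calls, idle_timeout=timedelta(minutes=10))
```

With these made-up inputs, a 30-minute fixed TTL first fails at the 17:41 call, a 5-minute inactivity timeout would already have failed at 17:17, and a 10-minute inactivity timeout never fires. If the real logs show earlier idle gaps as long as the one preceding the first failure, an inactivity timeout becomes the less likely explanation.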

Impact

  • Wasted compute: ~15 minutes of retry loops per affected run
  • Lost work: The agent completed its task (code changes committed locally) but could not deliver the output (PR creation)
  • User confusion: The workflow appears to hang with no actionable error

Suggested Fix

  1. Session keepalive/refresh: Implement periodic heartbeats or automatic session refresh so long-running agents don't lose their session
  2. Session reconnect: If a session expires, allow the client to establish a new session transparently rather than returning a fatal "session not found" error
  3. Graceful error: If reconnection is not possible, return a clear error indicating the session expired (with a suggested action) rather than a generic HTTP error that the agent interprets as transient
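Fixes 2 and 3 could be combined in a thin client-side wrapper. The sketch below is illustrative only: `ReconnectingClient`, `SessionExpiredError`, and `FakeTransport` are hypothetical names, and the fake transport stands in for the real Streamable HTTP client, whose API is not shown here. It assumes the transport can distinguish an unknown session from other failures.

```python
from typing import Any, Dict, Set


class SessionExpiredError(Exception):
    """Non-transient: the server no longer recognises the session ID."""


class FakeTransport:
    """Hypothetical in-memory stand-in for the MCP HTTP transport."""

    def __init__(self) -> None:
        self._live: Set[str] = set()
        self._counter = 0

    def initialize(self) -> str:
        # Establish a fresh session and return its ID.
        self._counter += 1
        session_id = f"session-{self._counter}"
        self._live.add(session_id)
        return session_id

    def expire_all(self) -> None:
        # Simulate the server dropping every session.
        self._live.clear()

    def call(self, session_id: str, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if session_id not in self._live:
            raise SessionExpiredError("session not found")
        return {"tool": tool, "session": session_id}


class ReconnectingClient:
    """Re-initializes the session on expiry, within a bounded budget."""

    def __init__(self, transport: FakeTransport, max_reconnects: int = 1) -> None:
        self._transport = transport
        self._max_reconnects = max_reconnects
        self._session_id = transport.initialize()

    def call_tool(self, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
        reconnects = 0
        while True:
            try:
                return self._transport.call(self._session_id, tool, args)
            except SessionExpiredError:
                if reconnects >= self._max_reconnects:
                    # Fix 3: a clear, non-transient error instead of a generic one.
                    raise SessionExpiredError(
                        "Safe Outputs session expired and could not be "
                        "re-established; retrying this call will not help."
                    ) from None
                reconnects += 1
                # Fix 2: transparently establish a new session and retry once.
                self._session_id = self._transport.initialize()


transport = FakeTransport()
client = ReconnectingClient(transport)
transport.expire_all()  # simulate expiry mid-run
result = client.call_tool("create_pull_request", {})
```

In this sketch the `create_pull_request` call succeeds on a second session even though the first one expired. When the reconnect budget is exhausted, the caller sees an explicit expiry error, which an agent is less likely to treat as transient and retry with sleeps.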

Environment

  • Workflow: dsyme/fv-squad (Lean 4 formal verification)
  • Agent runtime: ~47 minutes total
  • Session lifetime before failure: ~32 minutes
  • AWF firewall: Not involved — all 137 network requests were allowed, 0 blocked
