Safe Outputs MCP session expires mid-run, causing agent retry loops

## Problem

In long-running agentic workflows (~30+ minutes), the Safe Outputs MCP HTTP session expires, causing all subsequent `create_pull_request` (and other safeoutputs-routed) tool calls to fail with:

```
MCP server 'safeoutputs': Error: Streamable HTTP error: Error POSTing to endpoint: session not found
```

The Copilot agent then enters a futile retry loop — sleeping and retrying with escalating delays — wasting ~15 minutes of compute before the workflow times out.

## Reproduction

**Workflow run:** https://github.com/dsyme/fv-squad/actions/runs/23607702532/job/68754992942

**Timeline:**
- `17:10` — Agent starts, begins Lean 4 formal verification work (installing toolchain, merging branches, running `lake build`)
- `17:41` — Agent commits locally, first `create_pull_request` call → **fails** with "session not found"
- `17:42–17:56` — Agent retries `create_pull_request` **12+ times**, interspersed with `sleep 15`, `sleep 30`, `sleep 60`, `sleep 90`, `sleep 120`, `sleep 180` delays. Also tries `push_repo_memory` (fails), `noop` (fails), `missing_tool` (fails). None recover.
- `17:56` — Workflow eventually times out

The agent successfully completed all its substantive work (editing Lean proofs, building, committing) but could not create a PR because the MCP session was dead.

## Root Cause Hypothesis

The Safe Outputs MCP server's Streamable HTTP transport appears to have a session timeout. The session was established at ~17:10 and the first tool call failure occurred at ~17:42 — roughly **32 minutes** into the run. If the session has a fixed TTL (e.g., 30 minutes) or an inactivity timeout, it would expire during long `lake build` compilations where no MCP calls are made for 3-5 minutes at a time.

## Impact

- **Wasted compute:** ~15 minutes of retry loops per affected run
- **Lost work:** The agent completed its task (code changes committed locally) but could not deliver the output (PR creation)
- **User confusion:** The workflow appears to hang with no actionable error

## Suggested Fix

1. **Session keepalive/refresh:** Implement periodic heartbeats or automatic session refresh so long-running agents don't lose their session
2. **Session reconnect:** If a session expires, allow the client to establish a new session transparently rather than returning a fatal "session not found" error
3. **Graceful error:** If reconnection is not possible, return a clear error indicating the session expired (with a suggested action) rather than a generic HTTP error that the agent interprets as transient

## Environment

- **Workflow:** dsyme/fv-squad (Lean 4 formal verification)
- **Agent runtime:** ~47 minutes total
- **Session lifetime before failure:** ~32 minutes
- **AWF firewall:** Not involved — all 137 network requests were allowed, 0 blocked

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Safe Outputs MCP session expires mid-run, causing agent retry loops #2596

Problem

Reproduction

Root Cause Hypothesis

Impact

Suggested Fix

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Safe Outputs MCP session expires mid-run, causing agent retry loops #2596

Description

Problem

Reproduction

Root Cause Hypothesis

Impact

Suggested Fix

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions