Fix SSH agent forwarding for long-running sessions #74

Merged
JAORMX merged 5 commits into main from
fix/ssh-agent-forwarding-resilience
Mar 20, 2026
@JAORMX JAORMX commented Mar 20, 2026

Summary

  • Bump go-microvm to v0.0.24 — picks up the root cause fix (go-microvm#49): a missing channel.CloseWrite() in the guest SSH server's agent proxy caused goroutine leaks that exhausted the maxAgentConns semaphore after ~8 agent connections, breaking SSH agent forwarding for the rest of the session. Symptom: git operations succeed for the first few minutes then fail with Permission denied (publickey).
  • Add SSH keepalive (30s interval) — sends keepalive@openssh.com requests to detect dead mux connections early and prevent idle timeouts.
  • Add agent dial retry (3 attempts, exponential backoff from 200ms) — handles transient agent unavailability during socket-activated restarts (e.g. gcr-ssh-agent).
  • Upgrade agent forwarding errors to warn level — previously all failures were logged at debug level, making them invisible without explicit debug logging.
  • Inject SSHAuthSock via SessionOpts — replaces os.Getenv("SSH_AUTH_SOCK") in the infrastructure layer with dependency injection through the domain layer, improving testability and DDD compliance.
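
The agent dial retry described above (3 attempts, exponential backoff from 200ms) could look roughly like the sketch below. This is not the code from this PR: `dialAgentWithRetry`, `dialFunc`, and `demoRetry` are hypothetical names chosen for illustration, and the dial is injected as a function so the loop can be exercised without a real agent socket.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFunc abstracts the socket dial so the retry loop can be tested
// without a real SSH agent. (Hypothetical type, not from the PR.)
type dialFunc func(ctx context.Context, path string) (net.Conn, error)

// dialAgentWithRetry tries dial up to attempts times (attempts >= 1),
// doubling the wait between tries and giving up early if ctx is cancelled.
func dialAgentWithRetry(ctx context.Context, dial dialFunc, path string,
	attempts int, initial time.Duration) (net.Conn, error) {
	var lastErr error
	backoff := initial
	for i := 0; i < attempts; i++ {
		conn, err := dial(ctx, path)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return nil, fmt.Errorf("agent dial failed after %d attempts: %w", attempts, lastErr)
}

// demoRetry simulates an agent that refuses the first two dials.
func demoRetry() (calls int, err error) {
	dial := func(ctx context.Context, path string) (net.Conn, error) {
		calls++
		if calls < 3 {
			return nil, errors.New("connection refused")
		}
		client, server := net.Pipe()
		server.Close()
		return client, nil
	}
	// A 1ms initial backoff keeps the demo fast; the PR describes 200ms.
	conn, err := dialAgentWithRetry(context.Background(), dial, "/tmp/agent.sock", 3, time.Millisecond)
	if err == nil {
		conn.Close()
	}
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println("dial calls:", calls, "err:", err)
}
```

In real use the injected function would typically wrap `(&net.Dialer{}).DialContext(ctx, "unix", path)`, with the path taken from the injected `SSHAuthSock` value rather than from the environment.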

Test plan

  • New unit tests: TestDialAgentWithRetry_{SuccessFirstAttempt,AllRetriesFail,SuccessAfterRetry,ContextCancelled}, TestSetupAgentForwarding_{EmptyAuthSock,UnreachableSocket}
  • task test — full suite passes with race detector
  • task lint — zero issues
  • End-to-end soak test: 15-minute git fetch loop every 2 minutes — 8/8 succeeded (was 2/8 before fix)

🤖 Generated with Claude Code

JAORMX and others added 5 commits March 20, 2026 10:58
Add retry with exponential backoff on agent socket dial (handles
transient gcr-ssh-agent restarts), SSH keepalive to detect dead
connections, and warn-level logging for agent failures that were
previously invisible at debug level. Inject SSHAuthSock via
SessionOpts instead of reading os.Getenv in infrastructure layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
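
For readers unfamiliar with SSH keepalives, here is a minimal sketch of a keepalive loop like the one this commit describes. It is an illustration, not the PR's implementation: `keepAlive`, `requestSender`, and `fakeConn` are hypothetical names. The interface mirrors the `SendRequest` method of `ssh.Conn` in `golang.org/x/crypto/ssh`, so the example runs with the standard library alone.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// requestSender is the subset of an SSH connection this sketch needs.
// Its method has the same shape as SendRequest on ssh.Conn.
type requestSender interface {
	SendRequest(name string, wantReply bool, payload []byte) (bool, []byte, error)
}

// keepAlive sends keepalive@openssh.com once per interval. It returns nil
// when ctx is cancelled and an error as soon as a request fails, which is
// the signal that the underlying connection is dead.
func keepAlive(ctx context.Context, conn requestSender, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if _, _, err := conn.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				return fmt.Errorf("ssh keepalive failed: %w", err)
			}
		}
	}
}

// fakeConn answers okLeft keepalives successfully, then reports a failure.
type fakeConn struct{ okLeft, sent int }

func (f *fakeConn) SendRequest(name string, wantReply bool, payload []byte) (bool, []byte, error) {
	f.sent++
	if f.okLeft == 0 {
		return false, nil, errors.New("connection lost")
	}
	f.okLeft--
	return true, nil, nil
}

// demoKeepAlive runs the loop against a connection that dies after two replies.
func demoKeepAlive() (sent int, failed bool) {
	conn := &fakeConn{okLeft: 2}
	// 1ms keeps the demo fast; the PR uses a 30s interval.
	err := keepAlive(context.Background(), conn, time.Millisecond)
	return conn.sent, err != nil
}

func main() {
	sent, failed := demoKeepAlive()
	fmt.Println("requests sent:", sent, "failure detected:", failed)
}
```

Setting `wantReply` to true is what makes the request useful for liveness detection: the call blocks until the peer answers or the transport reports an error.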
Picks up the SSH agent forwarding half-close fix (go-microvm#49)
that prevents goroutine leaks exhausting the agent connection
semaphore during long-running sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
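
The half-close behaviour behind this fix can be illustrated with plain TCP, since `*net.TCPConn` exposes the same `CloseWrite() error` method as `ssh.Channel`. The sketch below is not the go-microvm code; `copyThenHalfClose` and `demoHalfClose` are hypothetical names. It shows a peer that only replies once it sees end-of-stream, so a missing `CloseWrite` would leave both sides blocked.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"strings"
)

// halfCloser is a writer that can signal end-of-stream without closing
// the read side. *net.TCPConn, *net.UnixConn, and ssh.Channel all fit.
type halfCloser interface {
	io.Writer
	CloseWrite() error
}

// copyThenHalfClose forwards src to dst, then half-closes dst so the peer
// sees EOF. Skipping the CloseWrite leaves the peer's read loop blocked.
func copyThenHalfClose(dst halfCloser, src io.Reader) error {
	_, err := io.Copy(dst, src)
	if cerr := dst.CloseWrite(); err == nil {
		err = cerr
	}
	return err
}

// demoHalfClose talks to a loopback server that reads until EOF before replying.
func demoHalfClose() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		req, _ := io.ReadAll(conn) // returns only after the client half-closes
		conn.Write(append([]byte("ack:"), req...))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if err := copyThenHalfClose(conn.(*net.TCPConn), strings.NewReader("sign")); err != nil {
		return "", err
	}
	reply, err := io.ReadAll(conn) // the read side is still open after CloseWrite
	return string(reply), err
}

func main() {
	reply, err := demoHalfClose()
	fmt.Println(reply, err)
}
```

In a proxy that copies in both directions, each blocked copy is a goroutine that never exits, which is how a bounded pool of connection slots such as the `maxAgentConns` semaphore can be drained over time.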
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
macOS limits Unix socket paths to 104 bytes. t.TempDir() combined
with long test names exceeded this limit causing bind failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
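
One common way to stay under the socket path limit in tests is to create a short directory directly under `/tmp` instead of using `t.TempDir()`. The commit's actual change is not shown on this page, so the sketch below is only an assumption about the general approach; `shortSocketPath` and `demoShortSocket` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// maxUnixPath is the macOS socket path limit quoted in the commit message.
const maxUnixPath = 104

// shortSocketPath returns a socket path in a short temp directory under /tmp,
// plus a cleanup function. Unlike t.TempDir(), the directory name does not
// include the test name, so the path stays short.
func shortSocketPath() (path string, cleanup func(), err error) {
	dir, err := os.MkdirTemp("/tmp", "sock")
	if err != nil {
		return "", nil, err
	}
	return filepath.Join(dir, "a.sock"), func() { os.RemoveAll(dir) }, nil
}

// demoShortSocket binds a Unix listener on the short path and reports its length.
func demoShortSocket() (int, error) {
	path, cleanup, err := shortSocketPath()
	if err != nil {
		return 0, err
	}
	defer cleanup()

	ln, err := net.Listen("unix", path)
	if err != nil {
		return len(path), err
	}
	defer ln.Close()
	return len(path), nil
}

func main() {
	n, err := demoShortSocket()
	fmt.Println("path length:", n, "under limit:", n < maxUnixPath, "err:", err)
}
```

In a real test the cleanup function would usually be registered with `t.Cleanup` so the directory is removed even when the test fails.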
Force --no-ff on git merge to ensure a merge commit is always
created regardless of git config or platform defaults.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX merged commit c39d56c into main Mar 20, 2026
8 checks passed
@JAORMX JAORMX deleted the fix/ssh-agent-forwarding-resilience branch March 20, 2026 12:11