Skip to content

fix: capture partial output on timeout and kill process groups#33

Merged
marcus merged 1 commit intomarcus:mainfrom
gbasin:fix/capture-timeout-output
Feb 26, 2026
Merged

fix: capture partial output on timeout and kill process groups#33
marcus merged 1 commit intomarcus:mainfrom
gbasin:fix/capture-timeout-output

Conversation

@gbasin
Copy link
Contributor

@gbasin gbasin commented Feb 22, 2026

Summary

  • ExecRunner now uses Setpgid + process group kill (syscall.Kill(-pid, SIGKILL)) so context timeouts kill the entire process tree. Previously, exec.CommandContext only killed the direct child process — this was specifically a problem with the Codex provider, where the codex Node.js wrapper spawns a separate Rust binary as a child process. On timeout, Go killed the Node wrapper but the Rust child kept running, causing cmd.Run() to block on open stdout/stderr pipes until the orphaned process eventually exited. In practice this meant agent timeouts were completely non-functional for Codex.
  • Agent Execute() methods now preserve partial stdout/stderr on timeout and include stderr in error messages, instead of discarding all output.
  • Orchestrator plan(), implement(), and review() now log partial agent output as a warning when the agent errors but produced output, aiding debugging of timeout/crash scenarios.

Test plan

  • Verified with nightshift task run --timeout 15s: before fix, Codex process hung 2+ min past timeout (Node wrapper killed, Rust child alive); after fix, clean kill at exactly 15s with no orphaned processes
  • go test ./internal/agents/... ./internal/orchestrator/... passes
  • go vet ./... clean

🤖 Generated with Claude Code

- ExecRunner now uses Setpgid + process group kill so timeouts kill
  the entire process tree (fixes Node wrapper leaving Rust child alive)
- Agent Execute() preserves partial stdout/stderr on timeout instead
  of discarding it
- Orchestrator plan/implement/review log partial agent output on error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marcus
Copy link
Owner

marcus commented Feb 22, 2026

Hey @gbasin! Starling here (AI assistant on the project). 👋

Excellent diagnosis and fix — the Codex Node.js wrapper spawning a Rust child process that survives the parent kill is exactly the kind of thing that makes timeouts silently non-functional. Using Setpgid + process group kill is the correct solution here.

Preserving partial stdout/stderr on timeout and surfacing it as a warning log is a great addition too — debugging timeout scenarios is much easier when you can see what the agent actually produced before dying.

The before/after test with --timeout 15s is a strong proof of fix. Flagging for @marcus to review. ✦

@marcus marcus merged commit b13164a into marcus:main Feb 26, 2026
@marcus
Copy link
Owner

marcus commented Feb 26, 2026

Thanks @gbasin — clean fix for process group cleanup and partial output capture. Tests pass, merged! 🪶

Copy link
Owner

@marcus marcus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix — this addresses a real problem. The Codex/Node wrapper zombie-child issue is exactly the kind of thing that's easy to miss until you're staring at a hung process after a timeout.

What's good:

  • Process group approach is correct. Setpgid: true + syscall.Kill(-pid, SIGKILL) is the right POSIX pattern for killing a full process tree. It works correctly on both Linux and macOS.
  • cmd.Cancel is safe here. Go only calls it after cmd.Start() succeeds, so cmd.Process won't be nil. And Go 1.24 is in the module — no compat concerns.
  • Partial output on timeout is a great addition. Getting result.Output = stdout back when a job times out could save a lot of debugging pain — agents often produce meaningful output right up until they stall.
  • Surfacing stderr in error messages for both timeout and non-zero exit cases is a solid QoL improvement.

Minor notes (non-blocking):

  • There are now two identical helpers: truncate() in agent.go and truncateStr() in orchestrator.go. Easy to unify into one shared helper — but not a blocker.
  • No tests added, but the behavior is tricky to unit test without process spawning infrastructure. If you wanted to add coverage, a test that spawns a process that forks a child and verifies both are killed on timeout would be the gold standard. Not required for this PR though.

One thing to double-check (minor): The Setpgid change is only in ExecRunner.Run (claude.go). If the Codex agent uses ExecRunner too, it's already covered — just want to confirm that's the case rather than Codex having its own exec path that might need the same treatment.

Overall: this is a solid, well-scoped fix. Thanks for digging into this, @gbasin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants