fix: capture partial output on timeout and kill process groups#33
fix: capture partial output on timeout and kill process groups#33marcus merged 1 commit intomarcus:mainfrom
Conversation
- ExecRunner now uses Setpgid + process group kill so timeouts kill the entire process tree (fixes Node wrapper leaving Rust child alive) - Agent Execute() preserves partial stdout/stderr on timeout instead of discarding it - Orchestrator plan/implement/review log partial agent output on error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Hey @gbasin! Starling here (AI assistant on the project). 👋 Excellent diagnosis and fix — the Codex Node.js wrapper spawning a Rust child process that survives the parent kill is exactly the kind of thing that makes timeouts silently non-functional. Using Preserving partial stdout/stderr on timeout and surfacing it as a warning log is a great addition too — debugging timeout scenarios is much easier when you can see what the agent actually produced before dying. The before/after test with |
|
Thanks @gbasin — clean fix for process group cleanup and partial output capture. Tests pass, merged! 🪶 |
marcus
left a comment
There was a problem hiding this comment.
Nice fix — this addresses a real problem. The Codex/Node wrapper zombie-child issue is exactly the kind of thing that's easy to miss until you're staring at a hung process after a timeout.
What's good:
- Process group approach is correct.
Setpgid: true+syscall.Kill(-pid, SIGKILL)is the right POSIX pattern for killing a full process tree. It works correctly on both Linux and macOS. cmd.Cancelis safe here. Go only calls it aftercmd.Start()succeeds, socmd.Processwon't be nil. And Go 1.24 is in the module — no compat concerns.- Partial output on timeout is a great addition. Getting
result.Output = stdoutback when a job times out could save a lot of debugging pain — agents often produce meaningful output right up until they stall. - Surfacing stderr in error messages for both timeout and non-zero exit cases is a solid QoL improvement.
Minor notes (non-blocking):
- There are now two identical helpers:
truncate()inagent.goandtruncateStr()inorchestrator.go. Easy to unify into one shared helper — but not a blocker. - No tests added, but the behavior is tricky to unit test without process spawning infrastructure. If you wanted to add coverage, a test that spawns a process that forks a child and verifies both are killed on timeout would be the gold standard. Not required for this PR though.
One thing to double-check (minor): The Setpgid change is only in ExecRunner.Run (claude.go). If the Codex agent uses ExecRunner too, it's already covered — just want to confirm that's the case rather than Codex having its own exec path that might need the same treatment.
Overall: this is a solid, well-scoped fix. Thanks for digging into this, @gbasin!
Summary
ExecRunnernow usesSetpgid+ process group kill (syscall.Kill(-pid, SIGKILL)) so context timeouts kill the entire process tree. Previously,exec.CommandContextonly killed the direct child process — this was specifically a problem with the Codex provider, where thecodexNode.js wrapper spawns a separate Rust binary as a child process. On timeout, Go killed the Node wrapper but the Rust child kept running, causingcmd.Run()to block on open stdout/stderr pipes until the orphaned process eventually exited. In practice this meant agent timeouts were completely non-functional for Codex.Execute()methods now preserve partial stdout/stderr on timeout and include stderr in error messages, instead of discarding all output.plan(),implement(), andreview()now log partial agent output as a warning when the agent errors but produced output, aiding debugging of timeout/crash scenarios.Test plan
nightshift task run --timeout 15s: before fix, Codex process hung 2+ min past timeout (Node wrapper killed, Rust child alive); after fix, clean kill at exactly 15s with no orphaned processesgo test ./internal/agents/... ./internal/orchestrator/...passesgo vet ./...clean🤖 Generated with Claude Code