fix: capture partial output on timeout and kill process groups by gbasin · Pull Request #33 · marcus/nightshift

gbasin · 2026-02-22T13:39:40Z

Summary

ExecRunner now uses Setpgid + process group kill (syscall.Kill(-pid, SIGKILL)) so context timeouts kill the entire process tree. Previously, exec.CommandContext only killed the direct child process — this was specifically a problem with the Codex provider, where the codex Node.js wrapper spawns a separate Rust binary as a child process. On timeout, Go killed the Node wrapper but the Rust child kept running, causing cmd.Run() to block on open stdout/stderr pipes until the orphaned process eventually exited. In practice this meant agent timeouts were completely non-functional for Codex.
Agent Execute() methods now preserve partial stdout/stderr on timeout and include stderr in error messages, instead of discarding all output.
Orchestrator plan(), implement(), and review() now log partial agent output as a warning when the agent errors but produced output, aiding debugging of timeout/crash scenarios.

Test plan

Verified with nightshift task run --timeout 15s: before fix, Codex process hung 2+ min past timeout (Node wrapper killed, Rust child alive); after fix, clean kill at exactly 15s with no orphaned processes
go test ./internal/agents/... ./internal/orchestrator/... passes
go vet ./... clean

🤖 Generated with Claude Code

- ExecRunner now uses Setpgid + process group kill so timeouts kill the entire process tree (fixes Node wrapper leaving Rust child alive) - Agent Execute() preserves partial stdout/stderr on timeout instead of discarding it - Orchestrator plan/implement/review log partial agent output on error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

marcus · 2026-02-22T16:06:52Z

Hey @gbasin! Starling here (AI assistant on the project). 👋

Excellent diagnosis and fix — the Codex Node.js wrapper spawning a Rust child process that survives the parent kill is exactly the kind of thing that makes timeouts silently non-functional. Using Setpgid + process group kill is the correct solution here.

Preserving partial stdout/stderr on timeout and surfacing it as a warning log is a great addition too — debugging timeout scenarios is much easier when you can see what the agent actually produced before dying.

The before/after test with --timeout 15s is a strong proof of fix. Flagging for @marcus to review. ✦

marcus · 2026-02-26T17:22:44Z

Thanks @gbasin — clean fix for process group cleanup and partial output capture. Tests pass, merged! 🪶

marcus

Nice fix — this addresses a real problem. The Codex/Node wrapper zombie-child issue is exactly the kind of thing that's easy to miss until you're staring at a hung process after a timeout.

What's good:

Process group approach is correct. Setpgid: true + syscall.Kill(-pid, SIGKILL) is the right POSIX pattern for killing a full process tree. It works correctly on both Linux and macOS.
cmd.Cancel is safe here. Go only calls it after cmd.Start() succeeds, so cmd.Process won't be nil. And Go 1.24 is in the module — no compat concerns.
Partial output on timeout is a great addition. Getting result.Output = stdout back when a job times out could save a lot of debugging pain — agents often produce meaningful output right up until they stall.
Surfacing stderr in error messages for both timeout and non-zero exit cases is a solid QoL improvement.

Minor notes (non-blocking):

There are now two identical helpers: truncate() in agent.go and truncateStr() in orchestrator.go. Easy to unify into one shared helper — but not a blocker.
No tests added, but the behavior is tricky to unit test without process spawning infrastructure. If you wanted to add coverage, a test that spawns a process that forks a child and verifies both are killed on timeout would be the gold standard. Not required for this PR though.

One thing to double-check (minor): The Setpgid change is only in ExecRunner.Run (claude.go). If the Codex agent uses ExecRunner too, it's already covered — just want to confirm that's the case rather than Codex having its own exec path that might need the same treatment.

Overall: this is a solid, well-scoped fix. Thanks for digging into this, @gbasin!

marcus merged commit b13164a into marcus:main Feb 26, 2026

marcus approved these changes Feb 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: capture partial output on timeout and kill process groups#33

fix: capture partial output on timeout and kill process groups#33
marcus merged 1 commit intomarcus:mainfrom
gbasin:fix/capture-timeout-output

gbasin commented Feb 22, 2026 •

edited

Loading

Uh oh!

marcus commented Feb 22, 2026

Uh oh!

marcus commented Feb 26, 2026

Uh oh!

marcus left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gbasin commented Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

marcus commented Feb 22, 2026

Uh oh!

marcus commented Feb 26, 2026

Uh oh!

marcus left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gbasin commented Feb 22, 2026 •

edited

Loading