Background processes accumulate zombies; process tool reports dead processes as running #6908

@malaiwah


Bug

Background processes started with terminal(background=true) become zombies (<defunct>) when they exit. The process tool continues to report them as "running" because it tracks the wrapper PID without checking if it's a zombie.

Reproduction

# Start several background processes with different durations
terminal("sleep 10", background=true)   # → PID 393
terminal("sleep 65", background=true)   # → PID 412
terminal("sleep 73", background=true)   # → PID 435
terminal("sleep 300", background=true)  # → PID 9

# Wait for the short ones to finish, then check:
process(action="list")
# Reports all 4 as "running", even though three of them have already exited ❌

# But ps auxwww shows:
# PID 9    sleep 300     running ✅
# PID 393  [bash] <defunct>  zombie ❌
# PID 412  [bash] <defunct>  zombie ❌
# PID 435  [bash] <defunct>  zombie ❌

Root cause

Two compounding issues:

  1. No zombie reaping. The sandbox container's PID 1 is sleep 2h (from docker run ... sleep 2h), not an init process. It doesn't call wait() on orphaned children, so completed background processes become zombies instead of being reaped.

  2. process tool doesn't check process state. ProcessRegistry tracks wrapper PIDs and reports them as "running" based on whether the PID exists, without checking /proc/<pid>/status for zombie state. A zombie PID still exists (it's in the process table until reaped) so the registry incorrectly reports it as alive.
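To illustrate the second point, here is a minimal sketch of a liveness check that treats a zombie as not running. It is Linux-only and reads the `State:` line of /proc/<pid>/status. The helper names `parse_state` and `is_alive` are hypothetical and are not taken from tools/process_registry.py.

```python
from pathlib import Path
from typing import Optional


def parse_state(status_text: str) -> Optional[str]:
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tZ (zombie)".
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def is_alive(pid: int) -> bool:
    """True only if the PID exists and is neither a zombie (Z) nor dead (X)."""
    try:
        text = Path(f"/proc/{pid}/status").read_text()
    except (FileNotFoundError, ProcessLookupError):
        # No /proc entry: the process has already been reaped or never existed.
        return False
    return parse_state(text) not in (None, "Z", "X")
```

A check like this only fixes the reporting. The zombie entries stay in the process table until something calls wait() on them.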

Expected behavior

  • Completed background processes should be reaped (not zombie).
  • process(action="list") should show completed processes as "exited" with their exit code.
  • process(action="poll") on a completed process should return its final output + exit status.

Possible fixes

  1. Use --init flag on docker run (adds tini as PID 1) so orphaned children are reaped automatically.
  2. Have the process registry call os.waitpid(pid, os.WNOHANG) on poll/list to detect and reap zombies.
  3. Both: --init for the container-level fix, waitpid for the registry-level fix.
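A minimal sketch of fix 2, assuming the registry process is the direct parent of the wrapper PID (os.waitpid can only reap the caller's own children, so this would not help if the wrapper is started through a separate docker exec). The function name `poll_exit` is hypothetical, and os.waitstatus_to_exitcode requires Python 3.9 or newer.

```python
import os
from typing import Optional


def poll_exit(pid: int) -> Optional[int]:
    """Reap `pid` if it has exited; return its exit code, or None if still running.

    Raises ChildProcessError if `pid` is not a child of the calling process
    (or has already been reaped).
    """
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    if reaped_pid == 0:
        # The child exists and has not exited yet.
        return None
    # A negative value means the child was terminated by that signal number.
    return os.waitstatus_to_exitcode(status)
```

Because a PID can be reaped only once, a registry using this approach would need to store the returned exit code so that later list and poll calls can report "exited" without calling waitpid again.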

Environment

  • Docker backend (sandboxed containers)
  • Container entrypoint: sleep 2h (not an init process)
  • Hermes process registry: tools/process_registry.py
