fix(docker): reap orphaned subprocesses with tini; dedup smoke setup (#1287, #1303)#1306

Merged
Aaronontheweb merged 3 commits into
netclaw-dev:dev from
Aaronontheweb:docker/container-hardening-1287-1303
Jun 3, 2026

@Aaronontheweb

Collaborator

Summary

Two follow-ups from the #1279 container-lifecycle rework (#1282):

  • #1287 (Docker image PID 1 (entrypoint.sh) doesn't reap reparented zombies): reap orphaned subprocesses.
    Insert tini as PID 1 via the ENTRYPOINT. entrypoint.sh supervises netclawd
    with netclawd & wait $PID, so it only reaps its own direct child. When a
    netclawd tool subprocess is orphaned (its parent exits), it reparents to
    PID 1. With entrypoint.sh as PID 1 these were never reaped, so they piled
    up as <defunct> zombies over a long-running container's lifetime. tini is
    the canonical tiny init that reaps them; -g forwards signals to the whole
    process group so docker stop reaches netclawd even mid-backoff.

  • #1303 (Dedup container smoke test bring-up/health-poll with validate_docker_image.yml): dedup the Docker smoke setup.
    The minimal-provider env contract
    (NETCLAW_Providers__validate__* / NETCLAW_Models__Main__*) and the
    /api/health/ready poll were copy-pasted across the two
    validate_docker_image.yml verify steps and test-daemon-lifecycle.sh.
    They now live in scripts/docker/lib/smoke-lib.sh, so the port, health
    path, provider-env contract, and crash-bail logic are defined in one place
    and can't silently drift.
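
For readers unfamiliar with the supervision shape mentioned in the first bullet, here is a minimal sketch of the "start in background, then wait" pattern. It is not the project's entrypoint.sh (which is not shown in this PR); the function name `supervise_once` is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the `cmd & wait $PID` supervision shape.
# Not the real entrypoint.sh; the name supervise_once is made up.
supervise_once() {
  # Start the supervised command as a direct child of this shell.
  "$@" &
  child=$!
  # `wait` collects only this direct child. Grandchildren whose parent
  # exits are reparented to PID 1 and are never seen by this call, which
  # is why a reaping init is needed when this shell itself is PID 1.
  wait "$child"
}
```

The function returns the child's exit status, so a caller can decide whether to restart or bail. It deliberately does nothing about descendants of the child, which is the gap the tini change addresses.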

Changes

  • docker/Dockerfile: add tini to the apt set; ENTRYPOINT ["/usr/bin/tini", "-g", "--", "/opt/netclaw/entrypoint.sh"]; update the header comments to
    describe the tini → entrypoint.sh → netclawd tree.
  • scripts/docker/lib/smoke-lib.sh (new): netclaw_smoke_env_args and
    netclaw_wait_healthy (0 healthy / 1 timeout / 2 exited).
  • scripts/docker/test-daemon-lifecycle.sh: source the lib; add Phase D
    (orphan reaping — spawn a process that reparents to PID 1, kill it, assert it
    is reaped not left <defunct>); replace the bare PPID == 1 supervision
    assertion with a chain check (netclawd → entrypoint.sh → PID 1) that
    still expresses "supervised, not a detached exec-session daemon" now that
    tini, not entrypoint.sh, is PID 1.
  • .github/workflows/validate_docker_image.yml: both verify steps source the
    lib; broaden the path filter to scripts/docker/** so a change to the shared
    lib re-runs the image gate.
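
The three-way return contract listed for netclaw_wait_healthy (0 healthy / 1 timeout / 2 exited) can be illustrated with a small sketch. The real function's signature and internals are not shown in this PR, so everything below is an assumption: the name `wait_healthy_sketch`, its four parameters, and the idea of passing the readiness probe and the liveness check as command strings are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a 0 / 1 / 2 health-poll contract.
# Arguments (all assumptions, not the real smoke-lib.sh interface):
#   $1 probe command: exits 0 once the service reports ready
#   $2 alive command: exits 0 while the container is still running
#   $3 maximum number of attempts (default 30)
#   $4 delay between attempts in seconds (default 1)
wait_healthy_sketch() {
  probe_cmd=$1
  alive_cmd=$2
  attempts=${3:-30}
  delay=${4:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Ready: stop polling and report success.
    if sh -c "$probe_cmd" >/dev/null 2>&1; then
      return 0
    fi
    # Not ready and no longer running: bail early instead of
    # burning the rest of the startup budget.
    if ! sh -c "$alive_cmd" >/dev/null 2>&1; then
      return 2
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  # Still running but never became ready within the budget.
  return 1
}
```

In the PR's setting the probe would presumably be an HTTP request to /api/health/ready and the liveness check a query of the container's state, but those details live in smoke-lib.sh and are not reproduced here. Distinguishing 2 from 1 lets a caller print crash logs immediately rather than reporting a generic timeout.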

Validation

Built the image locally and ran the full lifecycle smoke test — all four
phases pass:

initial: count=1 pid=21 supervision=ok (port :5199)
Phase A: config write re-binds :5199 -> :5200 in-process (same pid, supervision ok)
Phase B: 'netclaw daemon start' defers — "Daemon managed by container supervisor (PID 21)"
Phase C: bad Daemon config fails loud (supervisor observes exit) + recovers
Phase D: orphan pid 346 reaped by PID 1 (tini)

The same validate_docker_image workflow runs this on the PR (it touches
docker/** and scripts/docker/**).

Closes #1287
Closes #1303

@Aaronontheweb Aaronontheweb added the docker Docker image packaging, publishing, and containerized workflows label Jun 3, 2026
Insert tini as PID 1 (ENTRYPOINT) so orphaned netclawd tool subprocesses
that reparent to PID 1 are reaped instead of piling up as <defunct>
zombies. entrypoint.sh only `wait`s its own direct child, so without a
reaping init these accumulated over a long-running container's lifetime
(netclaw-dev#1287). `tini -g` forwards signals to the whole process group so a
`docker stop` reaches netclawd even mid-backoff.
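
A rough way to tell a reaped process from a lingering zombie on Linux is to read its state from /proc. The helpers below are a sketch under that assumption (a mounted /proc); the names `proc_state` and `assert_reaped` are hypothetical and are not taken from test-daemon-lifecycle.sh.

```shell
#!/bin/sh
# Hypothetical helpers; assumes Linux with /proc mounted.

# Print the kernel state letter for a PID (R, S, Z, ...),
# or "gone" if the PID no longer has a /proc entry.
proc_state() {
  if [ -r "/proc/$1/status" ]; then
    awk '/^State:/ { print $2; exit }' "/proc/$1/status"
  else
    echo gone
  fi
}

# Succeed only when the PID has been fully reaped, i.e. it no longer
# exists. A PID that is still running, or that lingers in state Z
# (<defunct>), makes this fail.
assert_reaped() {
  state=$(proc_state "$1")
  [ "$state" = "gone" ]
}
```

A zombie shows up with state `Z` until its parent (or PID 1, after reparenting) calls wait on it, so a check of this shape fails for exactly the processes that a non-reaping PID 1 would leave behind.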

Extract the Docker smoke-test minimal-config env contract and the
health-poll into scripts/docker/lib/smoke-lib.sh so the two
validate_docker_image.yml verify steps and the lifecycle regression test
share one source of truth instead of 2-3 copies that silently drift when
the health route, provider-env contract, or startup budget changes (netclaw-dev#1303).

Extend the lifecycle regression test with Phase D (orphan reaping) and
replace its "PPID == 1" supervision assertion with a chain check
(netclawd -> entrypoint.sh -> PID 1) that still expresses "supervised, not
a detached exec-session daemon" now that tini, not entrypoint.sh, is PID 1.
Broaden the workflow path filter to scripts/docker/** so changes to the
shared lib re-run the image gate.
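
The chain check described above (netclawd -> entrypoint.sh -> PID 1) could be sketched roughly as follows. This is not the test's actual code: `parent_of` and `check_chain` are hypothetical names, and the sketch assumes Linux with /proc mounted.

```shell
#!/bin/sh
# Hypothetical sketch of a two-hop supervision check; assumes Linux /proc.

# Print the parent PID of the given PID.
parent_of() {
  awk '/^PPid:/ { print $2; exit }' "/proc/$1/status"
}

# check_chain DAEMON_PID SUPERVISOR_NAME
# Succeeds when the daemon's parent command line mentions SUPERVISOR_NAME
# and that parent's own parent is PID 1.
check_chain() {
  daemon_pid=$1
  supervisor_name=$2
  sup_pid=$(parent_of "$daemon_pid") || return 1
  [ -n "$sup_pid" ] || return 1
  # /proc/<pid>/cmdline is NUL-separated; turn it into one line to match on.
  tr '\0' ' ' < "/proc/$sup_pid/cmdline" | grep -q -- "$supervisor_name" || return 1
  [ "$(parent_of "$sup_pid")" = "1" ]
}
```

Checking two hops instead of a bare PPID == 1 keeps the assertion meaningful once an init such as tini sits at PID 1: a daemon started from a detached exec session would not have the supervisor script as its parent, so the check would fail for it.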
@Aaronontheweb

Collaborator Author

Uses tini as PID 1 to reap orphaned processes over the container's lifetime: https://github.com/krallin/tini

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) June 3, 2026 07:15

@Aaronontheweb Aaronontheweb left a comment

Collaborator Author

Docker-only change, LGTM - smoke tests came back clean but we'll see how it performs in the wild

@Aaronontheweb Aaronontheweb merged commit 2a97c3d into netclaw-dev:dev Jun 3, 2026
14 checks passed
@Aaronontheweb Aaronontheweb deleted the docker/container-hardening-1287-1303 branch June 3, 2026 07:24

Labels

docker Docker image packaging, publishing, and containerized workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dedup container smoke test bring-up/health-poll with validate_docker_image.yml
Docker image PID 1 (entrypoint.sh) doesn't reap reparented zombies

1 participant