During high-frequency interaction tests with hermes dashboard on macOS, we hit a failure state that exposes gaps in subprocess lifecycle management. #21370 identifies the leak itself; our experience suggests the leaked processes can also leave the system fragile during later operations such as updates.
Observations
- Accumulation: Intensive UI usage orphans dozens of slash_worker processes (parented to PID 1).
- Update Collision: A routine "hermes update" was run while orphans were present. The update hung and left the environment corrupted.
- Failure State: Subsequent commands failed with:
  .../venv/bin/python3: No module named pip
- Recovery Issue: Orphans survived atexit hooks and remained active, complicating manual recovery and environment sync.
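For anyone trying to confirm the same accumulation, here is a minimal sketch that lists processes whose parent is PID 1 and whose command line mentions slash_worker. The helper names `find_orphans` and `list_orphans` are hypothetical, and matching on the substring "slash_worker" is an assumption about how the worker appears in `ps` output.

```python
import subprocess
from typing import List, Tuple


def find_orphans(ps_output: str, needle: str = "slash_worker") -> List[Tuple[int, str]]:
    """Return (pid, command) for processes matching `needle` whose parent is PID 1."""
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, ppid, full command line
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        pid, ppid, command = int(parts[0]), int(parts[1]), parts[2].strip()
        if ppid == 1 and needle in command:
            orphans.append((pid, command))
    return orphans


def list_orphans() -> List[Tuple[int, str]]:
    """Query the live process table (macOS or Linux) and filter it."""
    # A trailing "=" on each column suppresses the header row.
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "ppid=", "-o", "command="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_orphans(out)


if __name__ == "__main__":
    for pid, command in list_orphans():
        print(pid, command)
```

Running this before and after a burst of UI activity should show whether the orphan count grows.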
Architectural Gaps
- Unmanaged Resource: slash_worker is spawned outside tools.process_registry, making it invisible to global teardown logic.
- Fragile Cleanup: Reliance on atexit is insufficient for update-induced restarts or hard crashes.
- Lack of Self-Termination: The worker cannot detect when its parent has died.
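The atexit limitation is general Python behavior and can be shown without hermes. The sketch below is not taken from the hermes code base: a throwaway child process registers an atexit handler that creates a marker file, and the parent then checks whether the handler ran after a normal exit and after SIGKILL. The names `CHILD` and `cleanup_ran` are illustrative only.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child script: registers an atexit hook, then either exits or waits to be killed.
CHILD = textwrap.dedent("""
    import atexit, sys, time
    marker = sys.argv[1]
    atexit.register(lambda: open(marker, "w").close())
    print("ready", flush=True)
    if sys.argv[2] == "hang":
        time.sleep(60)
""")


def cleanup_ran(mode: str) -> bool:
    """Run the child; kill it with SIGKILL if mode == 'hang'. Report whether atexit fired."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleaned")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker, mode],
            stdout=subprocess.PIPE, text=True,
        )
        proc.stdout.readline()  # wait until the handler is registered
        if mode == "hang":
            proc.send_signal(signal.SIGKILL)
        proc.wait()
        proc.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("normal exit:", cleanup_ran("exit"))
    print("SIGKILL:    ", cleanup_ran("hang"))
```

The same applies to any signal the process does not handle in Python, so cleanup that lives only in atexit hooks is skipped whenever the parent is killed rather than exiting normally.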
Proposed Fix
I have verified a "Defense-in-Depth" solution on macOS following AGENTS.md:
- Unified Management: Extended ProcessRegistry with register_host_process to track these workers.
- Fingerprinted Watchdog: Added a thread to slash_worker.py monitoring parent PID + create_time.
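To illustrate the watchdog idea, here is a minimal stdlib-only sketch; it is not the patch itself. It records the parent's PID together with its start time, so that a recycled PID is not mistaken for the original parent. All names (`Fingerprint`, `fingerprint`, `parent_alive`, `start_watchdog`) are hypothetical. As an assumption for portability, the start time is read with `ps -o lstart=`; a `create_time` value from a library such as psutil (`psutil.Process(pid).create_time()`) would serve the same purpose.

```python
import os
import subprocess
import threading
import time
from typing import Callable, NamedTuple, Optional


class Fingerprint(NamedTuple):
    pid: int
    started: str  # process start time as reported by `ps`


def fingerprint(pid: int) -> Optional[Fingerprint]:
    """Return (pid, start time) for a live process, or None if it is gone."""
    result = subprocess.run(
        ["ps", "-o", "lstart=", "-p", str(pid)],
        capture_output=True, text=True,
    )
    started = result.stdout.strip()
    if result.returncode != 0 or not started:
        return None
    return Fingerprint(pid, started)


def parent_alive(expected: Fingerprint,
                 probe: Callable[[int], Optional[Fingerprint]] = fingerprint) -> bool:
    """True only if the original parent, not a recycled PID, is still running."""
    # After the parent dies the worker is reparented, so getppid() changes.
    if os.getppid() != expected.pid:
        return False
    return probe(expected.pid) == expected


def start_watchdog(expected: Fingerprint,
                   on_orphan: Callable[[], None] = lambda: os._exit(0),
                   interval: float = 2.0,
                   probe: Callable[[int], Optional[Fingerprint]] = fingerprint) -> threading.Thread:
    """Poll in a daemon thread and call on_orphan once the parent is gone."""
    def loop() -> None:
        while parent_alive(expected, probe):
            time.sleep(interval)
        on_orphan()

    thread = threading.Thread(target=loop, name="parent-watchdog", daemon=True)
    thread.start()
    return thread
```

A worker would capture `fingerprint(os.getppid())` once at startup and pass it to `start_watchdog`. The default `on_orphan` uses `os._exit` so the worker terminates even if other threads are blocked; a real implementation may prefer a graceful shutdown path first.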
I have the patch ready for PR. Please let me know if you would like me to submit it.