fix(gateway): auto-recover stale runtime lock after Windows sleep/wake #22483

Open
wshshz wants to merge 1 commit into NousResearch:main from wshshz:fix/gateway-stale-lock-recovery-windows

wshshz commented May 9, 2026

When Windows enters sleep mode, the gateway process is suspended but its byte-range lock on gateway.lock persists. After wake-up, the old process may still be alive even though its connections are dead. Because the lock (taken with msvcrt.LK_NBLCK) is mandatory on Windows, a new instance cannot start and exits with "Gateway runtime lock is already held by another instance. Exiting."
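
The failure mode can be reproduced in miniature. The sketch below is not the code from gateway/run.py: `try_acquire_lock` is a hypothetical helper, and the `fcntl.flock` branch is an assumption added only so the example also runs on non-Windows hosts. It illustrates that a second non-blocking attempt fails for as long as the first handle stays open, whether or not the holder is still doing useful work.

```python
import os
import sys
import tempfile


def try_acquire_lock(path):
    """Return an open handle holding the lock, or None if it is already held.

    Hypothetical helper for illustration; not the gateway's actual code.
    """
    handle = open(path, "a+b")
    try:
        if sys.platform == "win32":
            import msvcrt

            # Lock one byte at offset 0; LK_NBLCK fails immediately if it is taken.
            handle.seek(0)
            msvcrt.locking(handle.fileno(), msvcrt.LK_NBLCK, 1)
        else:
            import fcntl

            # Assumption: advisory flock stands in for the Windows lock here.
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "gateway.lock")
    holder = try_acquire_lock(lock_path)
    print("first attempt:", "acquired" if holder else "refused")
    second = try_acquire_lock(lock_path)
    print("second attempt:", "acquired" if second else "refused")
```

The lock is released only when the holding handle is closed or the holding process exits, which is why a suspended-then-resumed process keeps blocking new instances.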

Changes in gateway/status.py:

  • Add Windows-compatible process detection via kernel32.OpenProcess with a wmic fallback (the old code relied on /proc/pid/cmdline, which does not exist on Windows)
  • Add force_recover_stale_gateway_lock() with three-tier detection:
    1. PID lookup (process dead → lock is stale)
    2. Gateway process identity check (not a gateway → lock is stale)
    3. Heartbeat staleness (last heartbeat > 10 min → sleep/wake)
  • When any tier flags staleness, force-kill the old process and clean up the lock file so the new instance can proceed
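
The detection order described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the body of force_recover_stale_gateway_lock(): `pid_alive` and `stale_reason` are hypothetical names, the identity check is stubbed with a callable because the wmic query is not shown in this PR description, and the 600-second threshold is taken from the "10 min" figure in the list. The force-kill and lock-file cleanup steps are deliberately left out.

```python
import ctypes
import os
import sys
import time

STALE_AFTER_SECONDS = 600  # "last heartbeat > 10 min" in the description


def pid_alive(pid):
    """Best-effort check that a process with this PID is currently running."""
    if pid <= 0:
        return False
    if sys.platform == "win32":
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.OpenProcess.restype = wintypes.HANDLE
        kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
        kernel32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
        handle = kernel32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if not handle:
            # ERROR_ACCESS_DENIED (5) means the process exists but is protected.
            return ctypes.get_last_error() == 5
        try:
            code = wintypes.DWORD()
            ok = kernel32.GetExitCodeProcess(handle, ctypes.byref(code))
            return bool(ok) and code.value == 259  # STILL_ACTIVE
        finally:
            kernel32.CloseHandle(handle)
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence on POSIX
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def stale_reason(pid, last_heartbeat, now=None,
                 is_alive=pid_alive, is_gateway=lambda pid: True):
    """Return why the lock looks stale, or None if the holder seems healthy."""
    now = time.time() if now is None else now
    if not is_alive(pid):                             # tier 1: PID lookup
        return "process is dead"
    if not is_gateway(pid):                           # tier 2: identity check
        return "PID belongs to a non-gateway process"
    if now - last_heartbeat > STALE_AFTER_SECONDS:    # tier 3: heartbeat age
        return "heartbeat is stale"
    return None
```

Ordering the checks from cheapest to most speculative means the heartbeat heuristic only decides the outcome when the PID is live and really does belong to a gateway.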

Changes in gateway/run.py:

  • On lock acquisition failure, call force_recover_stale_gateway_lock() before giving up
  • Add a periodic heartbeat thread (120 s) that updates the runtime status timestamp so the staleness detection is reliable
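
A heartbeat of this kind might look roughly like the following. It is a minimal sketch, assuming the runtime status is a small JSON file: `start_heartbeat`, the file layout, and the field names are hypothetical and are not taken from gateway/run.py. Only the 120-second interval comes from the description.

```python
import json
import os
import tempfile
import threading
import time


def start_heartbeat(status_path, interval=120.0):
    """Start a daemon thread that rewrites the status timestamp periodically.

    Returns (stop_event, thread); set the event to end the loop.
    """
    stop = threading.Event()

    def beat():
        while True:
            # Write to a temp file and swap it in so readers never see a partial file.
            tmp_path = status_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as fh:
                json.dump({"pid": os.getpid(), "heartbeat": time.time()}, fh)
            os.replace(tmp_path, status_path)
            # wait() returns True as soon as stop is set, ending the loop promptly.
            if stop.wait(interval):
                break

    thread = threading.Thread(target=beat, name="gateway-heartbeat", daemon=True)
    thread.start()
    return stop, thread


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "status.json")
    stop, thread = start_heartbeat(path, interval=0.1)
    time.sleep(0.3)
    stop.set()
    thread.join()
    with open(path, encoding="utf-8") as fh:
        print(json.load(fh))
```

Waiting on an event instead of calling time.sleep() lets shutdown interrupt the 120-second pause instead of blocking on it.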

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alt-glitch added labels on May 11, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery)