fix(gateway): explain stale runtime lock failures (#28561) #28603

Open
wesleysimplicio wants to merge 1 commit into NousResearch:main from wesleysimplicio:codex/fix-28561-stale-runtime-lock-guidance

wesleysimplicio (Contributor) commented May 19, 2026

What does this PR do?

This PR improves gateway startup diagnostics around runtime-lock acquisition.

When Hermes cannot acquire gateway.lock and also cannot identify a live gateway PID, it now reports the failure as a likely stale-lock case and points the user to the supported recovery path (hermes gateway run --replace) instead of implying that another live gateway instance definitely exists.

Root cause

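Per #28561, start_gateway treated any failure to acquire gateway.lock as proof that another instance was running, without verifying that the lock-holder PID was still alive. The detailed rationale from the original PR body is preserved below. This template update keeps the review structure consistent with #29640.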

Fix

  • Keep the existing competing-instance message when a live gateway PID is still present.
  • Add a narrower stale-lock diagnostic when the lock is held but the recorded PID no longer points to a live gateway process.
  • Include the concrete gateway.lock and gateway.pid paths in the error so users can inspect the runtime state quickly.
  • Keep recovery conservative by pointing to the existing --replace path instead of force-killing or auto-removing lock files.
  • Add focused regression coverage for this startup branch.
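
The branching described in the list above can be illustrated with a minimal sketch. It is not the code from this PR: the names `read_recorded_pid` and `describe_lock_failure`, the message wording, and the assumption that gateway.pid holds a bare integer are all hypothetical. The liveness check is passed in as a callable so the decision logic stays platform-neutral.

```python
from pathlib import Path
from typing import Callable, Optional


def read_recorded_pid(pid_path: Path) -> Optional[int]:
    """Return the PID stored in pid_path, or None if it is missing or unparsable.

    Assumption: the file contains a bare integer. The real gateway.pid
    format may differ.
    """
    try:
        return int(pid_path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_lock_failure(
    lock_path: Path,
    pid_path: Path,
    is_alive: Callable[[int], bool],
) -> str:
    """Pick a diagnostic after the runtime lock could not be acquired.

    Hypothetical helper: the message text is illustrative only.
    """
    pid = read_recorded_pid(pid_path)
    if pid is not None and is_alive(pid):
        # A live PID is still recorded: keep the competing-instance message.
        return f"Another gateway instance is already running (PID {pid})."
    # No live PID could be identified: report a likely stale lock and point
    # to the supported recovery path instead of deleting anything.
    return (
        f"Could not acquire {lock_path}, and no live gateway PID was found "
        f"in {pid_path}. The lock is likely stale. "
        "Recover with: hermes gateway run --replace"
    )
```

Neither branch touches the lock or PID file; the function only reports, which matches the conservative recovery stance described above.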

Why this shape

This shape mirrors #29640 so reviewers can quickly compare scope, root cause, fix, tests, and related context without having to decode a custom PR description.

Tests

The original PR body below contains the previous validation notes, commands, or test plan.
No code changes are introduced by this formatting update itself.

Related PRs / issues

Fixes #28561

Original body

What does this PR do?

This PR improves gateway startup diagnostics around runtime-lock acquisition.

When Hermes cannot acquire gateway.lock and also cannot identify a live gateway PID, it now reports the failure as a likely stale-lock case and points the user to the supported recovery path (hermes gateway run --replace) instead of implying that another live gateway instance definitely exists.

Solution Sketch

  • Keep the existing competing-instance message when a live gateway PID is still present.
  • Add a narrower stale-lock diagnostic when the lock is held but the recorded PID no longer points to a live gateway process.
  • Include the concrete gateway.lock and gateway.pid paths in the error so users can inspect the runtime state quickly.
  • Keep recovery conservative by pointing to the existing --replace path instead of force-killing or auto-removing lock files.
  • Add focused regression coverage for this startup branch.

Related Issue

Fixes #28561

Related / Overlap Check

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • re-checks for a live gateway PID after runtime-lock acquisition fails
  • keeps the existing message when a live competing PID is present
  • emits a more specific stale-lock message when no live PID can be found
  • includes the concrete gateway.lock and gateway.pid paths plus the supported recovery path
  • adds regression coverage for the stale-lock startup branch
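
As a rough illustration of what "re-checks for a live gateway PID" can involve, here is a minimal POSIX-only liveness probe. The function name `pid_is_alive` is hypothetical and this is not the check used in Hermes. It only tests that some process with the PID exists, not that the process is a gateway, and it deliberately refuses to run on Windows, where `os.kill(pid, 0)` is not a harmless probe.

```python
import os
import sys


def pid_is_alive(pid: int) -> bool:
    """Best-effort check that a process with this PID exists (POSIX only)."""
    if sys.platform == "win32":
        # On Windows, os.kill(pid, 0) is not a harmless probe, so refuse here.
        raise NotImplementedError("use a Windows-specific process query instead")
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

Because PIDs can be reused, a positive result alone does not prove the holder is a gateway; a real check would also need to confirm the process identity.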

How to Test

  1. Run python -m pytest tests/gateway/test_runner_startup_failures.py::test_start_gateway_reports_stale_runtime_lock_guidance tests/gateway/test_runner_startup_failures.py::test_start_gateway_replace_clears_marker_on_permission_denied tests/gateway/test_runner_startup_failures.py::test_start_gateway_verbosity_imports_redacting_formatter -q -n 4.
  2. Confirm the stale-lock path now gives targeted recovery guidance.
  3. Confirm the live-PID path still keeps the existing competing-instance message.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: focused gateway startup regression tests

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • Focused validation: python -m pytest tests/gateway/test_runner_startup_failures.py::test_start_gateway_reports_stale_runtime_lock_guidance tests/gateway/test_runner_startup_failures.py::test_start_gateway_replace_clears_marker_on_permission_denied tests/gateway/test_runner_startup_failures.py::test_start_gateway_verbosity_imports_redacting_formatter -q -n 4.
  • Result: 3 passed in 15.33s on Windows.
  • CI note: the broad test workflow is currently failing in unrelated gateway/kanban/tools tests; the touched focused startup tests are listed above.

Generated by Hermes Turbo


Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P2: Medium (degraded but workaround exists)
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

# Issue: start_gateway should verify lock-holder PID is alive before treating stale lock as "another instance"

2 participants