fix(gateway): fix discrepancies in gateway status#11167

Closed
snreynolds wants to merge 1 commit into
NousResearch:mainfrom
snreynolds:sarareynolds/fix-gateway-status

@snreynolds

Contributor

What does this PR do?

Fixes inconsistent Hermes gateway status reporting for the current profile.
Before this change, different parts of Hermes used different liveness checks:

  • hermes gateway run and gateway-dependent tooling relied on the profile-scoped gateway.pid validator
  • hermes gateway status could instead rely on service-manager state or process-table scanning
  • profile status checks used a weaker PID probe

That could lead to contradictory behavior such as:

  • hermes gateway status saying the gateway was not running
  • hermes gateway then refusing to start because a gateway process was already running
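For illustration, a single shared validator of the kind described above might look roughly like the sketch below. This is not the code in gateway/status.py: the function name `read_live_pid` and its exact checks are assumptions, and the `os.kill(pid, 0)` probe is POSIX-only.

```python
import os
from pathlib import Path
from typing import Optional


def read_live_pid(pid_file: Path) -> Optional[int]:
    """Return the PID recorded in pid_file if that process exists, else None.

    Hypothetical helper; the real validator may perform additional checks.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return None
    try:
        # POSIX only: signal 0 checks for existence without delivering a signal.
        # Do not use this probe on Windows, where os.kill terminates the process.
        os.kill(pid, 0)
    except (ProcessLookupError, OverflowError):
        return None  # stale PID file: no such process
    except PermissionError:
        pass  # the process exists but belongs to another user
    return pid
```

If every status surface calls one helper like this, the status command and the start-up guard cannot disagree about whether the recorded process is alive.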

Related Issue

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added reusable PID-file validation support in gateway/status.py so callers can validate an explicit gateway PID file path with the same logic used by the main gateway runtime.
  • Added shared gateway runtime snapshot helpers in hermes_cli/gateway.py to centralize current-profile liveness reporting.
  • Updated find_gateway_pids() in hermes_cli/gateway.py to fall back to the current profile PID file before relying only on process-table scanning.
  • Updated hermes gateway status in hermes_cli/gateway.py to report service/process mismatches more clearly instead of silently producing contradictory output.
  • Updated other CLI status surfaces to use the shared runtime snapshot:
    • hermes_cli/status.py
    • hermes_cli/dump.py
  • Updated profile gateway checks in hermes_cli/profiles.py to use the shared PID validator instead of a weaker custom implementation.
  • Added/updated targeted tests in:
    • tests/gateway/test_status.py
    • tests/hermes_cli/test_gateway.py
    • tests/hermes_cli/test_gateway_service.py
    • tests/hermes_cli/test_profiles.py
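As a rough sketch of the shared-snapshot idea, the dataclass below combines service-manager state with the validated PID and reports a mismatch explicitly. The class name `RuntimeSnapshot`, its fields, and the message wording are all hypothetical and are not taken from hermes_cli/gateway.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeSnapshot:
    """Hypothetical stand-in for a shared gateway runtime snapshot."""

    service_active: Optional[bool]  # None when no service manager is in use
    pid: Optional[int]              # validated PID from the profile's PID file

    @property
    def running(self) -> bool:
        # The validated PID is treated as the source of truth for liveness.
        return self.pid is not None

    def describe(self) -> str:
        """Summarize the state, calling out service/process disagreement."""
        if self.service_active is True and not self.running:
            return "mismatch: service active but no gateway process found"
        if self.service_active is False and self.running:
            return f"mismatch: service inactive but gateway process {self.pid} is running"
        return "running" if self.running else "not running"
```

With this shape, status, dump, and profile checks can all render the same snapshot rather than each running its own probe.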

How to Test

  1. Reproduce the bug before the fix:
    • Start a gateway process for the current profile manually.
    • Put Hermes in a state where hermes gateway status does not rely on the same liveness path as hermes gateway run.
    • Observe that hermes gateway status can report "not running" while hermes gateway refuses to start because a gateway process is already running.
  2. Verify the fix:
    • Run hermes gateway status
    • Confirm it now reflects the current profile's actual gateway process state more accurately and surfaces service/process mismatches explicitly.

Checklist

Code

  • [x] I've read the Contributing Guide
  • [x] My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • [x] I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform:

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

  • This skill is broadly useful to most users (if bundled) — see Contributing Guide
  • SKILL.md follows the standard format (frontmatter, trigger conditions, steps, pitfalls)
  • No external dependencies that aren't already available (prefer stdlib, curl, existing Hermes tools)
  • I've tested the skill end-to-end: hermes --toolsets skills -q "Use the X skill to do Y"

Screenshots / Logs

Linux2010 added a commit to Linux2010/hermes-agent that referenced this pull request Apr 17, 2026
## What broke
Meta-only messages (e.g., `/model`, `/tools`) cause the agent to get stuck
forever. After a meta-only command is issued, the agent becomes unresponsive
to subsequent requests such as `/new` or regular messages.

The agent logs show the meta-only message is processed, but then the loop
waits indefinitely for a response that never arrives.

## Root cause
In `chat_with_model()` at run_agent.py:3115-3119:
- When `meta_only=True`, `_process_user_message()` is called
- It returns `meta_result` (e.g., "Model changed")
- But the code continues to the response processing at lines 3136-3148
- Meta-only messages don't produce LLM responses, so response is None
- The loop waits for `_get_response_content(response)` indefinitely

The original code:
```python
if meta_only:
    meta_result = await self._process_user_message(...)
# Then continues to response loop without returning
```

## Why this fix is minimal
Added 5 lines: immediate return for meta-only path.

```python
if meta_only:
    meta_result = await self._process_user_message(...)
    # Meta-only messages don't produce LLM responses.
    # Return the meta_result directly.
    return meta_result if meta_result else "Processed meta-only message."
```

No changes to regular message handling (meta_only=False path unchanged).
No changes to `_process_user_message()` or `_run_meta_only_handler()`.
No opportunistic refactoring.

## What I tested
Added test suite tests/test_meta_only_stuck_fix.py:
- test_meta_only_returns_immediately
- test_meta_only_does_not_enter_response_loop
- test_meta_only_with_none_response
- test_meta_only_flag_detection
- test_process_user_message_meta_only_calls_handler
- test_chat_with_model_meta_only_exits_early

All tests verify that the meta-only path returns immediately without getting stuck.

## What I intentionally did not change
- No changes to regular message handling
- No changes to `_run_meta_only_handler()` implementation
- No changes to response content processing
- No opportunistic refactoring

## Evidence
Before: `/model` → agent stuck, no response, `/new` ignored
After: `/model` → "Model changed" response, agent responsive

Fixes NousResearch#11167
@snreynolds snreynolds marked this pull request as ready for review April 17, 2026 17:10
@teknium1

Contributor

Merged via #11896 — your commit was cherry-picked onto current main with your authorship preserved (commit 8ab1aa2). Really clean abstraction with the GatewayRuntimeSnapshot dataclass — cut a lot of duplicated platform-branching across status/dump/profiles. Thanks for the contribution, Sara!

@snreynolds

Contributor Author

@teknium1 sweet, thanks for the review!
