fix: improve gateway restart resilience and WSL stability by albsa · Pull Request #8213 · NousResearch/hermes-agent

albsa · 2026-04-12T06:53:19Z

What does this PR do?

This PR improves Hermes gateway reliability during restarts and makes local speech-to-text more stable on WSL-style environments.

Specifically it:

saves in-flight gateway turns when restart drain times out
resumes those interrupted turns automatically after the gateway starts again
switches local faster-whisper loading to CPU/int8 for reliability when partial CUDA setups cause libcublas failures
updates the FAQ to recommend idempotent WSL startup wrappers so healthy gateways are not bounced by repeated startup triggers

Related Issue

Fixes #

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
📝 Documentation update
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

`gateway/run.py`

track active inbound events per session
persist interrupted in-flight turns to a restart-resume file when restart drain times out
replay those saved turns automatically after gateway startup
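
The persist-and-replay steps above can be illustrated with a minimal sketch. It is not the code in this PR: the file name `restart_resume.json`, the function names, and the shape of the saved events are all assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical location; the PR does not name the restart-resume file.
RESUME_FILE = Path("restart_resume.json")


def save_interrupted_turns(active_events: dict, path: Path = RESUME_FILE) -> None:
    """Atomically write the still-active inbound events, keyed by session id."""
    if not active_events:
        return  # nothing was in flight, so leave no file behind
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(active_events, fh)
    os.replace(tmp_name, path)  # rename is atomic on the same filesystem


def load_interrupted_turns(path: Path = RESUME_FILE) -> dict:
    """Read the saved turns and delete the file so each turn replays at most once."""
    try:
        with path.open(encoding="utf-8") as fh:
            turns = json.load(fh)
    except FileNotFoundError:
        return {}  # clean startup: no replay
    path.unlink()
    return turns
```

Deleting the file on load means a startup without interrupted turns (or a second startup after a replay) finds nothing to resume.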

`tools/transcription_tools.py`

force local faster-whisper to load with device="cpu" and compute_type="int8"
avoids WSL/CTranslate2/CUDA partial-runtime failures such as missing libcublas.so.12
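
As a rough sketch of the forced-CPU load, assuming the standard `faster_whisper.WhisperModel` constructor: the helper name `load_local_whisper`, the constant, and the default model size are hypothetical and not taken from `tools/transcription_tools.py`.

```python
# Settings described in this PR: never touch the CUDA runtime for local STT.
LOCAL_WHISPER_KWARGS = {"device": "cpu", "compute_type": "int8"}


def load_local_whisper(model_size: str = "base"):
    """Load faster-whisper on CPU with int8 quantization (hypothetical helper)."""
    # Imported lazily because faster-whisper is an optional dependency.
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, **LOCAL_WHISPER_KWARGS)
```

The trade-off is that machines with a working GPU also run on CPU, which is why a fallback rather than a hard override may be preferable.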

`website/docs/reference/faq.md`

document idempotent WSL startup guidance
warn against startup wrappers that restart an already healthy gateway
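
One way to make a startup wrapper idempotent is to check for a live gateway process before launching anything. The sketch below is an assumption-laden illustration, not the wrapper from the FAQ: the PID file path and the start command are placeholders, and a PID file is only one possible health signal.

```python
import os
import subprocess
from pathlib import Path

PID_FILE = Path("/tmp/hermes-gateway.pid")  # hypothetical path
START_CMD = ["hermes", "gateway"]           # placeholder; use your real start command


def gateway_is_running(pid_file: Path) -> bool:
    """Return True if the PID recorded in pid_file belongs to a live process."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def ensure_gateway(pid_file: Path, start_cmd: list) -> bool:
    """Start the gateway only if it is not already running.

    Returns True when a new process was started, False when left untouched.
    """
    if gateway_is_running(pid_file):
        return False  # healthy gateway: do not bounce it
    proc = subprocess.Popen(start_cmd, start_new_session=True)
    pid_file.write_text(str(proc.pid))
    return True
```

Calling `ensure_gateway(PID_FILE, START_CMD)` from every startup trigger is then safe to repeat, since only the first call launches a process.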

How to Test

1. Restart-resume behavior

start Hermes gateway
send a Telegram/gateway message that takes long enough to still be running during restart
trigger a gateway restart while the task is active
verify the gateway stores the interrupted turn and resumes it automatically after startup
verify the user eventually gets a final reply instead of being left with only tool-progress output

2. STT stability

send a voice note in a WSL environment with local STT enabled
verify local transcription works in CPU/int8 mode
verify the previous libcublas/CUDA failure path no longer occurs

3. Regression check

ensure normal gateway requests still complete correctly
ensure startup without interrupted turns does not attempt any replay

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform:
- WSL2 / Linux gateway environment

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Validation run locally:

python3 -m py_compile gateway/run.py
./venv/bin/python -m pytest tests/gateway/test_session_boundary_hooks.py -q

Observed user-facing failure before fix:

gateway could emit tool-progress messages on Telegram
gateway restart/drain timeout could interrupt active work
user would receive no final reply

Observed behavior after local fix:

restart-resume logic compiles cleanly
targeted gateway tests pass
FAQ updated with idempotent WSL startup guidance

Notes

This PR intentionally excludes local environment-specific startup wrapper changes outside the repository.

teknium1 · 2026-04-28T01:56:25Z

Thanks for the contribution @albsa! All three changes in this PR have since been implemented on main independently.

This is an automated hermes-sweeper review.

Gateway restart-resume — cb4addaca (fix(gateway): auto-resume sessions after drain-timeout restart #12301) introduced resume_pending on SessionEntry, mark_resume_pending() / clear_resume_pending() in gateway/session.py, the drain-timeout flagging block in gateway/run.py (line 2808), and the reason-aware system note injected on the first post-restart turn (line 10413). This is a fuller implementation of the same concept.
faster-whisper CUDA → CPU fallback — 4350668ae (fix(transcription): fall back to CPU when CUDA runtime libs are missing) added _load_local_whisper_model() with try-device=auto / catch-CUDA-lib / retry-device=cpu compute_type=int8, plus a mid-transcribe() eviction+retry path, in tools/transcription_tools.py (line 362). Main's implementation is strictly more complete than the forced-CPU approach here.
WSL gateway FAQ — a8fd7257b (feat(gateway): WSL-aware gateway #7510) already added a WSL-specific FAQ section in website/docs/reference/faq.md (line 431) covering systemd unreliability, foreground/tmux/nohup alternatives, wsl.conf steps, and Task Scheduler auto-start guidance.
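
For comparison with the forced-CPU approach, the try-auto-then-retry-on-CPU pattern mentioned in the faster-whisper item above can be sketched as follows. This is not the `_load_local_whisper_model()` on main: the function name, the injected `factory` argument, and the error-message matching are assumptions for illustration.

```python
from typing import Any, Callable


def load_with_cpu_fallback(factory: Callable[..., Any], model_size: str) -> Any:
    """Try automatic device selection first, then retry on CPU/int8.

    `factory` stands in for a model constructor such as
    faster_whisper.WhisperModel; injecting it keeps the sketch testable.
    """
    try:
        return factory(model_size, device="auto", compute_type="auto")
    except (RuntimeError, OSError) as exc:
        message = str(exc).lower()
        # Assumption: CUDA runtime problems mention libcublas or cuda in the text.
        if "libcublas" not in message and "cuda" not in message:
            raise  # unrelated failure: do not mask it
        return factory(model_size, device="cpu", compute_type="int8")
```

Unlike a hard CPU override, this keeps GPU acceleration on machines where the CUDA runtime loads correctly.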

No further action needed — closing as implemented.

fix: improve gateway restart resilience and WSL stability

5b0d781

laolaoshiren mentioned this pull request Apr 15, 2026

[Bug]: Telegram topic session can behave like /new after gateway restart/update despite persistent-session design #10163

Open

teknium1 closed this Apr 28, 2026

albsa wants to merge 1 commit into NousResearch:main from albsa:fix/gateway-restart-resume-stt-docs
