Add support for Atropos Agentic RL environments (requires branch tool…#17

Merged
teknium1 merged 1 commit into main from atropos-hermes-agent
Feb 7, 2026

Conversation

@teknium1

@teknium1 teknium1 commented Feb 7, 2026

Contributor

…_call_support in Atropos atm)

- Added new environments for reinforcement learning, including `HermesSweEnv` for software engineering tasks and `TerminalTestEnv` for inline testing.
- Introduced `ToolContext` for unrestricted access to tools during reward computation.
- Updated `.gitignore` to exclude `wandb/` directory.
- Enhanced `README.md` with detailed architecture and usage instructions for Atropos environments.
- Added configuration files for SWE and terminal test environments to streamline setup.
- Removed unnecessary compiled Python files from `__pycache__`.
@teknium1 teknium1 merged commit 7f1cd01 into main Feb 7, 2026
sudo-yf pushed a commit to sudo-yf/hermes-agent that referenced this pull request Apr 5, 2026
Adds the newly available GLM-5.1 model to the hardcoded Z.AI provider
model list so it appears in the model dropdown. Fixes NousResearch#17.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ahmedaltewaj

QA Evidence Update — 2026-04-06

QuickScan run confirms Mixed Content on additional pages beyond Dashboard:

  • /app/cold-outreach — Mixed Content: advisories API
  • /app/settings — Mixed Content: advisories API

All 8 pages affected:

| Page | Errors |
| --- | --- |
| /app/dashboard | 5 Mixed Content (entities, reminders, advisories) |
| /app/inbox | 1 Mixed Content (advisories) |
| /app/inbox/entities | 1 Mixed Content (advisories) |
| /app/knowledge-graph | 1 Mixed Content (advisories) + 2x 401 |
| /app/advisories | 5 Mixed Content |
| /app/cvr | 1 Mixed Content |
| /app/cold-outreach | 1 Mixed Content (NEW) |
| /app/settings | 1 Mixed Content (NEW) |

Root cause: Frontend calls API at http://leksikon.ai/api/v1/... instead of https://leksikon.ai/api/v1/...
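
A minimal sketch of the check behind these findings, assuming only the standard library. The helper names `is_mixed_content` and `upgrade_to_https` are illustrative and do not come from the QA tooling:

```python
from urllib.parse import urlsplit


def is_mixed_content(page_url: str, resource_url: str) -> bool:
    """Return True when an HTTPS page requests a plain-HTTP resource."""
    page = urlsplit(page_url)
    resource = urlsplit(resource_url)
    return page.scheme == "https" and resource.scheme == "http"


def upgrade_to_https(resource_url: str) -> str:
    """Rewrite an http:// URL to https://, leaving other URLs unchanged."""
    parts = urlsplit(resource_url)
    if parts.scheme != "http":
        return resource_url
    return parts._replace(scheme="https").geturl()


page = "https://leksikon.ai/app/dashboard"
api = "http://leksikon.ai/api/v1/reminders/"
print(is_mixed_content(page, api))   # the browser would block this request
print(upgrade_to_https(api))
```

Browsers apply this rule per request, so a single `http://` base URL in the frontend build is enough to block every API call on every page.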

@ahmedaltewaj

QA Evidence — Quick Scan 2026-04-06

Confirmed Mixed Content errors on /app/dashboard:

Mixed Content: The page at 'https://leksikon.ai/app/dashboard' was loaded over HTTPS, 
but requested an insecure resource 'http://leksikon.ai/api/v1/advisories/?page=1&limit=10'. 
Mixed Content: The page at 'https://leksikon.ai/app/dashboard' was loaded over HTTPS, 
but requested an insecure resource 'http://leksikon.ai/api/v1/entities/?limit=10'. 
Mixed Content: The page at 'https://leksikon.ai/app/dashboard' was loaded over HTTPS, 
but requested an insecure resource 'http://leksikon.ai/api/v1/reminders/'. 

Same pattern also found on: /app/inbox, /app/inbox/entities, /app/knowledge-graph

Mode: quick_scan
QA agent run: 2026-04-06

@ahmedaltewaj

QA Evidence — 2026-04-06 (evening)
Confirmed on all 8 pages scanned: dashboard, inbox, entities, knowledge-graph, advisories, cvr, cold-outreach, settings. All of them make plain-HTTP API calls, which the browser blocks because the pages are served over HTTPS.

Pattern: All API endpoints use http://leksikon.ai/api/v1/* instead of https://
Affects: /api/v1/reminders/, /api/v1/advisories/, /api/v1/entities/

@ahmedaltewaj

QA evidence — 2026-04-06 18:30 CET

New Mixed Content errors confirmed across additional pages:

  • /app/cold-outreach: http://leksikon.ai/api/v1/advisories/?page=1&limit=50 blocked
  • /app/settings: http://leksikon.ai/api/v1/advisories/?page=1&limit=50 blocked
  • Settings also shows: 404 and 403 on teammedlemmer API calls (separate regression — see Update snapshot id for ipython #8)

Root cause: frontend is hardcoded to use http:// for API calls instead of respecting the page protocol. All 8 pages affected.
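
One way to "respect the page protocol" is to derive the API origin from the page URL instead of hardcoding a scheme. The sketch below shows the idea in Python for consistency with the rest of this repository; the function name `api_url_for` is hypothetical and the real frontend is not shown in this thread:

```python
from urllib.parse import urlsplit, urlunsplit


def api_url_for(page_url: str, api_path: str, query: str = "") -> str:
    """Build an API URL that reuses the scheme and host of the current page."""
    page = urlsplit(page_url)
    return urlunsplit((page.scheme, page.netloc, api_path, query, ""))


print(api_url_for("https://leksikon.ai/app/settings",
                  "/api/v1/advisories/", "page=1&limit=50"))
```

Because the scheme is copied from the page, an HTTPS page can never produce an `http://` API request through this path.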

@ahmedaltewaj

QA Evidence — 2026-04-06T18:56:17+00:00 (QUICK_SCAN)

Mixed Content confirmed on ALL pages (1-8) visited over HTTPS. Error pattern:

Mixed Content: The page at 'https://leksikon.ai/app/[PAGE]' was loaded over HTTPS, 
but requested an insecure resource 'http://leksikon.ai/api/v1/advisories/?page=1&limit=50'. 
This request has been blocked.

Additional Mixed Content endpoints found:

  • http://leksikon.ai/api/v1/reminders/ (advisories page)
  • http://leksikon.ai/api/v1/entities/?limit=10 (advisories page)

Additional errors on specific pages:

@ahmedaltewaj

QA Run — 2026-04-06 21:00 UTC

Scanned all 8 pages. Mixed Content errors confirmed on:

  • /app/dashboard (reminders, entities, advisories ×2)
  • /app/inbox (advisories)
  • /app/inbox/entities (advisories)
  • /app/knowledge-graph (advisories + 401 ×2)
  • /app/advisories (reminders, entities, advisories ×2)
  • /app/cvr (advisories)
  • /app/cold-outreach (advisories)
  • /app/settings (advisories + 404 + 403 teammedlemmer)

New finding: Settings → Team tab also shows 403 on teammedlemmer endpoint. Issue #8 confirmed.

@ahmedaltewaj

Fixed as of 2026-04-06. Root cause: the Docker image was rebuilt from a stale cached frontend build (the frontend-builder stage reused a cached npm build from a version that still had http://). The correct deployment process requires copying frontend/dist to backend/public before the Docker build, but the Docker build was running npm run build inside the container while the host's backend/public held a different (older) copy.

Fix applied: Rebuilt Docker image with --no-cache to force fresh npm build, which now uses https://leksikon.ai in all API calls. Verified on all 8 pages (dashboard, inbox, entities, knowledge-graph, advisories, cvr, cold-outreach, settings) — zero Mixed Content errors.

@ahmedaltewaj

QA evidence — 2026-04-07T00:43:55+00:00

Quick Scan 2026-04-07T00:40+02:00 — Mixed Content still occurring across ALL pages:

| Page | MC errors on this page |
| --- | --- |
| /app/dashboard | 6 errors (reminders/, advisories x2, entities/, advisories x2) |
| /app/inbox | 1 new (advisories/?limit=50) |
| /app/inbox/entities | 1 new (advisories/?limit=50) |
| /app/knowledge-graph | 1 new (advisories/?limit=50) + 2x 401 |
| /app/advisories | 5 new (advisories x2, entities/, reminders/, advisories/?limit=50) |
| /app/cvr | 1 new (advisories/?limit=50) |
| /app/cold-outreach | 1 new (advisories/?limit=50) |
| /app/settings | 1 new (advisories/?limit=50) + 403 teammedlemmer + 404 |

All advisories/ API calls blocked by browser. Data not loading on affected pages.

h4x3rotab pushed a commit to Clawdi-AI/hermes-agent that referenced this pull request Apr 10, 2026
…ilience (NousResearch#17)

Merged CrazySerGo's fixes with conflict resolution:
- Root redirect → /dashboard
- Error boundary in __root.tsx
- window.process polyfill for SSR hydration
- WebSocket handshake timeout + retry logic
- Terminal: platform-aware shell defaults (zsh on macOS, bash on Linux, powershell on Windows)
- Terminal: os.homedir() fallback
- Dynamic gateway URL in vite proxy and debug-analyzer
- Loading indicator for chat history
- Icon size normalization

Removed unnecessary chromium-bidi dependency.
Kept SSR externals for playwright and pty-helper copy plugin.
malaiwah pushed a commit to malaiwah/hermes-agent that referenced this pull request Apr 11, 2026
…+ apt-cacher-ng' (NousResearch#17) from fix/ai-review-opencode into main
teknium1 added a commit that referenced this pull request Apr 17, 2026
Two follow-ups to the cherry-picked PR #9873 (`e3bcc819`):

1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
   so test fixtures that build the adapter via `object.__new__`
   (skipping __init__) don't crash with AttributeError.
   See AGENTS.md pitfall #17 — same pattern as gateway.run.

2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
   - role-only config bypasses the gateway 'no allowlists' branch
   - roles + users combined still authorizes user-allowlist matches
   - the role bypass does NOT leak to other platforms (Telegram, etc.)

3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
   auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
   a previous test in the session can't flip later 'should-reject' tests
   into false-pass.
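
The env-isolation idea in item 3 can be sketched without pytest as a context manager. This is an illustration only: `DISCORD_ALLOWED_USERS` is an assumed variable name (only `DISCORD_ALLOWED_ROLES` is named above), and the actual fixture presumably uses pytest's `monkeypatch.delenv` inside an autouse fixture.

```python
import os
from contextlib import contextmanager

# Assumed list of auth variables; only DISCORD_ALLOWED_ROLES appears in the commit message.
DISCORD_AUTH_ENV_VARS = ("DISCORD_ALLOWED_ROLES", "DISCORD_ALLOWED_USERS")


@contextmanager
def clean_discord_auth_env():
    """Remove Discord auth variables for the duration of a test, then restore them."""
    saved = {name: os.environ.pop(name)
             for name in DISCORD_AUTH_ENV_VARS if name in os.environ}
    try:
        yield
    finally:
        # Drop anything the test itself set, then put the original values back.
        for name in DISCORD_AUTH_ENV_VARS:
            os.environ.pop(name, None)
        os.environ.update(saved)
```

Running every test inside this scope means a value set by an earlier test cannot turn a later "should-reject" case into a false pass.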

Required because the bare cherry-pick of #9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
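
The `getattr`-with-default pattern from item 1 can be illustrated with a stripped-down adapter. The class and attribute names here are hypothetical stand-ins, not the real Discord adapter:

```python
class Adapter:
    """Hypothetical adapter; the real class has several _allowed_*_ids sets."""

    def __init__(self, allowed_user_ids=None):
        self._allowed_user_ids = set(allowed_user_ids or ())

    def is_allowed_user(self, user_id: str) -> bool:
        # getattr with a default tolerates instances whose __init__ never ran.
        allowed = getattr(self, "_allowed_user_ids", set())
        return user_id in allowed


# Test fixtures sometimes build the object this way, skipping __init__ entirely.
bare = object.__new__(Adapter)
print(bare.is_allowed_user("42"))             # no AttributeError
print(Adapter(["42"]).is_allowed_user("42"))
```

With a plain `self._allowed_user_ids` access, the first call would raise `AttributeError` because the attribute was never assigned.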
OutThisLife added a commit that referenced this pull request Apr 22, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 #17) and
  shaves the 'starting agent' hang perception (row 5 #1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
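
A rough sketch of the named-constant approach, written in Python rather than the TypeScript of the actual UI. `ERASE_SCROLLBACK` is taken from the message above; the `?1049h`/`?1049l` alt-screen sequences are the common xterm ones and are an assumption about what `AlternateScreen` emits:

```python
# Named constants instead of inline escape sequences.
ERASE_SCROLLBACK = "\x1b[3J"       # from the commit message
ENTER_ALT_SCREEN = "\x1b[?1049h"   # assumed: standard xterm alternate-screen entry
EXIT_ALT_SCREEN = "\x1b[?1049l"    # assumed: standard xterm alternate-screen exit


def alt_screen_entry() -> str:
    """Sequence written on entry: clear scrollback first, then switch screens."""
    return ERASE_SCROLLBACK + ENTER_ALT_SCREEN


def alt_screen_exit() -> str:
    """Sequence written on exit to return to the main screen."""
    return EXIT_ALT_SCREEN
```

Nothing is written to the main screen before `alt_screen_entry()`, which is what removes the flash of the boot banner.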
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…gent

Add support for Atropos Agentic RL environments (requires branch tool…
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
Two follow-ups to the cherry-picked PR NousResearch#9873 (`e3bcc819`):

1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
   so test fixtures that build the adapter via `object.__new__`
   (skipping __init__) don't crash with AttributeError.
   See AGENTS.md pitfall NousResearch#17 — same pattern as gateway.run.

2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
   - role-only config bypasses the gateway 'no allowlists' branch
   - roles + users combined still authorizes user-allowlist matches
   - the role bypass does NOT leak to other platforms (Telegram, etc.)

3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
   auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
   a previous test in the session can't flip later 'should-reject' tests
   into false-pass.

Required because the bare cherry-pick of NousResearch#9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and
  shaves the 'starting agent' hang perception (row 5 NousResearch#1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
Two follow-ups to the cherry-picked PR NousResearch#9873 (`e3bcc819`):

1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
   so test fixtures that build the adapter via `object.__new__`
   (skipping __init__) don't crash with AttributeError.
   See AGENTS.md pitfall NousResearch#17 — same pattern as gateway.run.

2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
   - role-only config bypasses the gateway 'no allowlists' branch
   - roles + users combined still authorizes user-allowlist matches
   - the role bypass does NOT leak to other platforms (Telegram, etc.)

3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
   auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
   a previous test in the session can't flip later 'should-reject' tests
   into false-pass.

Required because the bare cherry-pick of NousResearch#9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and
  shaves the 'starting agent' hang perception (row 5 #1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
teknium1 added a commit that referenced this pull request May 2, 2026
… success log) (#18761)

* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.
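
A minimal sketch of the "config.yaml wins" semantics, assuming a simple key mapping. `BRIDGE_KEYS` and `bridge_config_to_env` are illustrative names, and the single mapping shown is inferred from the symptom described above rather than copied from gateway/run.py:

```python
import os
from typing import MutableMapping

# Assumed mapping, inferred from the symptom (agent.max_turns vs HERMES_MAX_ITERATIONS).
BRIDGE_KEYS = {("agent", "max_turns"): "HERMES_MAX_ITERATIONS"}


def bridge_config_to_env(config: dict,
                         environ: MutableMapping[str, str] = os.environ) -> None:
    """Copy config values into the environment, overwriting stale entries."""
    for (section, key), env_name in BRIDGE_KEYS.items():
        value = config.get(section, {}).get(key)
        if value is None:
            continue  # nothing configured: leave any existing env value alone
        # No 'if env_name not in environ' guard: config.yaml is authoritative.
        environ[env_name] = str(value)


env = {"HERMES_MAX_ITERATIONS": "60"}          # stale value from an old setup run
bridge_config_to_env({"agent": {"max_turns": 500}}, env)
print(env["HERMES_MAX_ITERATIONS"])
```

With the old guard in place, the stale `60` would have survived because the variable was already present.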

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.
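
The fix in item 1 amounts to checking the returned result before logging. A sketch under stated assumptions: `SendResult` here is a stand-in dataclass whose `error` field is hypothetical, and the helper returns the log level it used only so the behaviour is easy to check:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("gateway.sketch")


@dataclass
class SendResult:
    success: bool
    error: Optional[str] = None  # hypothetical field name


def log_restart_notification(result: SendResult, chat: str) -> int:
    """Log at INFO only when the send succeeded; otherwise log a WARNING."""
    if result.success:
        logger.info("Sent restart notification to %s", chat)
        return logging.INFO
    logger.warning("Restart notification to %s failed: %s", chat, result.error)
    return logging.WARNING
```

Because `adapter.send()` never raises, a `try`/`except` around the call cannot detect the failure; only the returned `success` flag can.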

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall #17 in AGENTS.md) keep working unmodified.
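
The shutdown-time exit classification can be sketched as follows. The class name, method name, and the error label's use as a return value are assumptions; the return-code set and the `getattr` default follow the description above:

```python
# Exit codes treated as expected while shutting down (0, SIGINT, SIGTERM).
SHUTDOWN_EXIT_CODES = frozenset({0, -2, -15})


class BridgeAdapter:
    """Hypothetical stand-in for the WhatsApp adapter."""

    def __init__(self):
        self._shutting_down = False

    def check_bridge_exit(self, returncode):
        """Return None for expected exits, or an error label for fatal ones."""
        if returncode is None:
            return None  # process is still running
        # getattr default keeps helpers that bypass __init__ working.
        shutting_down = getattr(self, "_shutting_down", False)
        if shutting_down and returncode in SHUTDOWN_EXIT_CODES:
            return None
        return "whatsapp_bridge_exited"


adapter = BridgeAdapter()
adapter._shutting_down = True   # what disconnect() would set before terminating
print(adapter.check_bridge_exit(-15))
print(adapter.check_bridge_exit(137))
```

An OOM kill (137) stays fatal even during shutdown because it is not in the expected set.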
@artmoneyceo

🔴 RED TEAM AUDIT COMPLETE — 2026-05-09

CRITICAL (0 outstanding) | HIGH (0 outstanding) | MEDIUM (3) | LOW (2)

Vulnerabilities found and patched

| ID | Severity | Description | Status |
| --- | --- | --- | --- |
| 1.1 | CRITICAL | WhatsApp send() had no principal-only guard | ✅ PATCHED |
| 1.2 | CRITICAL | Telegram send() had no principal-only guard | ✅ PATCHED |
| 2.1 | CRITICAL | WA adapter: 4 dead-code violation-surface methods callable | ✅ PATCHED (removed) |
| 1.3 | HIGH | Slack/Discord send() have no SO41 guard (mitigated by token absence) | ✅ DOCUMENTED |
| 2.2 | HIGH | TelegramAdapter _should_process_message allowed group dispatch via config | ✅ PATCHED |
| 3.1 | HIGH | Out-of-band Telegram paths (romantic_detection, claude_bridge) not hardened | ✅ PATCHED |
| 3.2 | HIGH | claude_cli_proxy.py on localhost:11435 had no authentication | ✅ PATCHED |
| 6.1 | HIGH | Test suites asserted doctrine-violating group dispatch behaviors | ✅ PATCHED (all tests rewritten) |

Outstanding (principal review required)

  • VULN-3.3 (MEDIUM): claude_bridge.py PROTECTED_PATTERNS substring match bypassable
  • VULN-4.2 (MEDIUM): No startup warning when principal_chat_id empty
  • VULN-7.1 (LOW): Proxy timeout inference channel (mitigated by localhost)
  • VULN-8.1 (LOW): API server binding verification by source inspection only

Adversarial test results

201 pass, 9 expected skips, 0 fail

  • V1/V2/V3 outbound enforcement: 109/109
  • X-suite red team (8 attack classes): 17 pass + 9 skip
  • Q1/R-suite SO41 doctrine: 32/32
  • Group gating suites (WA + TG): 42/42

Baseline comparison

| State | Failures |
| --- | --- |
| Pre-audit | 106 (pre-existing) |
| Post-audit | 73 |
| New regressions | 0 |

Full report: ~/operator-os/intel/RED_TEAM_AUDIT_2026_05_09.md
Branch: fix/b3-remove-nonprincipal-outbound

@artmoneyceo

✅ OUTSTANDING VULNERABILITY PATCHES COMPLETE — 2026-05-09

All 5 authorized items patched. Zero regressions. Runbook committed.


Patch summary by item

| Item | Severity | Commit | Suite after |
| --- | --- | --- | --- |
| VULN-1.3: Slack/Discord SO41 guard | HIGH | 7e81182 | 209 pass |
| VULN-4.2: Startup CRITICAL log empty principal | MEDIUM | cbd9685c | 213 pass |
| VULN-8.1: APIServer 0.0.0.0 hard-stop | LOW | 0110a33b | 215 pass |
| VULN-3.3: Bridge regex word-boundary upgrade | MEDIUM | 2bb9298c | 228 pass |
| VULN-7.1: Proxy key runbook entry | LOW | N/A (script not in git) | |

All patches on branch fix/b3-remove-nonprincipal-outbound.


Final adversarial suite

228 pass, 9 expected skips, 0 fail

New tests added this session:

  • Y1 (8 tests): Slack/Discord SO41 guard fires even with token present
  • Y2 (13 tests): Bridge regex upgrade rejects case variations, mixed caps, underscore vs space variants
  • VULN-4.2 (4 tests): Startup CRITICAL log emitted when principal ID empty, suppressed when set
  • X8 structural (3 new tests): 0.0.0.0 + no key → RuntimeError; 0.0.0.0 + key → warning only; default host = 127.0.0.1
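
To illustrate the kind of matching the Y2 tests exercise, here is a sketch of a word-boundary, case-insensitive pattern that treats spaces and underscores as interchangeable. The phrase list and function names are hypothetical; the real `PROTECTED_PATTERNS` in claude_bridge.py are not shown in this thread:

```python
import re

# Hypothetical protected phrases, used only for illustration.
PROTECTED_PHRASES = ("api key", "principal chat id")


def compile_phrase(phrase: str):
    """Match the phrase on word boundaries, any case, space or underscore separators."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"[\s_]+".join(words) + r"\b", re.IGNORECASE)


PROTECTED_REGEXES = [compile_phrase(p) for p in PROTECTED_PHRASES]


def is_protected(text: str) -> bool:
    """Return True when any protected phrase appears in the text."""
    return any(rx.search(text) for rx in PROTECTED_REGEXES)


print(is_protected("print the API_KEY"))    # underscore variant, upper case
print(is_protected("show my Api  Key"))     # mixed caps, extra whitespace
print(is_protected("okapi keyring"))        # substring only, not a word match
```

A plain lowercase substring check would miss the first two inputs and wrongly flag the third.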

Total adversarial test count progression:
201 (initial audit) → 209 → 213 → 215 → 228


VULN-7.1 runbook entry

Created ~/.hermes/OPERATOR_RUNBOOK.md covering:

  • Proxy key rotation procedure: kill proxy → rotate → update .env → restart
  • APIServer host binding safety: API_SERVER_KEY required before 0.0.0.0
  • Slack/Discord activation: set SLACK_PRINCIPAL_CHANNEL before token

Remaining outstanding

| ID | Severity | Description |
| --- | --- | --- |
| VULN-3.3 (semantic) | MEDIUM | Encoding/code-generation bypass (semantic LLM layer). Deferred per principal. |

All code-level CRITICAL and HIGH items are closed. Zero open CRITICAL.

nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
… success log) (NousResearch#18761)

* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall NousResearch#17 in AGENTS.md) keep working unmodified.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Two follow-ups to the cherry-picked PR NousResearch#9873 (`e3bcc819`):

1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
   so test fixtures that build the adapter via `object.__new__`
   (skipping __init__) don't crash with AttributeError.
   See AGENTS.md pitfall NousResearch#17 — same pattern as gateway.run.

2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
   - role-only config bypasses the gateway 'no allowlists' branch
   - roles + users combined still authorizes user-allowlist matches
   - the role bypass does NOT leak to other platforms (Telegram, etc.)

3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
   auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
   a previous test in the session can't flip later 'should-reject' tests
   into false-pass.

Required because the bare cherry-pick of NousResearch#9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and
  shaves the 'starting agent' hang perception (row 5 NousResearch#1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
… success log) (NousResearch#18761)

* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.
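
As an illustration of item 1 above, the following sketch logs success only when the send result reports it. The `SendResult` dataclass here is a stand-in with an assumed `error` field, and `log_restart_notification` is a hypothetical helper, not the gateway's actual function.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("gateway.restart")


@dataclass
class SendResult:
    # Stand-in for the adapter's result type; the 'error' field is assumed.
    success: bool
    error: Optional[str] = None


def log_restart_notification(result: SendResult, chat: str) -> bool:
    """Log at INFO only on success; otherwise log the error at WARNING."""
    if result.success:
        logger.info("Sent restart notification to %s", chat)
        return True
    logger.warning("Restart notification to %s failed: %s", chat, result.error)
    return False
```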

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall NousResearch#17 in AGENTS.md) keep working unmodified.
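
The shutdown classification and the getattr-with-default read can be sketched like this. The class and method names are simplified placeholders, and treating a `None` returncode as "still running" is an assumption. Only the `{0, -2, -15}` set, the `_shutting_down` flag, and the 137 case come from the commit message.

```python
# Exit codes treated as expected while a planned shutdown is in progress.
SHUTDOWN_EXIT_CODES = {0, -2, -15}


class BridgeAdapter:
    def __init__(self):
        self._shutting_down = False

    def disconnect(self):
        # Set the flag before terminating the bridge so the exit check can
        # tell our own SIGTERM apart from an unexpected crash.
        self._shutting_down = True

    def check_bridge_exit(self, returncode):
        """Return None for an expected exit, else a fatal-error tag."""
        if returncode is None:
            return None  # assumed meaning: process still running
        # getattr with a default keeps instances built via object.__new__
        # (which skip __init__) from raising AttributeError.
        shutting_down = getattr(self, "_shutting_down", False)
        if shutting_down and returncode in SHUTDOWN_EXIT_CODES:
            return None
        return "whatsapp_bridge_exited"
```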
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…gent

Add support for Atropos Agentic RL environments (requires branch tool…
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
… success log) (NousResearch#18761)

* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall NousResearch#17 in AGENTS.md) keep working unmodified.
verkyyi added a commit to verkyyi/hermes-agent that referenced this pull request May 27, 2026
…tracked .claude/ file

Note in inventory entry NousResearch#17 that .claude/skills/responsiveness-benchmark/SKILL.md
is the sole file this fork tracks under .claude/ (no .gitignore rule covers it),
so it's easy to spot/relocate on an upstream rebase.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
begjb pushed a commit to begjb/hermes-agent that referenced this pull request May 29, 2026
…doffs

The respawn guard's 'active_pr' check correctly prevents duplicate PR
creation when a worker is respawned on a task that already has a PR
URL in its comment history. But the same signal misfires for
review-drain workflows: when an author blocks a task with
'review-required: ...' after posting the PR URL in a comment, the
dispatcher must spawn a reviewer worker on the same task. The PR URL
in comments is the handoff payload, not a duplicate-PR risk.

Observed today on jetminds board: t_b522b69c (PR NousResearch#19) and t_509f152c
(PR NousResearch#17) both looped through respawn_guarded:active_pr → jm-scan-drift
re-block → routing fire → respawn_guarded:active_pr. The block_message
accumulated three '[auto-reblocked by jm-scan-drift after drift
detected]' bracket-suffixes before the loop was caught.

Fix: introduce _review_required_handoff_active(conn, task_id) — a
narrow predicate that reads the task's most-recent 'blocked' event's
reason payload and returns True iff it starts with 'review-required:'.
check_respawn_guard consults the predicate; when active, the
active_pr branch is short-circuited and the spawn proceeds.

The exception is scoped to the single most-recent blocked event, not
a substring scan of full history. Once the reviewer drains and a
'completed' event lands after the block, subsequent re-spawns revert
to the normal active_pr behavior (verified by
test_respawn_guard_review_exception_does_not_leak_past_completion).
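
A rough sketch of such a predicate, assuming a hypothetical `task_events(id, task_id, kind, reason)` table in SQLite. The real schema and the real `_review_required_handoff_active` may differ; this only illustrates the "most recent event, prefix match" scoping described above.

```python
import sqlite3


def review_required_handoff_active(conn, task_id):
    """True iff the latest blocked/completed event is a review-required block."""
    latest = conn.execute(
        "SELECT kind, reason FROM task_events "
        "WHERE task_id = ? AND kind IN ('blocked', 'completed') "
        "ORDER BY id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    if latest is None:
        return False  # fresh task: no events at all
    kind, reason = latest
    if kind != "blocked":
        return False  # a later 'completed' event ends the exception
    # Prefix match only, tolerating leading whitespace.
    return (reason or "").lstrip().startswith("review-required:")
```

Ordering by `id` and taking a single row is what keeps the exception from leaking: a substring scan over the full history would keep matching long after the reviewer has finished.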

Tests: 6 new (review-exception suppress, non-review still guards,
exception expires post-completion, fresh-task baseline, prefix-only
match, leading-whitespace tolerance). 21/21 respawn-guard tests pass;
all 164 tests in test_kanban_db.py pass. No upstream behavior changed
for non-review-drain workflows.

Refs: substrate task t_ee014b80 (engineering-lead, 2026-05-21)
JetMinds-local; per hermes-fork-patches discipline
teknium1 added a commit that referenced this pull request May 29, 2026
… tests

_adapter_enforces_own_access_policy accessed self.adapters directly, but
several auth tests build a bare GatewayRunner via object.__new__ without
setting .adapters (pitfalls.md #17). Read it defensively with getattr so a
missing/empty adapter map means "no adapter owns the policy" instead of
raising AttributeError.

Fixes 4 tests: test_feishu_bot_auth_bypass, test_discord_bot_auth_bypass (x2),
test_signal::test_signal_in_allowlist_maps.
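
The defensive read can be illustrated with a small stand-in class. The `enforces_own_access_policy` attribute on the adapter is a hypothetical name chosen for the example; the point is only that `getattr` with a default survives instances built via `object.__new__`.

```python
class GatewayRunner:
    def __init__(self, adapters):
        self.adapters = adapters

    def adapter_enforces_own_access_policy(self, platform):
        # A missing or empty adapter map means no adapter owns the policy.
        adapters = getattr(self, "adapters", None) or {}
        adapter = adapters.get(platform)
        # Hypothetical attribute name, used here for illustration only.
        return bool(getattr(adapter, "enforces_own_access_policy", False))


class PolicyAdapter:
    enforces_own_access_policy = True
```

A test helper that calls `object.__new__(GatewayRunner)` never runs `__init__`, so `self.adapters` does not exist on that instance; the `getattr` default turns that into a plain `False` instead of an AttributeError.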
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
… tests

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Two follow-ups to the cherry-picked PR NousResearch#9873 (`e3bcc819`):

1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
   so test fixtures that build the adapter via `object.__new__`
   (skipping __init__) don't crash with AttributeError.
   See AGENTS.md pitfall NousResearch#17 — same pattern as gateway.run.

2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
   - role-only config bypasses the gateway 'no allowlists' branch
   - roles + users combined still authorizes user-allowlist matches
   - the role bypass does NOT leak to other platforms (Telegram, etc.)

3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
   auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
   a previous test in the session can't flip later 'should-reject' tests
   into false-pass.

Required because the bare cherry-pick of NousResearch#9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and
  shaves the 'starting agent' hang perception (row 5 NousResearch#1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
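
The change itself is in TypeScript; the ordering of the escape sequences can be sketched in Python as below. `ERASE_SCROLLBACK` is the sequence named in the commit message. The `?1049h` and `?1049l` sequences are the common xterm alternate-screen codes and are an assumption about what AlternateScreen actually emits.

```python
import io

# Named constants instead of inline escape sequences.
ERASE_SCROLLBACK = "\x1b[3J"       # from the commit message
ENTER_ALT_SCREEN = "\x1b[?1049h"   # assumed: standard xterm alt-screen enable
EXIT_ALT_SCREEN = "\x1b[?1049l"    # assumed: standard xterm alt-screen disable


def enter_alternate_screen(stream):
    """Erase scrollback first, then switch to the alternate screen."""
    stream.write(ERASE_SCROLLBACK + ENTER_ALT_SCREEN)


def exit_alternate_screen(stream):
    """Return to the main screen."""
    stream.write(EXIT_ALT_SCREEN)
```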
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
… success log) (NousResearch#18761)

difeizheng pushed a commit to difeizheng/zdf-hermes-agent that referenced this pull request Jun 3, 2026
Fixes 12 remaining MEDIUM issues from the deep audit (19 total, 7 fixed in Round 12):

design_agent:
- NousResearch#15: add asyncio.wait_for(300s) around LLM API call to prevent infinite hangs
- NousResearch#17: replace 2x hardcoded 'claude-opus-4-8' with shared DEFAULT_MODEL constant

qa_agent / validate_agent:
- NousResearch#20,NousResearch#22,NousResearch#23: already fixed in Round 12 (verified — dynamic timeout/threshold values used)

memory.py:
- NousResearch#24: frontmatter parser uses regex r'^---$' instead of str.split('---',2),
  preventing false splits on content containing '---' (SQL, markdown tables)
- NousResearch#25: parse and preserve 'description' field from frontmatter in metadata,
  fixing write→load roundtrip data loss

profiles.py:
- NousResearch#26: ProfileConfig now frozen=True (immutable dataclass per coding standards)

deploy_agent:
- NousResearch#31: replace 2x sync subprocess.run with asyncio.create_subprocess_exec
- fix 5x .decode() → .decode('utf-8', errors='replace') for Windows CJK safety
- remove unused import subprocess

db.py:
- NousResearch#27: add class docstring explaining RLock + _unlocked pattern
- NousResearch#28: FK constraints already in DDL (verified PRAGMA foreign_keys=ON active)
- NousResearch#29: add _ensure_connection() with PRAGMA integrity_check(1) + auto-reconnect
       on 4 critical methods (create_task, get_task, claim_task, submit_result)
- extract _create_connection() static method for reuse by reconnect

Tests: 79 passed, 0 failed
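
The frontmatter fix listed for memory.py can be sketched as follows. This is an assumed, simplified parser (flat `key: value` lines only), not the fork's actual code; it shows why splitting on a line-anchored `^---$` avoids the false split that `str.split('---', 2)` produces when a value contains `---`.

```python
import re

# Match '---' only when it is the whole line.
FENCE = re.compile(r"^---$", re.MULTILINE)


def split_frontmatter(text):
    """Return (metadata dict, body). Falls back to ({}, text) with no frontmatter."""
    parts = FENCE.split(text, maxsplit=2)
    if len(parts) < 3 or parts[0].strip():
        return {}, text
    meta = {}
    for line in parts[1].strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, parts[2].lstrip("\n")
```

With a document whose description is `tables --- and SQL`, the naive split cuts the description in half, while the anchored regex keeps it whole.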
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
… success log) (NousResearch#18761)
