Skip to content

fix(gateway): trust adapter-owned access policy over env default-deny (#34515)#34572

Merged
teknium1 merged 3 commits into
mainfrom
hermes/hermes-0675370b
May 29, 2026
Merged

fix(gateway): trust adapter-owned access policy over env default-deny (#34515)#34572
teknium1 merged 3 commits into
mainfrom
hermes/hermes-0675370b

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Config-driven platform access policies (dm_policy / group_policy / allow_from / group_allow_from) for WeCom, Weixin, Yuanbao, and QQBot now work without also setting a PLATFORM_ALLOWED_USERS env var.

Root cause: these adapters enforce their access policy at intake — a message is dropped inside the adapter and never dispatched unless it already passed the policy. The gateway's env-based check (_is_user_authorized) ran afterward and, with no env allowlist set, fell through to an env-only default-deny — silently rejecting dm_policy: open and config-only allowlists the adapter had already authorized.

Salvage of #34515 (@Frowtek) — same bug, rebuilt fix. Instead of re-implementing each adapter's policy a second time in run.py (a duplicate decision surface that drifts), adapters that own their gate declare it via one capability flag and the gateway trusts it.

Changes

  • gateway/platforms/base.py: new enforces_own_access_policy property (default False) — the single contract.
  • gateway/platforms/{wecom,weixin,yuanbao,qqbot}: one-line override → True.
  • gateway/run.py: _adapter_enforces_own_access_policy() helper + trust branch in _is_user_authorized — when no env allowlist is set and the adapter owns its policy, honor the adapter's decision instead of default-denying. Env allowlists still win when set.
  • gateway/run.py: _get_unauthorized_dm_behavior reads config dm_policy so allowlist/disabled drop unauthorized DMs silently (no pairing-code leak) while pairing opts back in.
  • tests/gateway/test_config_driven_access_policy.py: 20 tests across the base contract, adapter overrides, gateway trust, env-precedence, and DM-behavior resolution.
  • scripts/release.py: AUTHOR_MAP entry for contributor credit.

Why not the original shape

The original PR copied each adapter's policy evaluation (prefix stripping, case-folding, per-group sender allowlists, the four policy verbs) into run.py — 437 lines, a second source of truth. This version is ~85 lines of logic and trusts the gate the adapter already ran. Adding a future own-policy adapter is one property override, not another copy.

Validation

Suite Result
New test_config_driven_access_policy.py 20/20
unauthorized_dm + wecom/weixin/qqbot/platform_registry 337/337
test_yuanbao_pipeline.py 94/94
E2E (real adapter classes + gateway wiring) own-policy→authorized, non-owning→denied, env-allowlist-excludes→denied, dm_policy resolution all correct

Closes #34515.

Infographic

gateway-access-policy-adapter-trust

…#34515)

Config-driven platform policies (dm_policy / group_policy / allow_from /
group_allow_from) for WeCom, Weixin, Yuanbao, and QQBot now work without
also setting a PLATFORM_ALLOWED_USERS env var.

These adapters enforce their access policy at intake — a message is dropped
inside the adapter and never dispatched unless it already passed the policy.
The gateway's env-based check (_is_user_authorized) ran afterward and, with
no env allowlist set, fell through to an env-only default-deny — silently
rejecting `dm_policy: open` and config-only allowlists the adapter had
already authorized.

Rather than re-implement each adapter's policy a second time in run.py
(which would drift), adapters that own their gate now declare it via a new
BasePlatformAdapter.enforces_own_access_policy property (default False). The
gateway trusts that flag and skips the env-only default-deny for those
platforms. Env allowlists still take precedence when set.

Also resolves unauthorized DM behavior from config dm_policy so allowlist /
disabled policies drop unauthorized DMs silently instead of leaking pairing
codes, while an explicit pairing policy opts back in.

Co-authored-by: Frowtek <frowte3k@gmail.com>
@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-0675370b vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9449 on HEAD, 9447 on base (🆕 +2)

🆕 New issues (2):

Rule Count
unresolved-import 1
invalid-argument-type 1
First entries
tests/gateway/test_config_driven_access_policy.py:25: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/gateway/test_config_driven_access_policy.py:96: [invalid-argument-type] invalid-argument-type: Argument to function `BasePlatformAdapter.enforces_own_access_policy` is incorrect: Expected `BasePlatformAdapter`, found `object`

✅ Fixed issues: none

Unchanged: 4904 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

teknium1 added 2 commits May 29, 2026 03:59
… tests

_adapter_enforces_own_access_policy accessed self.adapters directly, but
several auth tests build a bare GatewayRunner via object.__new__ without
setting .adapters (pitfalls.md #17). Read it defensively with getattr so a
missing/empty adapter map means "no adapter owns the policy" instead of
raising AttributeError.

Fixes 4 tests: test_feishu_bot_auth_bypass, test_discord_bot_auth_bypass (x2),
test_signal::test_signal_in_allowlist_maps.
Two CI flakes surfaced on PR #34572 (both in files this PR doesn't touch;
pre-existing host-dependent flakes):

1. test_process_registry::TestPopenLeakOnSetupFailure — the failure-cleanup
   tests use a fake proc.pid (8888/9999) and assert proc.kill() runs. But
   spawn_local's primary cleanup is os.killpg(os.getpgid(pid), SIGKILL),
   falling back to proc.kill() only on ProcessLookupError/PermissionError/
   OSError. When the fake PID happens to exist on a busy host, os.getpgid
   succeeds, os.killpg fires against an UNRELATED real process group, and
   proc.kill() is never reached -> flaky AssertionError (and a real risk of
   SIGKILLing an innocent process group from a unit test). Patch os.getpgid
   to raise ProcessLookupError so the fallback path runs deterministically
   and no real killpg is ever issued.

2. test_web_server::test_resize_escape_is_forwarded — the receive loop calls
   the blocking conn.receive_bytes() with no exception guard. Once the child
   prints its winsize and exits, the PTY closes; on a missed-marker run the
   next recv blocks until the 30s pytest-timeout instead of failing fast.
   Add a try/except break (matching the working sibling tests) and bump the
   child's pre-read sleep 0.15s -> 0.5s so the resize reliably lands first.

Verified: 4/4 pass across 3 consecutive runs; root cause for #1 reproduced
(os.getpgid(1) succeeds -> old code skips proc.kill).
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery platform/wecom WeCom / WeChat Work adapter platform/qqbot QQ Bot adapter labels May 29, 2026
@teknium1 teknium1 merged commit 1c53d39 into main May 29, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-0675370b branch May 29, 2026 11:22
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Two CI flakes surfaced on PR NousResearch#34572 (both in files this PR doesn't touch;
pre-existing host-dependent flakes):

1. test_process_registry::TestPopenLeakOnSetupFailure — the failure-cleanup
   tests use a fake proc.pid (8888/9999) and assert proc.kill() runs. But
   spawn_local's primary cleanup is os.killpg(os.getpgid(pid), SIGKILL),
   falling back to proc.kill() only on ProcessLookupError/PermissionError/
   OSError. When the fake PID happens to exist on a busy host, os.getpgid
   succeeds, os.killpg fires against an UNRELATED real process group, and
   proc.kill() is never reached -> flaky AssertionError (and a real risk of
   SIGKILLing an innocent process group from a unit test). Patch os.getpgid
   to raise ProcessLookupError so the fallback path runs deterministically
   and no real killpg is ever issued.

2. test_web_server::test_resize_escape_is_forwarded — the receive loop calls
   the blocking conn.receive_bytes() with no exception guard. Once the child
   prints its winsize and exits, the PTY closes; on a missed-marker run the
   next recv blocks until the 30s pytest-timeout instead of failing fast.
   Add a try/except break (matching the working sibling tests) and bump the
   child's pre-read sleep 0.15s -> 0.5s so the resize reliably lands first.

Verified: 4/4 pass across 3 consecutive runs; root cause for NousResearch#1 reproduced
(os.getpgid(1) succeeds -> old code skips proc.kill).
hechuyi pushed a commit to hechuyi/hermes-agent that referenced this pull request Jun 6, 2026
Two CI flakes surfaced on PR NousResearch#34572 (both in files this PR doesn't touch;
pre-existing host-dependent flakes):

1. test_process_registry::TestPopenLeakOnSetupFailure — the failure-cleanup
   tests use a fake proc.pid (8888/9999) and assert proc.kill() runs. But
   spawn_local's primary cleanup is os.killpg(os.getpgid(pid), SIGKILL),
   falling back to proc.kill() only on ProcessLookupError/PermissionError/
   OSError. When the fake PID happens to exist on a busy host, os.getpgid
   succeeds, os.killpg fires against an UNRELATED real process group, and
   proc.kill() is never reached -> flaky AssertionError (and a real risk of
   SIGKILLing an innocent process group from a unit test). Patch os.getpgid
   to raise ProcessLookupError so the fallback path runs deterministically
   and no real killpg is ever issued.

2. test_web_server::test_resize_escape_is_forwarded — the receive loop calls
   the blocking conn.receive_bytes() with no exception guard. Once the child
   prints its winsize and exits, the PTY closes; on a missed-marker run the
   next recv blocks until the 30s pytest-timeout instead of failing fast.
   Add a try/except break (matching the working sibling tests) and bump the
   child's pre-read sleep 0.15s -> 0.5s so the resize reliably lands first.

Verified: 4/4 pass across 3 consecutive runs; root cause for NousResearch#1 reproduced
(os.getpgid(1) succeeds -> old code skips proc.kill).

(cherry picked from commit 1c53d39)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists platform/qqbot QQ Bot adapter platform/wecom WeCom / WeChat Work adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants