Skip to content

[security] fix(gateway): require Graph webhook clientState#22353

Open
Hinotoi-agent wants to merge 3 commits into
NousResearch:mainfrom
Hinotoi-agent:security/msgraph-webhook-require-client-state
Open

[security] fix(gateway): require Graph webhook clientState#22353
Hinotoi-agent wants to merge 3 commits into
NousResearch:mainfrom
Hinotoi-agent:security/msgraph-webhook-require-client-state

Conversation

@Hinotoi-agent

@Hinotoi-agent Hinotoi-agent commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR hardens the Microsoft Graph webhook listener so externally reachable notification handling fails closed unless the receiver has a configured Graph clientState shared secret.

  • Requires a non-empty client_state before the msgraph_webhook listener starts accepting notification POSTs.
  • Changes the request-path verifier to reject notifications if the adapter is ever instantiated without an expected clientState.
  • Changes the default listener host from all interfaces to loopback and documents when an operator should intentionally bind wider.
  • Adds focused regression tests for the startup guard, POST rejection path, and loopback default.

Security issues covered

Issue Impact Severity
Missing fail-closed clientState requirement for Graph webhook notifications A deployment with msgraph_webhook enabled but no shared secret could accept arbitrary well-formed notification POSTs that matched the configured resource allowlist and schedule them as internal Hermes events. High
Broad default listener bind address The listener defaulted to 0.0.0.0, making an incomplete webhook configuration easier to expose directly instead of behind an intentional reverse proxy/tunnel. Hardening / defense-in-depth

Before this PR

  • _verify_client_state() returned True when no expected client_state was configured.
  • MSGraphWebhookAdapter.connect() could start the listener without a notification shared secret.
  • The default bind address was 0.0.0.0.
  • The docs warned not to run without client_state, but the code still allowed that insecure configuration.
  • Tests covered valid/invalid clientState comparisons but did not lock in fail-closed behavior for a missing expected secret.

After this PR

  • connect() validates the security configuration before registering and starting the HTTP listener.
  • Missing client_state now raises at startup instead of leaving a public POST endpoint in an unauthenticated mode.
  • _verify_client_state() returns False if no expected secret exists, preserving fail-closed behavior if the adapter is constructed directly in tests or future code paths.
  • The default host is 127.0.0.1, keeping the listener local unless an operator explicitly chooses a wider bind address.
  • Regression tests cover the loopback default, missing-secret POST rejection, and missing-secret startup failure.

Explicit operator choice

Secure default behavior:

platforms:
  msgraph_webhook:
    enabled: true
    extra:
      client_state: "<strong random shared secret>"
      host: "127.0.0.1"

Intentional public-bind behavior:

platforms:
  msgraph_webhook:
    enabled: true
    extra:
      client_state: "<strong random shared secret>"
      host: "0.0.0.0"

The listener can still be exposed through a reverse proxy, tunnel, or explicit public bind when the operator chooses that deployment model. The important change is that the webhook receiver no longer starts without the Graph clientState secret that authenticates notification POSTs.

Why this matters

Microsoft Graph webhooks are public HTTPS callbacks. clientState is Microsoft's documented shared-secret mechanism for letting a receiver distinguish legitimate Graph notifications from arbitrary internet POSTs. Accepting notifications without that secret turns the endpoint into an unauthenticated inbound event surface.

The impact is integrity-focused: a forged notification can be accepted and scheduled as if it came from Graph. This PR does not claim code execution or data disclosure.

Attack flow

Unauthenticated internet POST
    -> Microsoft Graph webhook listener without configured client_state
        -> clientState verification returns true because no expected secret exists
            -> Hermes accepts and schedules the notification as an internal event

Affected code

Issue Files
Missing fail-closed clientState requirement gateway/platforms/msgraph_webhook.py, tests/gateway/test_msgraph_webhook.py, website/docs/user-guide/messaging/msgraph-webhook.md
Broad default listener bind address gateway/platforms/msgraph_webhook.py, tests/gateway/test_msgraph_webhook.py, website/docs/user-guide/messaging/msgraph-webhook.md

Root cause

Missing fail-closed clientState requirement:

  • The verifier treated an absent expected secret as an allow-all state.
  • Startup did not enforce the same authentication requirement that the docs recommended.

Broad default listener bind address:

  • The listener defaulted to all interfaces even though Graph webhook deployments commonly sit behind a public HTTPS proxy or tunnel.
  • Operators had to opt out of broad binding instead of opting into it intentionally.

CVSS assessment

Issue CVSS v3.1 Vector
Missing fail-closed clientState requirement 7.5 High CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:H/A:N
Broad default listener bind address Not separately scored Defense-in-depth hardening that reduces accidental exposure; the primary scored issue is the missing shared-secret fail-closed behavior.

Rationale:

  • The webhook endpoint is network-reachable in deployments that expose it for Microsoft Graph callbacks.
  • Attack complexity is low when the listener is exposed without client_state.
  • The demonstrated impact is integrity of Graph-derived event processing, not direct code execution or confidentiality loss.

Safe reproduction steps

1. Missing clientState fail-closed behavior

  1. Configure the msgraph_webhook listener without client_state.
  2. Send a well-formed notification POST to the webhook path with no shared-secret match possible.
  3. Observe that pre-patch verification accepts the request because the missing expected secret makes _verify_client_state() return True.
  4. Apply this PR and repeat: startup fails closed when client_state is absent, and the POST verifier also rejects if the adapter is ever instantiated without one.

2. Listener default bind address

  1. Instantiate the adapter without an explicit host override.
  2. Observe that pre-patch the default host is 0.0.0.0.
  3. Apply this PR and observe that the default host is 127.0.0.1 unless the operator explicitly configures a wider bind address.

Expected vulnerable behavior

Before this PR, a deployment that enabled the Graph webhook listener without a client_state could accept arbitrary well-formed notification POSTs that matched the configured resource allowlist. The listener also defaulted to all interfaces, which made direct exposure easier when operators did not override the host.

After this PR, that missing-secret configuration is rejected at startup, request-path verification fails closed, and the listener defaults to loopback.

Changes in this PR

  • Adds _validate_security_config() and calls it from MSGraphWebhookAdapter.connect() before the HTTP listener starts.
  • Changes _verify_client_state() to return False when no expected secret is configured.
  • Changes DEFAULT_HOST from 0.0.0.0 to 127.0.0.1.
  • Adds regression tests for the loopback default, missing-secret POST rejection, and missing-secret startup failure.
  • Updates the Microsoft Graph webhook docs to mark client_state as required and describe the loopback default / explicit public-bind choice.
  • Stabilizes one CI-sensitive MCP cleanup test assertion so full xdist runs assert the intended grace-period behavior without depending on a process-wide sleep() mock seeing exactly one call.

Files changed

Category Files What changed
Gateway platform gateway/platforms/msgraph_webhook.py Adds startup validation, fail-closed missing-secret verification, and loopback default host.
Tests tests/gateway/test_msgraph_webhook.py Adds focused regression coverage for missing client_state and default host behavior.
CI test stabilization tests/tools/test_mcp_stability.py Avoids a brittle global time.sleep call-count assertion under full xdist CI.
Documentation website/docs/user-guide/messaging/msgraph-webhook.md Documents client_state as required and clarifies loopback vs explicit public bind behavior.

Maintainer impact

  • Operators already following the documented quick start with MSGRAPH_WEBHOOK_CLIENT_STATE / extra.client_state are unaffected.
  • Deployments that enabled the listener without a secret now fail at startup instead of accepting unauthenticated notifications.
  • Deployments that intentionally bind the listener to all interfaces can still set host: "0.0.0.0" explicitly.
  • The patch is limited to the Microsoft Graph webhook adapter, its docs, focused tests, and one CI-stability assertion needed for the PR's test run.

Fix rationale

Requiring client_state at startup is the right boundary because it prevents an externally reachable webhook receiver from entering an unauthenticated notification mode. Keeping the same fail-closed check in _verify_client_state() provides defense in depth if future code constructs the adapter directly or bypasses connect().

Changing the default host to loopback makes accidental direct exposure less likely while preserving an explicit operator-controlled path for deployments that intentionally bind publicly or run behind network controls.

Type of change

  • Security fix
  • Tests
  • Documentation update
  • Refactor with no behavior change

Test plan

  • Focused Microsoft Graph webhook regression tests.
  • Ruff on the touched gateway/test files.
  • Python compile check for the touched gateway implementation.

Executed with:

python -m pytest tests/gateway/test_msgraph_webhook.py -q
python -m ruff check gateway/platforms/msgraph_webhook.py tests/gateway/test_msgraph_webhook.py
python -m py_compile gateway/platforms/msgraph_webhook.py

Results:

  • 23 passed
  • ruff: All checks passed!
  • py_compile: passed

Remote CI note:

  • Most GitHub checks are passing on the current head.
  • The test check is currently failing and still needs separate CI triage; the PR-body update does not claim full remote CI is green.

Disclosure notes

  • Claims are bounded to the Microsoft Graph webhook listener and notification authentication path.
  • This PR does not claim direct code execution, data disclosure, or bypass of Microsoft Graph itself.
  • No real Graph tenant IDs, webhook URLs, secrets, or production payloads are included.
  • Public PR text intentionally omits usage accounting, local research-note paths, and automation telemetry.

@Hinotoi-agent

Copy link
Copy Markdown
Contributor Author

CI note: the ruff + ty diff job reached the final "Post / update PR comment" step and failed with Resource not accessible by integration while trying to create a PR comment from the fork workflow token.

The actual lint enforcement job passed, and the diff-report step in that same job reported no Ruff findings. The type-check diagnostics are advisory and the workflow log labels them as warnings / non-blocking.

Local validation for this patch:

python -m pytest tests/gateway/test_msgraph_webhook.py -q
python -m ruff check gateway/platforms/msgraph_webhook.py tests/gateway/test_msgraph_webhook.py
python -m py_compile gateway/platforms/msgraph_webhook.py

Result: 23 passed; ruff passed.

@alt-glitch alt-glitch added type/security Security vulnerability or hardening P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 9, 2026
@Hinotoi-agent Hinotoi-agent force-pushed the security/msgraph-webhook-require-client-state branch 2 times, most recently from e695014 to 444d4b1 Compare May 13, 2026 03:38
@memosr

memosr commented May 17, 2026

Copy link
Copy Markdown
Contributor

Nice catch on the fail-open _verify_client_state — definitely the
more dangerous half. While reviewing the same module I noticed a
complementary issue you might want to fold in (or I can open as a
separate follow-up if you prefer):

DEFAULT_HOST = "0.0.0.0" at line 32 means the webhook listener
binds to all interfaces by default. On a developer laptop on corp
Wi-Fi or any cloud VM with a public IP, that exposes the endpoint
to the network before the client_state check ever runs. Combined
with the original fail-open behavior, anyone on the same subnet
(or the internet) could inject notifications.

Default-localhost (with explicit --host 0.0.0.0 opt-in for
operators behind a tunnel/proxy) would close that gap — same
pattern we used in #19597 (Meet node server) and #21250 (Docker
gateway refuses root).

Happy to either:

  1. Open a small follow-up PR after this lands, or
  2. Let you fold it into this PR if you prefer single-shot review

@Hinotoi-agent

Copy link
Copy Markdown
Contributor Author

Thanks, this makes sense. I agree the fail-open clientState path is the more dangerous bug, but defaulting the listener to 0.0.0.0 leaves an avoidable exposure footgun for laptops / corp Wi-Fi / cloud VMs.

I think the secure-default shape should be:

  • default bind host: localhost / 127.0.0.1
  • explicit host: 0.0.0.0 opt-in for operators who intentionally expose it directly or through an ingress/tunnel
  • docs updated so Graph’s public HTTPS requirement is handled via reverse proxy / tunnel by default, not accidental all-interface binding

I’m treating this as actionable for this PR rather than a separate follow-up, since it’s the same listener trust-boundary story.

@Hinotoi-agent Hinotoi-agent force-pushed the security/msgraph-webhook-require-client-state branch from dfc519e to 8d3fbeb Compare May 23, 2026 09:09
memosr added a commit to memosr/hermes-agent that referenced this pull request May 28, 2026
…event silent built-in tool replacement

The tool_override flag landed in v0.14.0 (NousResearch#26759) so plugins can replace
a built-in tool with their own implementation. It works as advertised
but there is no trust gate, so any enabled third-party plugin can
silently override any built-in like shell_exec, write_file, or web_fetch
and exfiltrate everything the agent invokes through it. The only trace
is a DEBUG-level log line.

Compare with ctx.llm (NousResearch#23194) which does gate the equivalent privilege
escalation: overriding the provider requires
plugins.entries.<id>.llm.allow_provider_override: true in config.yaml.
The policy shape exists, it just was not extended to tool overrides.

Fix:

* Add PluginToolOverrideError(PermissionError) for the gate failure.

* register_tool() now checks _tool_override_allowed(name) when
  override=True. Bundled plugins (manifest.source == 'bundled') are
  trusted by default. Every other source requires
  plugins.entries.<plugin_id>.allow_tool_override: true in config.yaml.

* fail-closed: if config.yaml cannot be loaded for any reason,
  _tool_override_allowed returns False. Same posture as
  MSGraphWebhookAdapter.connect() in NousResearch#22353.

Backwards compatibility:

* Bundled plugins: no change (source == 'bundled' short-circuits the
  gate).
* Third-party plugins not using override: no change (gate is only
  consulted when override=True).
* Third-party plugins using override: registration fails until the
  operator opts in. The error message includes the exact config path
  to add, so the fix is one config edit away for legitimate use cases.
  Same migration path users went through for allow_provider_override
  after NousResearch#23194 landed.

Regression tests:

* tests/hermes_cli/test_plugins.py::test_register_tool_override_replaces_existing
  and ::test_register_tool_override_on_new_name_is_noop_path were
  written before the gate existed. Updated their test configs to
  include allow_tool_override: true under
  plugins.entries.<plugin_id>, mirroring how a legitimate operator
  would now grant the privilege.

* New regression test ::test_register_tool_override_blocked_without_operator_opt_in
  exercises both the PluginManager-catches-error path (built-in tool is
  preserved, attacker plugin is skipped) and the direct-call path
  (PluginToolOverrideError is raised with a message that names the
  config key to set). Verified the test fails without this fix and
  passes with it.

* All 73 tests in test_plugins.py continue to pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/security Security vulnerability or hardening

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants