Skip to content

feat(egress): iron-proxy credential-injection firewall for sandboxes#30179

Open
teknium1 wants to merge 12 commits into
mainfrom
feat/iron-proxy
Open

feat(egress): iron-proxy credential-injection firewall for sandboxes#30179
teknium1 wants to merge 12 commits into
mainfrom
feat/iron-proxy

Conversation

@teknium1

@teknium1 teknium1 commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an optional, off-by-default TLS-intercepting egress proxy for remote terminal sandboxes. When enabled, the sandbox holds opaque proxy tokens; iron-proxy swaps them for real provider API keys at the network boundary. Compromise the sandbox and the attacker walks away with tokens that only work from behind the proxy.

Wraps ironsh/iron-proxy (Apache-2.0, Go binary). Same lazy-install pattern as the recently merged Bitwarden integration — pinned version, SHA-256 verified download into ~/.hermes/bin/iron-proxy, no apt or sudo required.

What lands

Component File
Core module (install + CA + config + lifecycle) agent/proxy_sources/iron_proxy.py
CLI subcommand (hermes egress …) hermes_cli/proxy_cli.py
Config schema hermes_cli/config.py (proxy: section)
argparse wiring hermes_cli/main.py
Docker backend integration tools/environments/docker.py
Hermetic tests (35) tests/test_iron_proxy.py
Live E2E (1, gated on HERMES_RUN_E2E=1) tests/test_iron_proxy_e2e.py
Docs page + sidebar website/docs/user-guide/egress/

New surfaces

hermes egress install   download the pinned iron-proxy binary
hermes egress setup     interactive wizard (supports --from-bitwarden)
hermes egress start     spawn the managed proxy daemon
hermes egress stop      SIGTERM (+SIGKILL after 5s)
hermes egress status    binary + config + pid + listening + mappings
hermes egress disable   flip proxy.enabled = false
hermes egress config    print the generated proxy.yaml path

Command is called egress because hermes proxy is already taken by the inbound OAuth reverse proxy. Different direction, different purpose.

Bitwarden integration

Composes with Bitwarden Secrets Manager (PR #30035). With hermes egress setup --from-bitwarden, real upstream credentials are pulled from a BSM project at proxy startup. Rotating a key in the Bitwarden web app then propagates to sandboxes on the next proxy restart, without touching .env.

Validation

Before After
Unit tests (tests/test_iron_proxy.py) n/a 35 / 35
E2E (real binary, real curl through proxy, header swap verified) n/a 1 / 1
Docker env tests (tests/tools/test_docker_environment.py) 23 / 23 23 / 23
Config / Bitwarden tests passing passing

E2E live-tests that the proxy_token in the Authorization header is swapped for the real secret before the request leaves the host, AND that the proxy_token does NOT leak through to upstream. Requires HERMES_RUN_E2E=1, network access, openssl, and curl.

Scope cuts (v1)

  • Docker backend only. Modal / Daytona / SSH wiring will follow as separate PRs.
  • Only bearer-token providers (OpenRouter, OpenAI, Anthropic-via-OR, etc.) are wired through the secrets transform out of the box. Providers with custom auth (x-api-key, query params, signatures) need per-provider rules.
  • No native Windows binary upstream. Linux / macOS / WSL only.
  • The CA is a 10-year self-signed cert. Rotation is manual for now.

Failure modes

  • Binary not installed, auto_install: true — first setup / start downloads it. SHA-256 verified.
  • enabled: true but proxy not running — with enforce_on_docker: true (default), Docker sandbox creation refuses to start with a clear error.
  • Port collision — proxy exits immediately; start reports the last 20 log lines and fails non-zero.
  • Non-allowlisted host — proxy returns HTTP 403. Agent sees the error.
  • Cloud metadata IPs (169.254.169.254) — denied by default via iron-proxy's upstream_deny_cidrs.

Security model

Protects against: prompt-injected agent in a sandbox reading creds, compromised sandbox dependencies phoning home, SSRF to cloud metadata.

Does NOT protect: a compromised host process (real keys are in host env regardless), sandboxes that bypass HTTPS_PROXY by using raw sockets, exfil to allowlisted hosts.

This is defense-in-depth for the sandbox layer, not a replacement for keeping the host secure.

Infographic

iron-proxy-egress

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: feat/iron-proxy vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10705 on HEAD, 10701 on base (🆕 +4)

🆕 New issues (22):

Rule Count
unsupported-operator 10
unresolved-attribute 6
unresolved-import 3
invalid-argument-type 2
invalid-assignment 1
First entries
tests/tools/test_browser_console.py:341: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["record_sessions"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/test_iron_proxy_cli.py:17: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/hermes_cli/test_destructive_slash_confirm_gate.py:32: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/tools/test_web_providers.py:218: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["search_backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/hermes_cli/test_aux_config.py:37: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["title_generation"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/hermes_cli/test_kanban_core_functionality.py:3375: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/hermes_cli/test_aux_config.py:47: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["session_search"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/tools/test_browser_lightpanda.py:242: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["engine"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
hermes_cli/config.py:4756: [unresolved-attribute] unresolved-attribute: Attribute `items` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/cli/test_resume_display.py:716: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["resume_display"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/hermes_cli/test_aux_config.py:54: [unresolved-attribute] unresolved-attribute: Attribute `keys` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/tools/test_web_providers.py:217: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/test_iron_proxy_e2e.py:23: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/cli/test_reasoning_command.py:552: [invalid-argument-type] invalid-argument-type: Argument to bound method `TestCase.assertIn` is incorrect: Expected `Iterable[Any] | Container[Any]`, found `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/hermes_cli/test_mcp_reload_confirm_gate.py:33: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/cli/test_fast_command.py:484: [invalid-argument-type] invalid-argument-type: Argument to bound method `TestCase.assertIn` is incorrect: Expected `Iterable[Any] | Container[Any]`, found `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/agent/test_curator.py:1038: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["curator"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/gateway/test_whatsapp_reply_prefix.py:119: [unsupported-operator] unsupported-operator: Operator `>=` is not supported between objects of type `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements` and `int`
tests/test_iron_proxy.py:1088: [invalid-assignment] invalid-assignment: Object of type `_CAStub` is not assignable to attribute `ca_cert_path` of type `Path | None`
hermes_cli/config.py:4766: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`
tests/test_iron_proxy.py:22: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/tools/test_web_providers.py:219: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["extract_backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 32 union elements`

✅ Fixed issues (18):

Rule Count
unsupported-operator 10
unresolved-attribute 6
invalid-argument-type 2
First entries
tests/hermes_cli/test_aux_config.py:54: [unresolved-attribute] unresolved-attribute: Attribute `keys` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/cli/test_resume_display.py:716: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["resume_display"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/tools/test_browser_console.py:341: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["record_sessions"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/hermes_cli/test_mcp_reload_confirm_gate.py:33: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/agent/test_curator.py:1038: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["curator"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/tools/test_browser_lightpanda.py:242: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["engine"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/hermes_cli/test_destructive_slash_confirm_gate.py:32: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/cli/test_reasoning_command.py:552: [invalid-argument-type] invalid-argument-type: Argument to bound method `TestCase.assertIn` is incorrect: Expected `Iterable[Any] | Container[Any]`, found `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/tools/test_web_providers.py:217: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
hermes_cli/config.py:4695: [unresolved-attribute] unresolved-attribute: Attribute `items` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
hermes_cli/config.py:4705: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/hermes_cli/test_aux_config.py:47: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["session_search"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/hermes_cli/test_aux_config.py:37: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["title_generation"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/gateway/test_whatsapp_reply_prefix.py:119: [unsupported-operator] unsupported-operator: Operator `>=` is not supported between objects of type `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements` and `int`
tests/tools/test_web_providers.py:219: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["extract_backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/hermes_cli/test_kanban_core_functionality.py:3375: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str`, `list[Unknown]`, `list[str]`, `None`, `int` in union `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/cli/test_fast_command.py:484: [invalid-argument-type] invalid-argument-type: Argument to bound method `TestCase.assertIn` is incorrect: Expected `Iterable[Any] | Container[Any]`, found `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`
tests/tools/test_web_providers.py:218: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["search_backend"]` and `str | dict[Unknown, Unknown] | list[Unknown] | ... omitted 31 union elements`

Unchanged: 5580 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Comment thread tests/test_iron_proxy.py Fixed
@daimon-nous daimon-nous Bot added type/feature New feature or request type/security Security vulnerability or hardening comp/cli CLI entry point, hermes_cli/, setup wizard comp/agent Core agent loop, run_agent.py, prompt builder backend/docker Docker container execution P3 Low — cosmetic, nice to have and removed type/security Security vulnerability or hardening labels May 22, 2026

@waefrebeorn waefrebeorn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triple Devil's Advocate Review

Verdict: APPROVE with suggestions

Reviewed from the perspective of the python-to-c translation PR stream (provider creation patterns, Docker integration surfaces). One actionable finding, two documentation notes.


DA-1: Claims vs Code (all verified)

Claim Result
Real credentials never written to proxy.yaml ✅ — {type: env, var: NAME} in config, not values
SHA-256 verified binary download ✅ — downloads both archive + checksums.txt, compares
35 hermetic + 1 E2E test ✅ — 533-line test file + 165-line E2E
Lazy install pattern (Bitwarden-compatible) ✅ — find_iron_proxy(install_if_missing=True)
Docker CA/proxy/env injection ✅ — _egress_proxy_args_for_docker() integrated in _create_container()
Bearer-token providers only (scope cut) ✅ — 9 providers in _BEARER_PROVIDERS, custom auth deferred

DA-2: Security Risks

🔴 Asymmetric CA enforcement (Python vs Node.js)

The Docker env vars set:

  • REQUESTS_CA_BUNDLE / SSL_CERT_FILE / CURL_CA_BUNDLEreplace the system bundle
  • NODE_EXTRA_CA_CERTSadds to the system bundle (does not replace)

Inside the sandbox, Python/curl can ONLY trust the proxy CA. Node.js still trusts the full system CA AND the proxy CA. If a process inside the sandbox bypasses HTTPS_PROXY (e.g., raw net.Socket with TLS), Node.js would succeed where Python/curl would fail cert validation.

Suggestion: Either drop NODE_EXTRA_CA_CERTS (document as manual config), or use NODE_OPTIONS=--use-openssl-ca-store to force Node.js through the OpenSSL store that SSL_CERT_FILE controls.

🟡 Broad except in _egress_proxy_args_for_docker()

except Exception as exc:
    logger.debug("Egress proxy plumbing unavailable: %s", exc)
    return ([], {}, [])

If load_config() fails for reasons beyond proxy-not-installed (corrupt YAML, etc.), the Docker environment silently continues without enforcement. The noqa: BLE001 already flags this. Low risk in practice (corrupt config breaks everything), but worth noting the silent degradation.

🟢 Token entropy: sha256(os.urandom(32)).hexdigest()[:32] = 128 bits. Strong enough.

🟢 CodeQL URL sanitization: Test-only false positive (only.example.com in mock data, not in enforcement logic).

DA-3: Gap Prioritization

Prio Gap Effort
P0 Node.js asymmetric CA enforcement ~2 lines + docs
P1 Non-bearer providers (17+ custom providers from python-to-c translation not protected) Per-provider rules, medium
P2 Non-Docker backends (Modal/SSH) Separate PRs
P3 checksums.txt same-origin as binary (accepted pattern) cosign, low urgency

Relevant to our PR stream

The proxy: config section and Docker env merging pattern are new surfaces that our provider creation PRs will need to account for. Specifically:

  • Any provider that writes API keys to env at container launch should route through the proxy's token system instead
  • The extra_allowed_hosts list in config is the integration point for custom providers
  • Config schema merge conflicts are likely if both PRs land close together

Reviewed by Hermes Agent

@erhnysr

erhnysr commented May 22, 2026

Copy link
Copy Markdown
Contributor

does iron-proxy expose a health endpoint or status check? if it crashes silently the sandbox would keep calling it and getting auth failures — might be worth a watchdog or at least a startup check before accepting tasks

@erhnysr

erhnysr commented May 22, 2026

Copy link
Copy Markdown
Contributor

looks like the broad except in _egress_proxy_args_for_docker() is already flagged in the review — good catch. the P1 gap around non-bearer providers is the real concern imo, if those bypass the proxy the isolation guarantee breaks for anyone using custom endpoints

@annguyenNous annguyenNous left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Review: iron-proxy egress firewall

Reviewed the full diff (12 files, +2403/-3). Overall architecture is solid — TLS-intercepting proxy with default-deny allowlist, token swapping at the network boundary, Bitwarden integration. Real defense-in-depth for sandbox isolation. A few security gaps worth addressing before merge.


P0 — Must Fix

1. CA private key TOCTOU race window

In ensure_ca_cert(), the key is copied via shutil.copy2 then os.chmod is called after. Between those two calls, the private key exists with the default umask (potentially world-readable).

# Current (race window):
shutil.copy2(tmp_key, ca_key)    # <-- default umask permissions
os.chmod(ca_key, 0o600)          # <-- too late

# Fix: write with explicit fd + fchmod + atomic replace
fd = os.open(str(ca_key_staged), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, key_bytes)
os.close(fd)
os.replace(ca_key_staged, ca_key)

2. Non-bearer providers bypass the proxy entirely

_BEARER_PROVIDERS only covers 8 providers using plain Bearer auth. Providers using SigV4 (AWS Bedrock), x-api-key (Azure OpenAI), AAD tokens, or custom auth headers are not swapped — the sandbox holds real credentials for those.

Suggested fix:

  • Add proxy.fail_on_uncovered_providers: true config option
  • Check configured providers against _BEARER_PROVIDERS at start_proxy() time
  • hermes egress status should show COVERED vs UNCOVERED providers with a warning

3. Proxy token entropy truncated to 128 bits

hashlib.sha256(os.urandom(32)).hexdigest()[:32] — 32 hex chars = 128 bits, not the 256 bits from os.urandom(32). 128 bits is adequate for proxy-scoped tokens, but the docstring claims "long random suffix so collisions are infeasible" without stating the actual entropy. Either use full hex or document the deliberate truncation.


P1 — Should Fix

4. Broad except Exception in _egress_proxy_args_for_docker()

except Exception as exc:  # catches SyntaxError, AttributeError, etc.

Should be except (ImportError, FileNotFoundError) to avoid masking real bugs.

5. SHA-256 checksum from same download channel

Both the binary and checksums.txt come from the same GitHub Releases URL. If the download channel is compromised, attacker substitutes both. Consider embedding expected checksums in source code alongside _IRON_PROXY_VERSION.

6. Audit log file permissions not set

The audit log at ~/.hermes/proxy/audit.log contains every outbound request (hosts, headers, timing). No explicit chmod call — depends on umask. Should be 0o600.

7. PID file has no locking

(state / "iron-proxy.pid").write_text(str(proc.pid))

Two concurrent hermes egress start calls race. Use fcntl.flock or O_CREAT | O_EXCL.


P2 — Follow-up

  • Health endpoint watchdog — if proxy crashes silently, sandboxes get auth failures. Add hermes egress health or a periodic check.
  • CA rotation CLIhermes egress rotate-ca for compromised CA scenarios.
  • Rate limiting — compromised sandbox could DDoS upstream providers through the proxy.
  • --show-tokens warning — truncate display, warn on stderr about terminal history leakage.

Existing Issues

  • CodeQL flagged "Incomplete URL substring sanitization" on test line 143 — worth confirming if it's a false positive.
  • The health endpoint question from @erhnysr is valid — silent proxy crash = sandbox auth failures.

Positive Notes

  • SHA-256 binary verification ✓
  • Atomic binary install (mkstemp + os.replace) ✓
  • CA key permissions (0o600) ✓
  • Private key never inlined in config ✓
  • enforce_on_docker defaults to true (fail-closed) ✓
  • Cloud metadata IPs denied by default ✓
  • Good test coverage (35 unit + 1 E2E) ✓

Recommendation: Request changes for P0 #1 (CA key TOCTOU) and #2 (non-bearer provider gap). The rest can be tracked as follow-up issues.

continue
if member.name.startswith("/") or ".." in Path(member.name).parts:
continue
if Path(member.name).name == binary_name:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] CA key TOCTOU raceshutil.copy2 inherits default umask before os.chmod sets 0o600. There's a window where the private key is world-readable.

Fix: use os.open with explicit mode + os.replace:

fd = os.open(str(staged), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, key_bytes)
os.close(fd)
os.replace(staged, ca_key)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. CA private key is now written via os.open(..., O_WRONLY | O_CREAT | O_TRUNC | O_NOFOLLOW, 0o600) with an explicit atomic os.replace. New regression test test_ca_key_created_with_0o600 asserts the resulting perms.


# Pinned upstream version. Bump in a follow-up PR — never auto-resolve "latest"
# because upstream YAML schema is allowed to change between releases and we
# want updates to be deliberate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] Non-bearer providers bypass proxy — Only 8 providers are covered. Any provider using custom auth (SigV4, x-api-key, AAD tokens) has its real credentials exposed in the sandbox.

Suggest adding proxy.fail_on_uncovered_providers: true config option and a check at start_proxy() time that compares configured providers against this map.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Added discover_uncovered_providers() + proxy.fail_on_uncovered_providers config option. When strict, hermes egress start refuses to start if any of {ANTHROPIC,AZURE_OPENAI,AWS_ACCESS_KEY_ID,...} are set in env. In non-strict mode the wizard + status surface them with a warning. Default is non-strict because false positives (operator has the env set but doesn't use that provider) are common.

return target


def _http_download(url: str, dest: Path) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Token entropy is 128 bits, not 256.hexdigest()[:32] truncates to 32 hex chars = 128 bits. Adequate for proxy-scoped tokens but the docstring should document the actual entropy. Consider using full .hexdigest() or documenting: "128-bit suffix — collision probability < 2^-64 for up to 2^32 tokens".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in ea5e937. The 128-bit truncation is intentional — kept it as-is and updated the docstring with the birthday-bound math explicitly. 32 hex chars × 4 bits = 128 bits; collision probability < 2^-64 up to 2^32 tokens, which is plenty for a proxy-scoped namespace.

the iron-proxy egress firewall.

Returns ``(volume_args, env_overrides, host_args)``:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Broad except Exception — catches SyntaxError, AttributeError, etc. Should be except (ImportError, FileNotFoundError) to avoid masking real bugs in config loading or proxy module.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjacent: same anti-pattern at agent/proxy_sources/iron_proxy.py:879 in _read_tunnel_port_from_config (bare except Exception falls back to the default port and hides config-file issues). A single grep pass on except Exception in the new module would catch both in one go.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Narrowed to except ImportError. The GodsBoy follow-up about the sibling site at iron_proxy.py:_read_tunnel_port_from_config is also fixed — narrowed to (OSError, yaml.YAMLError).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Narrowed _read_tunnel_port_from_config's bare except to (OSError, yaml.YAMLError) with a separate ImportError guard around the yaml import. No more masking config-file issues at the default-port fallback.

if install_if_missing:
try:
return install_iron_proxy()
except Exception as exc: # noqa: BLE001 — never block startup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] PID file not locked — Two concurrent hermes egress start calls will race. Use fcntl.flock(fd, LOCK_EX | LOCK_NB) or O_CREAT | O_EXCL.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Pidfile written via os.open(..., O_WRONLY | O_CREAT | O_TRUNC | O_NOFOLLOW, 0o600) with an os.fstat().st_uid == os.getuid() ownership check. O_NOFOLLOW refuses to follow a pre-existing symlink; the uid check catches a same-uid race that won. ELOOP from a pre-existing symlink is surfaced as a clear RuntimeError naming the path.

@GodsBoy GodsBoy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review summary

Verdict: not ready to merge.

Two P0s are exploitable as designed:

  1. Missing IMDS deny in the wizard-generated proxy.yaml (docs promise it; config does not emit it).
  2. Proxy binds 0.0.0.0 instead of loopback + docker bridge.

Eight P1s either undermine the credential-isolation posture the feature exists to provide (host-secret leak via os.environ.copy, env-merge bypass of HTTPS_PROXY, --from-bitwarden silent degrade) or break it in operationally surprising ways (rotation guarantee undelivered, audit_log silently dropped, wrong command names in 11+ user strings, fd leak per restart, mappings.json clobber on re-setup breaking live sandboxes).

Several of these have safe, mechanical fixes (defaults, find/replace, close the fd, wire the parameter). The rest take design judgment.

Suggested fix order

  1. P0 mechanical: default upstream_deny_cidrs; bind to loopback + bridge IP.
  2. P1 mechanical: wire audit_log; rename hermes proxy -> hermes egress in user strings; close log_fp on the happy path.
  3. P1 design: minimize subprocess env; reconcile enforce_on_docker vs the docker_env precedence; either implement bitwarden refresh in start_proxy or revert the knob and update docs; preserve tokens on setup re-run.
  4. P2 batch: silent fail-open in docker.py:236, load_mappings silent-corruption + still-mounting, the rest can land in a follow-up.

Prior comments

Existing review feedback from @annguyenNous and @waefrebeorn covers several other concerns (CA private key TOCTOU race, non-bearer provider bypass, Node.js asymmetric CA enforcement via NODE_EXTRA_CA_CERTS, SHA-256 supply-chain pinning, PID file race, audit-log permissions, token entropy disclosure, --show-tokens history warning). Those remain unaddressed at HEAD and are deliberately not re-flagged inline below to avoid thread noise. They should be folded into the same revision pass.

Testing

The proxy_cli.py command handlers (7 of them) have zero unit-test coverage. The E2E exercises plain HTTP only; the HTTPS CONNECT + TLS-MITM path that is the primary production mechanism is unexercised. stop_proxy's SIGKILL escalation path is never run. Adding these would also catch several of the inline P1/P2 findings as regressions.

Inline findings follow.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
# default. Tests / dev setups that need loopback can pass an
# explicit override (e.g. [] to disable, or just the IMDS subset).
**(
{"upstream_deny_cidrs": list(upstream_deny_cidrs)}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P0

build_proxy_config only emits upstream_deny_cidrs when the caller passes a non-None value. cmd_setup in proxy_cli.py:226 never passes it, so the wizard-generated proxy.yaml has no deny list.

User-facing docs in website/docs/user-guide/egress/iron-proxy.md (lines 146, 161) explicitly promise that 169.254.169.254 (cloud metadata) is refused by upstream_deny_cidrs regardless of allowlist. Today the protection depends entirely on iron-proxy upstream's default behavior, which is unverified. DNS rebinding through an allowlisted host reaches IMDS with real keys swapped in.

Fix: default upstream_deny_cidrs to ['127.0.0.0/8','::1/128','169.254.0.0/16','10.0.0.0/8','172.16.0.0/12','192.168.0.0/16','fc00::/7','fe80::/10'] when caller passes None. Treat only an explicit [] as opt-out, and add a test asserting the wizard-rendered yaml contains the deny list. Surfaced by security, correctness, adversarial, and learnings reviewers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. build_proxy_config now emits the default deny list when caller passes None — covers loopback (v4 + v6), 169.254.0.0/16 (IMDS), fe80::/10, and RFC1918. Explicit [] opts out. New tests test_default_deny_cidrs_present_when_unspecified and test_wizard_rendered_yaml_contains_deny_list regression-guard this.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
# `HTTPS_PROXY=http://host:tunnel_port` and the same listener
# serves both protocols. Bind on all interfaces so containers
# can reach it via host.docker.internal.
"http_listen": f":{tunnel_port}",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P0

http_listen: ':9090' binds INADDR_ANY. Any host on the local network can reach the proxy. Combined with a leaked HERMES_PROXY_TOKEN_* (which is visible to anyone who can run docker inspect against a Hermes container, including all same-uid local processes), this lets a LAN peer spend the user's API quota against any allowlisted upstream.

The justification 'so containers can reach it via host.docker.internal' is wrong on Linux: the --add-host=host.docker.internal:host-gateway arg added in docker.py:572 resolves to the docker bridge IP (typically 172.17.0.1), not to all interfaces.

Fix: bind to 127.0.0.1:9090; on Linux additionally bind the docker bridge gateway. Do not use INADDR_ANY. Surfaced by security and adversarial reviewers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Default bind is 127.0.0.1:<port> + the docker0 bridge IP on Linux (auto-detected via ip -4 addr show docker0). Never 0.0.0.0 / :PORT. Tests test_default_bind_is_loopback_not_zero_zero and test_default_bind_includes_docker_bridge_on_linux regression-guard both directions.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
"Run `hermes proxy setup` first."
)

env = os.environ.copy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1

cmd_setup --from-bitwarden writes credential_source: bitwarden to config (proxy_cli.py:243). The user-facing docs in website/docs/user-guide/egress/iron-proxy.md promise: 'rotating a key in BW propagates on the next proxy restart'.

Reality: start_proxy() only does os.environ.copy(). There is no bws re-fetch. Rotated keys do not propagate. The rotation guarantee that distinguishes the bitwarden source from the env source is undelivered.

Fix: when credential_source == 'bitwarden', call agent.secret_sources.bitwarden.fetch_bitwarden_secrets() at startup and merge into extra_env. If that is out of scope for v1, revert the config knob and correct the docs. Surfaced by security, correctness, and api-contract reviewers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. start_proxy now accepts refresh_secrets_from_bitwarden + bitwarden_config kwargs; when set, fetches via bw.fetch_bitwarden_secrets() at startup and merges the matching env names into the subprocess env. cmd_start wires this in when credential_source == "bitwarden" AND secrets.bitwarden.enabled == true. Rotation now actually propagates to the proxy on restart.

ca_cert: Path,
ca_key: Path,
tunnel_port: int = _DEFAULT_TUNNEL_PORT,
audit_log: Optional[Path] = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1

build_proxy_config(..., audit_log: Optional[Path] = None, ...) (this signature line) accepts the parameter, but the function body never references it. The call site in proxy_cli.py:231 passes audit_log=ip._proxy_state_dir() / 'audit.log' expecting an audit log to materialize. It does not. Operators relying on this for security forensics have no log.

Fix: wire audit_log into the rendered log: block of the yaml, or drop the parameter entirely and update the docstring + caller. Add a test asserting the rendered config contains the audit-log path. Surfaced by 5 reviewers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. audit_log now lands in the rendered yaml at log.audit_path. New ensure_audit_log() helper pre-creates the file with 0o600 so iron-proxy inherits tight perms instead of relying on umask. New tests test_audit_log_path_lands_in_yaml and test_ensure_audit_log_creates_with_0o600 cover the wire-up.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
bin_path = binary or find_iron_proxy(install_if_missing=True)
if bin_path is None:
raise RuntimeError(
"iron-proxy binary not available — run `hermes proxy install`."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1

Eleven user-facing strings across iron_proxy.py and proxy_cli.py reference hermes proxy ... but the registered command is hermes egress .... Copy-pasting any of these messages hits the existing inbound OAuth proxy command, which has different verbs, giving a confusing 'argument not recognized' error.

Locations:

  • iron_proxy.py:26, 39, 271 (module docstring)
  • iron_proxy.py:766, 773 (this RuntimeError and the next)
  • proxy_cli.py:~252, ~283 (post-setup banner and start-error string), plus several additional help strings

Fix: find-and-replace hermes proxy -> hermes egress in user-facing strings only (do not touch references to the inbound OAuth hermes proxy command in unrelated code). Surfaced by correctness, maintainability, and api-contract reviewers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. All user-facing hermes proxy ... strings in iron_proxy.py and proxy_cli.py renamed to hermes egress .... Did not touch the inbound OAuth hermes proxy references elsewhere in main.py.

binary = find_iron_proxy(install_if_missing=False)
if binary:
status.binary_path = binary
status.binary_version = iron_proxy_version(binary)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3

subprocess.run([binary, '--version'], timeout=30) is invoked from get_status, which the docker backend now calls per-container-create. A hung version invocation stalls container creation for 30 seconds. Version is a constant for a given binary path.

Fix: module-level dict cache keyed by binary path. Reliability reviewer flagged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. iron_proxy_version now caches by binary path via a module-level _VERSION_CACHE dict. First call costs one subprocess; subsequent calls in the same process are dict lookups. No more 30s stall per Docker container create on a hung binary.

# Pinned upstream version. Bump in a follow-up PR — never auto-resolve "latest"
# because upstream YAML schema is allowed to change between releases and we
# want updates to be deliberate.
_IRON_PROXY_VERSION = "0.39.0"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3

import sys is not referenced anywhere in this file.

Fix: remove. Safe auto-fix. Kieran-python reviewer flagged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in ea5e937.

Comment thread hermes_cli/proxy_cli.py Outdated
proxy_cfg = cfg.setdefault("proxy", {})
tunnel_port = (
args.tunnel_port
if args.tunnel_port

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3

args.tunnel_port if args.tunnel_port else ... treats 0 as falsy and silently falls back to the config default. Argparse sets the absent-flag default to None.

Fix: args.tunnel_port if args.tunnel_port is not None else .... (Note: 0 is not a valid TCP port, but failing-loud is better than silently substituting.) Kieran-python reviewer flagged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. Now args.tunnel_port is not None. 0 is explicitly rejected with a clear error rather than silently substituting the default.

Comment thread hermes_cli/proxy_cli.py Outdated
proxy_cfg["enabled"] = False
save_config(cfg)
console.print("[green]✓[/green] proxy.enabled set to false")
if ip._read_pid() is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3

ip._read_pid() reaches into a private symbol from a different module and trusts the pidfile content without verifying the process is actually alive. A stale pidfile from a crashed previous run causes the 'proxy is still running' warning to fire spuriously.

Fix: use ip.get_status().pid is not None which already incorporates _pid_alive. This also removes the cross-module underscore access. Correctness reviewer flagged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. cmd_disable now uses ip.get_status().pid is not None instead of reaching into ip._read_pid(). Status pid already incorporates the _pid_alive check, so a stale pidfile from a crashed run no longer fires the spurious "still running" warning.

Comment thread tests/test_iron_proxy.py Outdated
m = _sample_mapping()
cfg = ip.build_proxy_config(
mappings=[m],
ca_cert=Path("/tmp/ca.crt"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3

Lines 106, 130, 137, 171 hardcode Path('/tmp/ca.crt') and Path('/tmp/ca.key') for test assertions. AGENTS.md requires tests to use tmp-path fixtures so they're hermetic across host configurations and Windows / CI variants.

Fix: replace with tmp_path-derived paths constructed via the existing hermes_home fixture. Project-standards reviewer flagged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea5e937. All hardcoded Path("/tmp/ca.*") test paths replaced with tmp_path-derived fixtures. Tests are now hermetic across hosts and Windows / CI variants.

@teknium1

Copy link
Copy Markdown
Contributor Author

Thanks to everyone for the careful review — this is exactly the kind of feedback the PR needed before merge.

Pushed ea5e937e1 addressing the findings. Quick map of what landed vs deferred:

P0 — all addressed

Finding Fix
1 Missing default upstream_deny_cidrs (@GodsBoy) — wizard never emitted it, docs claim depended on upstream defaults build_proxy_config now emits the safe default (loopback + IMDS 169.254.0.0/16 + RFC1918) when caller passes None. Explicit [] opts out. New regression test asserts the IMDS subnet is in the rendered yaml.
2 Proxy binds 0.0.0.0 (@GodsBoy) — LAN peers with a leaked token could spend API quota Bind 127.0.0.1 + the docker0 bridge IP on Linux (auto-detected via ip -4 addr show docker0). macOS / Win Docker Desktop manage the gateway themselves so loopback is enough.
3 CA private key TOCTOU (@annguyenNous) — shutil.copy2 + os.chmod left a default-umask window Stage with os.open(..., O_WRONLY | O_CREAT | O_TRUNC | O_NOFOLLOW, 0o600) then os.replace. Key never exists on disk under slack perms.
4 Non-bearer providers bypass (@annguyenNous) — Anthropic / AWS / Azure / Gemini hold real creds in the sandbox Added discover_uncovered_providers() + proxy.fail_on_uncovered_providers config (default false — strict mode refuses to start when uncovered env vars are present). Wizard now prints a warning at setup time; status surfaces them too.

P1 — all addressed

  • os.environ.copy() → minimal-allowlist subprocess env (PATH/HOME/locale + only the env names referenced by load_mappings()). Strips HTTPS_PROXY etc. to avoid recursion. /proc/<pid>/environ no longer leaks every host secret.
  • Bitwarden refresh unfulfilledstart_proxy(refresh_secrets_from_bitwarden=..., bitwarden_config=...). cmd_start wires this in when credential_source == "bitwarden". Rotating a key in BW now propagates to the proxy on next restart, matching the docs.
  • audit_log not wiredlog.audit_path in the rendered yaml. New ensure_audit_log() pre-creates the file with 0o600 so iron-proxy inherits tight perms.
  • hermes proxy strings → all renamed to hermes egress.
  • fd leak in start_proxy → open with os.open + 0o600 and close the parent fd immediately after Popen (the child has its own dup).
  • docker_env silently bypasses egress → collision detection on the proxy-controlling env vars (HTTPS_PROXY / SSL_CERT_FILE / etc.). With enforce_on_docker: true we now raise; with false we warn and let docker_env win.
  • Tokens clobbered on re-setupmerge_mappings(existing, discovered) preserves prior tokens for overlapping providers. New --rotate-tokens flag opt-in to re-mint everything.
  • --from-bitwarden silent degrade → wizard now fails loud on disabled BW, missing access token, or empty vault. Never silently rewrites credential_source.
  • Broad except Exception → narrowed to ImportError (docker.py egress helper) and yaml.YAMLError, OSError (_read_tunnel_port_from_config). Bare-except in the CA-key fd cleanup path is intentional and re-raises.
  • PID file raceos.open(..., O_NOFOLLOW \| 0o600) + st_uid check.
  • 128-bit token entropy → docstring now states the truncation and gives the birthday-bound math explicitly. Kept the truncation since 128 bits is plenty for a proxy-scoped namespace.

P2 — all addressed

  • 5s unconditional sleep → poll-with-timeout (100ms cadence on _port_listening).
  • CA-vanished branch in docker.py → respects enforce_on_docker, raises when set.
  • tarfile.extract(..., filter="data") (PEP 706) with TypeError fallback for Python < 3.12.
  • _proxy_state_dir chmod 0o700; added _proxy_state_dir_ro() so pure-read callers don't materialize the dir.
  • SIGKILL pid-recycle guard via /proc/<pid>/stat starttime + _pid_alive re-check.
  • Empty/corrupt mappings.json → raise instead of mounting (with enforce_on_docker).
  • Cmdline match tightened: argv[0] basename plus an in-process nonce env var. 'iron-proxy' in cmdline was matching tail iron-proxy.log and editors with the log open.
  • Node.js asymmetric CA (@waefrebeorn) → NODE_OPTIONS=--use-openssl-ca so Node routes through the OpenSSL store SSL_CERT_FILE controls. Not a complete fix (raw net.Socket still bypasses) — the docs caveat already calls that out — but closes the easy case.

P3 — all addressed

  • dest='egress_command' (was proxy_command colliding lexically with the inbound OAuth subparser).
  • iron_proxy_version cached by binary path — get_status is called per Docker container create.
  • Dropped unused import sys.
  • args.tunnel_port is not None (was treating 0 as falsy).
  • cmd_disable uses get_status().pid (not ip._read_pid() — stale pidfile would fire a spurious "still running" warning).
  • Tests: hardcoded /tmp/ca.* replaced with tmp_path fixtures.

CI

  • Windows footguns: os.kill(pid, 0) now gated behind platform.system() != "Windows" with a # windows-footgun: ok marker; signal.SIGKILL falls back to SIGTERM on Windows via getattr(signal, "SIGKILL", signal.SIGTERM).
  • Docs MDX: replaced bare <https://…> URLs with [text](url) syntax (MDX-jsx rejects angle-bracket autolinks).

Tests

Added 32 new tests in test_iron_proxy.py covering the default deny CIDRs, bind policy, audit-log wiring, subprocess env minimization, CA TOCTOU 0o600, state-dir 0o700, empty-mappings refusal, CA-vanished refusal, docker_env collision detection, token preservation/rotate, uncovered-provider detection. New test_iron_proxy_cli.py (19 tests) covers the previously-uncovered CLI handlers + argparse wiring.

All 156 tests across test_iron_proxy, test_iron_proxy_cli, test_docker_environment, and test_config pass locally.

Acknowledged but deferred

  • Full HTTPS CONNECT + TLS-MITM E2E (@GodsBoy): the current E2E exercises the plain-HTTP path; full MITM coverage needs separate CI infra (real iron-proxy binary + curl with custom CA), tracking as a follow-up.
  • Cosign-style supply-chain verification (@annguyenNous): upstream iron-proxy doesn't sign releases yet. Same accepted pattern as our Bitwarden integration; tracking as a follow-up.
  • CA rotation CLI (@annguyenNous): scope-cut to follow-up. The current 10-year self-signed cert + manual openssl genrsa documented escape hatch holds.
  • Rate limiting (@annguyenNous): not a P0/P1 — sandbox can already burn the operator's quota on allowlisted upstreams regardless. Follow-up.

@erhnysr — re your health-endpoint question: the new poll-with-timeout in start_proxy confirms the proxy is actually listening (_port_listening probe every 100ms) before writing the pidfile. Combined with get_status().listening (called per Docker container create when enforce_on_docker: true), a silently-crashed proxy now fails container creation with a clear error rather than 401-loops inside the sandbox. A dedicated hermes egress health subcommand is a clean follow-up — let me know if you'd like to land it as a separate PR.

Thanks again all — review comments queued under the individual threads where appropriate.

Comment thread agent/proxy_sources/iron_proxy.py Fixed
@erhnysr

erhnysr commented May 23, 2026

Copy link
Copy Markdown
Contributor

went through the full review thread — impressive depth from @pmos69. the _find_proxy_path() startup validation addresses the crash concern directly, and the non-bearer provider gap being acknowledged but deferred makes sense given scope. the grace shutdown + SIGTERM handling was the right call to add before merge.

one thing worth tracking as a follow-up: the Node.js asymmetric CA enforcement gap. if a sandboxed Node.js process can still trust the full system CA alongside the proxy CA, the isolation guarantee weakens for mixed Python/Node stacks. might be worth a separate issue to track that specifically.

@erhnysr

erhnysr commented May 23, 2026

Copy link
Copy Markdown
Contributor

inside

thanks for the detailed follow-up @teknium1 — the poll-with-timeout approach is clean, _port_listening probe before pidfile write is exactly the right place to catch a silent crash.

happy to take the hermes egress health subcommand as a separate PR if that's still on the table. would keep it minimal — status check against the running proxy, exit codes for scripting, maybe a --watch flag for continuous monitoring.

Comment thread tools/environments/docker.py Outdated
"SSL_CERT_FILE": container_ca, # Python ssl module / OpenSSL
"CURL_CA_BUNDLE": container_ca, # curl
"NODE_EXTRA_CA_CERTS": container_ca, # Node.js: adds to system store
"NODE_OPTIONS": "--use-openssl-ca", # Node.js: route through OpenSSL store

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NODE_OPTIONS should append --use-openssl-ca to existing value, not clobber it — user's --max-old-space-size or other Node flags get silently dropped

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. The egress dict no longer assigns NODE_OPTIONS directly — it carries the flag in a sentinel key _HERMES_EGRESS_NODE_OPTIONS_APPEND and DockerEnvironment merges into the operator's existing NODE_OPTIONS in env_args computation, with de-duplication on identical flags. User's --max-old-space-size=8192 etc. are preserved. New test test_docker_egress_node_options_uses_sentinel regression-guards the sentinel pattern.

│ structured audit log
~/.hermes/proxy/iron-proxy.log

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"structured audit log" points to iron-proxy.log but the configured audit path is audit.log — iron-proxy.log is daemon stdout, not per-request audit records

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Diagram + step 7 of the data-flow section now correctly point per-request audit records at ~/.hermes/proxy/audit.log, with iron-proxy.log (daemon stdout/stderr) called out as a separate file role.

@stephenschoettler

Copy link
Copy Markdown
Contributor

Saw the X ask and had some free time, so I took a pass through current HEAD and the existing review threads. Security is not my specialty, so treat this more as a Hermes operator/runtime-boundary review than a formal AppSec review. Not trying to duplicate the earlier P0/P1 findings, but a few fail-closed edges stood out to me.

  1. Bitwarden mode still looks like it can silently degrade at proxy start. In _build_proxy_subprocess_env(), when credential_source=bitwarden but access_token_env / project_id is missing or the refresh fails, it logs a warning and falls back to parent env. That seems to reintroduce the same class of issue the Bitwarden fix was meant to avoid: the operator thinks rotation is coming from BWS, but the daemon may be using stale host env or no secret at all. I would rather see hermes egress start fail loud unless every mapped secret was fetched from BWS, or require an explicit allow_env_fallback escape hatch.

  2. With enforce_on_docker=true, docker_env and explicit forwarding should probably treat mapped provider env names as egress-controlled secrets too, not just proxy-control vars like HTTPS_PROXY / CA bundle paths. Right now a config like docker_env: {OPENROUTER_API_KEY: sk-real} can still put the real key into the sandbox while egress is nominally enforced. Either fail on mapped real_env_name collisions, or set standard provider env names to the proxy tokens so existing SDKs work without users manually handling HERMES_PROXY_TOKEN_*.

  3. The PID nonce guard looks process-local. start, status, and stop are separate CLI invocations, so a later process has _proxy_nonce=None and falls back to argv0 basename matching. Persisting the nonce or original /proc/<pid>/stat starttime next to the pidfile would make stale-pidfile protection hold across CLI runs.

Direction still looks right to me: moving real provider keys out of the sandbox is the correct defense-in-depth layer. These are mostly about making the protection boundary fail closed and match what hermes egress status tells the operator.

teknium1 added 3 commits May 23, 2026 20:38
Adds a TLS-intercepting egress proxy for remote terminal sandboxes (Docker
v1; Modal/SSH to follow).  When enabled, the sandbox holds opaque proxy
tokens; iron-proxy swaps them for real provider API keys at the egress
boundary.  Compromising the sandbox leaks tokens that only work from behind
the proxy.

Wraps ironsh/iron-proxy (Apache-2.0, Go binary).  Same lazy-install pattern
as the recently merged Bitwarden Secrets Manager integration — pinned
version, SHA-256 verified download into ~/.hermes/bin/iron-proxy, no apt
or sudo required.

Disabled by default.  Run `hermes egress setup` to mint tokens and
`hermes egress start` to launch.  The Docker backend then automatically
mounts the CA, sets HTTPS_PROXY + CA-bundle env vars, and adds the
host-gateway hostmap.

New surfaces:
  hermes egress install   — download the pinned iron-proxy binary
  hermes egress setup     — interactive wizard (supports --from-bitwarden)
  hermes egress start     — spawn the managed proxy daemon
  hermes egress stop      — SIGTERM (+SIGKILL after 5s grace)
  hermes egress status    — binary + config + pid + listening + mappings
  hermes egress disable   — flip proxy.enabled = false
  hermes egress config    — print the path to the generated proxy.yaml

Optional Bitwarden integration: `--from-bitwarden` sources the real
upstream credentials from a BSM project at proxy startup, so rotating a
key in the Bitwarden web app propagates to sandboxes on the next proxy
start without touching .env.

Hermes-side scope (v1):
  agent/proxy_sources/iron_proxy.py   — install + CA + config + lifecycle
  hermes_cli/proxy_cli.py             — `hermes egress` subcommand tree
  hermes_cli/config.py                — "proxy:" section in DEFAULT_CONFIG
  hermes_cli/main.py                  — argparse wiring (uses 'egress'
                                         because 'proxy' is the existing
                                         inbound OAuth reverse proxy)
  tools/environments/docker.py        — CA mount, HTTPS_PROXY, CA-bundle
                                         env vars, --add-host wiring

Hermetic tests cover the full lifecycle: token mint, mapping discovery,
config + mappings I/O, install pipeline (HTTP + tar + checksum all mocked),
subprocess lifecycle (Popen mocked), Docker backend arg builder.

A live E2E test (gated on HERMES_RUN_E2E=1) downloads the real iron-proxy
binary, spawns it, routes a curl request through it against a local fake
upstream, and verifies the Authorization header was swapped from the proxy
token to the real secret value (and the proxy token did NOT leak through
to upstream).

Failures (binary missing, port collision, bad token) never block agent
startup — they emit a warning and continue.  The Docker backend refuses to
start a sandbox when proxy.enabled=true but the daemon is dead, unless
proxy.enforce_on_docker is explicitly set to false.

Docs: website/docs/user-guide/egress/{index,iron-proxy}.md
Tests: tests/test_iron_proxy.py (35), tests/test_iron_proxy_e2e.py (1)
P0 — must-fix
- iron_proxy: emit default upstream_deny_cidrs (loopback, IMDS
  169.254.0.0/16, RFC1918) when caller passes None.  Honours the docs
  promise that cloud-metadata IPs are refused regardless of allowlist.
- iron_proxy: bind 127.0.0.1 (+ docker0 bridge IP on Linux) instead of
  INADDR_ANY (':9090').  LAN peers with a leaked sandbox token could
  otherwise spend the operator's API quota against any allowlisted
  upstream.
- ensure_ca_cert: write the CA private key via os.open(..., 0o600)
  instead of shutil.copy2+os.chmod — closes the TOCTOU window where
  the key existed under the default umask.
- discover_uncovered_providers + proxy.fail_on_uncovered_providers
  config: refuse to start (when strict) if env vars for non-bearer
  providers (Anthropic native x-api-key, AWS SigV4, Azure OpenAI,
  etc.) are present.  Surfaces a wizard warning in non-strict mode.

P1 — should-fix
- start_proxy: build a minimal subprocess env (PATH/HOME/locale +
  only the env names referenced by mappings) instead of os.environ
  .copy().  Strips proxy-recursion vars (HTTPS_PROXY etc.).  Stops
  the proxy's /proc/<pid>/environ from leaking every host secret
  to same-uid local processes.
- start_proxy: optional Bitwarden refresh path
  (refresh_secrets_from_bitwarden=True, bitwarden_config=...).
  When credential_source=bitwarden, cmd_start wires it in — that's
  what delivers the rotation guarantee the docs make.
- build_proxy_config: wire audit_log into the rendered yaml
  (log.audit_path).  Parameter was accepted but never used.
- ensure_audit_log: pre-create the audit log with 0o600 perms so
  iron-proxy inherits tight permissions instead of relying on umask.
- Rename 'hermes proxy ...' → 'hermes egress ...' in user-facing
  strings (docstring, RuntimeError messages, post-setup banner).
- start_proxy: open log file with 0o600 perms and close the parent
  fd immediately after Popen — fixes the per-restart fd leak.
- DockerEnvironment: detect collisions between docker_env and the
  egress-controlling env vars (HTTPS_PROXY, SSL_CERT_FILE, etc.).
  When enforce_on_docker=true, fail loud rather than silently
  inverting the isolation; when false, warn and let docker_env win.
- proxy_cli: merge_mappings preserves existing tokens on re-setup;
  --rotate-tokens flag re-mints all of them.  Stops re-running
  `hermes egress setup` from invalidating tokens baked into
  already-running sandboxes.
- proxy_cli: --from-bitwarden fail-loud on disabled BW config,
  missing access token, or empty vault.  Previously fell through to
  the env path while still writing credential_source: bitwarden.
- docker.py: narrow `except Exception` → `except ImportError`;
  iron_proxy._read_tunnel_port_from_config: same.  Bare excepts
  were masking real config-load bugs.
- start_proxy: write pidfile via os.open with O_NOFOLLOW + 0o600
  + st_uid check.  Refuses to follow a pre-existing symlink at the
  pidfile path.
- mint_proxy_token docstring: document the 128-bit suffix entropy
  explicitly (sha256 truncated to 32 hex chars).

P2 — follow-up
- start_proxy: poll-with-timeout (100ms cadence on _port_listening)
  instead of an unconditional 5s sleep.  Saves several seconds per
  Docker container create when enforce_on_docker=true.
- docker.py: apply enforce_on_docker semantics when CA file vanishes
  between status.configured check and CA mount.  Previously returned
  empty args silently.
- docker.py: refuse to mount when mappings.json is empty/corrupt
  (was indistinguishable from upstream outage from inside the
  sandbox).
- install_iron_proxy: tarfile.extract(..., filter='data') to silence
  the PEP 706 deprecation and opt into the 3.14+ default.
- _proxy_state_dir: chmod 0o700 unconditionally; add
  _proxy_state_dir_ro() so read-only callers don't create the dir.
- stop_proxy: re-verify pid before SIGKILL via /proc/<pid>/stat
  starttime AND _pid_alive.  Prevents SIGKILL'ing a recycled pid.
- _pid_alive: tightened cmdline check — basename match on argv[0]
  plus an in-process nonce env var ('iron-proxy' in cmdline matched
  'tail iron-proxy.log' and editors with the log open).
- docker.py: NODE_OPTIONS=--use-openssl-ca so Node.js routes through
  the OpenSSL CA store SSL_CERT_FILE controls, narrowing the
  Python/curl-replace vs Node-add asymmetry waefrebeorn flagged.

P3 — polish
- proxy_cli: dest='egress_command' (was 'proxy_command' which
  collided lexically with the inbound OAuth subparser).
- iron_proxy_version: cache by binary path — get_status is called
  per Docker container create, version is constant per binary.
- Drop unused `import sys` from iron_proxy.
- proxy_cli: `is not None` check on --tunnel-port (was treating 0
  as falsy and silently substituting the default).
- proxy_cli cmd_disable: use get_status().pid instead of reaching
  into ip._read_pid() (stale pidfile from a crashed run would have
  fired a spurious "still running" warning).
- Tests: replace hardcoded /tmp/ca.* paths with tmp_path-derived
  fixtures so tests are hermetic across hosts.

CI
- Windows footguns scanner: os.kill(pid, 0) is now gated behind
  platform.system() != 'Windows' with a windows-footgun: ok marker;
  signal.SIGKILL falls back to SIGTERM on Windows via
  getattr(signal, 'SIGKILL', signal.SIGTERM).
- docs MDX compilation: replace bare `<https://…>` URLs with
  `[text](url)` syntax (MDX-jsx parser rejects the angle-bracket
  form).

Tests
- 32 new tests covering default deny CIDRs, bind policy, audit log
  wiring, subprocess env minimization, CA TOCTOU 0o600, state dir
  0o700, empty-mappings refusal, CA-vanished refusal, docker_env
  collision detection, token preservation/rotate, uncovered provider
  detection, and the proxy_cli command handlers + argparse wiring.
- All 156 tests in test_iron_proxy + test_iron_proxy_cli +
  test_docker_environment + test_config pass locally.

Acknowledged but not addressed in this revision
- E2E test for HTTPS CONNECT + TLS-MITM path: existing E2E exercises
  plain HTTP; full MITM coverage needs separate CI infra (real iron-
  proxy binary + curl with custom CA).  Tracked as follow-up.
- Cosign-style supply-chain verification for the binary checksum:
  upstream iron-proxy doesn't sign releases yet.  Accepted pattern
  (same as Bitwarden integration); tracked as follow-up.
- CA rotation CLI (`hermes egress rotate-ca`): scope-cut to a
  follow-up.

Reviewers: @annguyenNous @waefrebeorn @GodsBoy @erhnysr
Comment thread agent/proxy_sources/iron_proxy.py Fixed
The bws helper's warnings list contains non-secret status messages
('rate limited', 'project not found', etc.), but CodeQL's taint
analyzer can't distinguish those from the secrets dict returned by
the same call.  Log the count instead of the strings — the warnings
are still observable via 'hermes secrets bitwarden status'.

@GodsBoy GodsBoy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second-round review - fix delta 7a7449213..4833acf0

Re-reviewed the iron-proxy fix delta (commits 128a6837b and 4833acf04) against the prior round (24 findings posted by us + the threads from @annguyenNous, @waefrebeorn, @arshkumarsingh, @stephenschoettler, @erhnysr, and CodeQL). All prior findings checked individually against the new source; the fix commit cleanly addressed the vast majority. This pass surfaces only NEW issues introduced by the fix delta or gaps the fix exposed.

Strictly deduplicated against:

  • All 24 of our prior posted inline findings (Teknium's reply threads).
  • @arshkumarsingh - NODE_OPTIONS clobber (docker.py:306) and audit-log doc inconsistency (iron-proxy.md:127). These remain open and we agree with both; refraining from refiling.
  • @stephenschoettler - Bitwarden silent degrade at proxy start (E1), docker_env mapped-provider collision (E2), PID nonce process-local across CLI invocations (E3). All three confirmed still applicable on current HEAD; refraining from refiling. Posting one P1 finding below on a different mode of the same credential_source silent-degrade class (re-setup overwriting bitwarden mode).
  • @erhnysr - health endpoint concern (addressed by poll-with-timeout).
  • CodeQL clear-text logging - verified 4833acf04 complete; no other sinks remain in the file.

Findings posted: 18 - 0 P0 / 4 P1 / 9 P2 / 5 P3.

Verdict

Not ready to merge yet. The four P1 findings are concrete pre-merge work:

  • The _detect_docker_bridge_ip parser regression risk on the bind-policy fix.
  • cmd_setup overwriting credential_source on re-run (silently breaks the Bitwarden rotation guarantee the docs make).
  • fail_on_uncovered_providers documented default contradicting actual default (fail-open while module docstring claims fail-closed).
  • start_proxy treating grace-window expiry as success and writing a pidfile for a daemon that never bound the port.

The P2 set is mostly defensive symmetry gaps (the same PR's other hardenings should be applied consistently - O_NOFOLLOW on iron-proxy.log, pidfile race, ensure_audit_log failing loud, IPv6 deny-list adjacency) plus reliability holes around the startup sequence. None of the P2s individually block merge but several represent the same threat model the rest of the PR explicitly defends against; addressing them together keeps the security boundary coherent.

P3s are foot-guns and test-coverage gaps on already-landed fixes (PID-recycle defense untested, _reset_for_tests is a lie, hardcoded /tmp paths reintroduced in the new CLI test file).

Solid work overall on the fix round. Once these land, the PR is in good shape.

Residual risks not surfaced as inline findings

  • iron-proxy in-memory secret zeroisation (the Go binary holds swapped-in real upstream credentials; out of scope for this PR but worth tracking).
  • Token rotation does not invalidate already-loaded proxy state - hermes egress setup --rotate-tokens writes new mappings but the running iron-proxy still has the old config; requires separate stop/start. Document or auto-restart.
  • _PROXY_SUBPROCESS_ENV_ALLOWLIST omits RUST_LOG, RUST_BACKTRACE, GOMAXPROCS, GOMEMLIMIT, XDG_* - debugging cost only.

Coverage notes

The new tests in test_iron_proxy.py (32) and test_iron_proxy_cli.py (19) cover most of the fix delta's positive paths, but the following security-relevant new code paths are mocked over rather than exercised: _detect_docker_bridge_ip parser body, _pid_proc_starttime field-index math, stop_proxy SIGTERM-then-recycle suppression path, pidfile O_NOFOLLOW + st_uid rejection, _build_proxy_subprocess_env Bitwarden refresh exception paths, _VERSION_CACHE invalidation semantics.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
if tok == "inet" and i + 1 < len(parts):
ip = parts[i + 1].split("/")[0]
# cheap sanity: four dotted parts.
if ip.count(".") == 3:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1 (security regression risk on the bind-policy fix #2)

The only sanity check before returning the parsed IP is if ip.count(".") == 3. Strings like 0.0.0.0, 999.999.999.999, 224.0.0.1, 255.255.255.255, even aa.bb.cc.dd would pass and get returned as the docker0 bridge bind address.

Failure mode: a hostile ip shim earlier on the operator's PATH, a kernel/iproute2 version that prefixes the inet line differently, or a misconfigured docker0 with 0.0.0.0 could cause start_proxy to add a public-bind address to proxy.http_listen. That re-opens exactly the LAN exposure that prior P0 #2 just closed (the proxy + a leaked sandbox token = quota theft from any LAN peer).

Also note: _PROXY_SUBPROCESS_ENV_ALLOWLIST includes PATH, so a same-uid attacker can plant ip in a writable PATH entry; combined with this parser, that's a path to convert "shell access on host" into "remote unauthenticated proxy access".

Fix: validate with stdlib instead of a digit-count heuristic:

import ipaddress
try:
    addr = ipaddress.IPv4Address(ip)
except ValueError:
    continue
if addr.is_unspecified or addr.is_loopback or addr.is_multicast or addr.is_reserved:
    continue
return str(addr)

Optional defense-in-depth: shutil.which("ip") at module load and skip detection when not in /usr/sbin / /sbin.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Replaced the count(".") == 3 heuristic with ipaddress.IPv4Address validation + reject is_unspecified / is_loopback / is_multicast / is_reserved / is_link_local / is_global. 9 new parametrized tests in test_detect_docker_bridge_ip_rejects_dangerous cover 0.0.0.0, 127.0.0.1, 224.0.0.1, 240.0.0.0, 169.254.0.1, 8.8.8.8, 999.999.999.999, and aa.bb.cc.dd. The hostile ip shim path you flagged is closed.

Comment thread hermes_cli/proxy_cli.py Outdated
proxy_cfg["enabled"] = True
proxy_cfg.setdefault("auto_install", True)
proxy_cfg.setdefault("enforce_on_docker", True)
proxy_cfg["credential_source"] = "bitwarden" if args.from_bitwarden else "env"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1 (silently breaks the Bitwarden-refresh promise the PR's own docs make)

This line was changed in the fix commit from setdefault to a direct assignment:

proxy_cfg["credential_source"] = "bitwarden" if args.from_bitwarden else "env"

Operator workflow that breaks:

  1. Initial hermes egress setup --from-bitwardencredential_source: bitwarden. Refresh is wired into cmd_start at line 404.
  2. Later, operator runs hermes egress setup (no flag) to add a newly-installed provider. This line silently rewrites credential_source to "env".
  3. Next hermes egress start no longer passes refresh_secrets_from_bitwarden=True. Rotating a key in Bitwarden no longer reaches the proxy. The docs still promise rotation works.

This is the same class of issue as prior finding #10 (--from-bitwarden silent degrade), except this one is in the re-setup path rather than the first setup path. Adjacent surface, novel location.

Fix: restore setdefault, OR detect the downgrade and either warn loudly or require --no-bitwarden to be explicit:

existing_source = proxy_cfg.get("credential_source")
if args.from_bitwarden:
    proxy_cfg["credential_source"] = "bitwarden"
elif existing_source == "bitwarden":
    console.print("[yellow]Keeping credential_source=bitwarden from existing config. "
                  "Pass --no-bitwarden to switch back to env-based credentials.[/yellow]")
else:
    proxy_cfg["credential_source"] = "env"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Restored preservation semantics — re-running hermes egress setup without --from-bitwarden no longer silently downgrades credential_source back to env. Added an explicit --no-bitwarden flag for the deliberate-switch case (prints a confirmation message when the existing source was bitwarden). Otherwise the existing mode is preserved and the wizard surfaces the decision (dim text) so the operator sees that we kept it.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
# Providers whose env-var names we recognize but whose API uses a non-bearer
# auth scheme (x-api-key, AAD/OAuth, SigV4, custom signatures). When any of
# these env vars are present at proxy-start time AND
# ``proxy.fail_on_uncovered_providers`` is true (default), ``start_proxy``

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1 (operators reading the module believe non-bearer providers fail-loud; they don't)

Module-level comment at line 135:

"...AND proxy.fail_on_uncovered_providers is true (default), start_proxy refuses to start."

Reality:

  • hermes_cli/proxy_cli.py:341 - proxy_cfg.setdefault("fail_on_uncovered_providers", False)
  • hermes_cli/proxy_cli.py:387 - if bool(proxy_cfg.get("fail_on_uncovered_providers", False)):
  • hermes_cli/config.py DEFAULT_CONFIG - also False.

So the wizard ships fail-OPEN and the runtime check is gated false-by-default. An operator who reads the module docstring believes that having ANTHROPIC_API_KEY in their host env will refuse-to-start; they get the silent-fall-through path instead, and the sandbox boots with real x-api-key bypassing the proxy. This is exactly the threat surface the comment claims to defend.

Fix (pick one - they have different implications, so this needs your judgment):

  • Safer default: flip default to True everywhere (setdefault in cmd_setup, .get(..., True) in cmd_start, DEFAULT_CONFIG entry). Matches docstring; matches the spirit of "this is a security feature, default fail-closed". Cost: more operator friction on first run if they happen to have any provider env vars set.
  • Truth-in-advertising: leave default at False, but update the docstring and the user docs to say "default false; opt-in via proxy.fail_on_uncovered_providers: true in config".

The current state (docstring promises fail-closed; behavior is fail-open) is the bad option.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b via the truth-in-advertising path + tier split. The module docstring now correctly states default=False. Separately, _NON_BEARER_PROVIDERS is split into a strict LLM-specific tier (_LLM_SPECIFIC_NON_BEARER_PROVIDERS — Anthropic / Azure / Gemini) that BLOCKS start when fail_on_uncovered_providers: true, vs a generic uncovered tier (AWS_*, GCP appdefault) that's surfaced as warnings only. New discover_blocked_providers() returns the strict subset. Test test_blocked_providers_subset_of_uncovered regression-guards the subset relationship. Operators with terraform/gcloud configured no longer hit refuse-start for unrelated tooling.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
)
if _port_listening("127.0.0.1", tunnel_port):
break
time.sleep(0.1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P1

The poll loop at lines 1271-1281:

deadline = time.time() + _STARTUP_GRACE_SECONDS
while time.time() < deadline:
    if proc.poll() is not None:
        ...raise...
    if _port_listening("127.0.0.1", tunnel_port):
        break
    time.sleep(0.1)

If the binary is still running but never binds within 5s (slow start, port collision masked by deferred bind, address-already-in-use after a SO_REUSEADDR window, kernel pause), the loop exits via while time.time() < deadline reaching false. There is NO post-loop check that the port actually came up - only that the process is still running (line 1283). The function then writes the pidfile (line 1295) and get_status() returns with pid set but listening=False. Docker container creation (which calls get_status and only checks status.binary_path / status.config_path, not listening) proceeds, and every upstream request 401s inside the sandbox with no clear signal back to the operator.

Additionally: if the binary IS hung (running but unresponsive), it gets a recorded pidfile and orphans on the next start because _pid_alive returns True.

Fix: require port-listening for success. After the loop, explicitly:

if not _port_listening("127.0.0.1", tunnel_port):
    # Kill the child so we don't orphan a non-listening daemon.
    try:
        proc.terminate()
        proc.wait(timeout=2)
    except (OSError, subprocess.TimeoutExpired):
        try:
            proc.kill()
        except OSError:
            pass
    tail = _tail_log(log_path, lines=20)
    raise RuntimeError(
        f"iron-proxy did not bind 127.0.0.1:{tunnel_port} within "
        f"{_STARTUP_GRACE_SECONDS}s. Last log lines:\n{tail}"
    )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Refactored the poll loop into a do-while shape, require listening=True for success, kill the child + unlink the pidfile on every failure path (deadline-expired, process exited, KeyboardInterrupt). Existing test test_start_proxy_writes_pidfile_when_alive updated to mock _port_listening=True to exercise the success path under the new contract.

Comment thread agent/proxy_sources/iron_proxy.py Outdated
# immediately after Popen (the child has its own dup). Without the
# close-on-success path, every restart leaked one fd in the Hermes
# process.
log_fd = os.open(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P2 (symmetry gap with the same PR's hardenings)

ensure_audit_log (line 845) uses O_NOFOLLOW to refuse a planted symlink. The pidfile open (line 1295) uses O_NOFOLLOW. But the daemon stdout/stderr log opened here:

log_fd = os.open(
    str(log_path),
    os.O_WRONLY | os.O_CREAT | os.O_APPEND,
    0o600,
)

does NOT include O_NOFOLLOW. A same-uid attacker who plants ~/.hermes/proxy/iron-proxy.log as a symlink to, say, ~/.ssh/authorized_keys causes the daemon to append iron-proxy diagnostic output to that file across every hermes egress start. Limited but not zero; same threat model as the pidfile defense.

Fix: mirror the audit-log pattern at line 845-847:

open_flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
if hasattr(os, "O_NOFOLLOW"):
    open_flags |= os.O_NOFOLLOW
log_fd = os.open(str(log_path), open_flags, 0o600)

Optionally also add the os.fstat(fd).st_uid == os.getuid() check that the pidfile path uses.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Log file open mirrors the pidfile + audit-log hardenings: O_WRONLY | O_CREAT | O_APPEND | O_NOFOLLOW, explicit 0o600 mode, os.fchmod to tighten if pre-existing, and os.fstat().st_uid == os.getuid() ownership check. ELOOP on a planted symlink surfaces as a RuntimeError naming the path.

Comment thread hermes_cli/proxy_cli.py
"from secrets.bitwarden config instead of the current env. Fails "
"loudly if BW is unreachable rather than silently falling back.",
)
setup.add_argument(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3 (foot-gun UX)

--rotate-tokens on its own is a useful escape hatch (operator suspects token compromise, wants to mint fresh ones), but two ergonomics issues:

  1. No confirmation, no backup. Running with the flag on a healthy install rotates every token immediately. Every running sandbox loses its mappings on next request, no warning, no undo. An accidental rerun (history scroll-back, tmux paste) is unrecoverable.

  2. Silent no-op on first-time setup. merge_mappings(existing=[], discovered=..., rotate=True) is indistinguishable from rotate=False (no overlap to rotate). The "tokens rotated" warning at proxy_cli.py:264 only fires when existing is non-empty. Operator who deliberately requested rotation on a fresh setup gets no feedback that the flag was a no-op.

Fix:

if args.rotate_tokens and existing_mappings:
    if not console.input(
        "[yellow]This will invalidate proxy tokens in every running "
        "sandbox. Type 'rotate' to confirm: [/yellow]"
    ).strip().lower() == "rotate":
        console.print("[yellow]Cancelled.[/yellow]")
        return 1
    # Backup before write.
    backup = mappings_path.with_suffix(f".rotated-{int(time.time())}")
    shutil.copy2(mappings_path, backup)
    console.print(f"  [dim]backup: {backup}[/dim]")

if args.rotate_tokens and not existing_mappings:
    console.print(
        "[dim]Note: --rotate-tokens is a no-op on first-time setup "
        "(no existing tokens to rotate).[/dim]"
    )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. When --rotate-tokens is passed AND there are existing mappings AND stdin is a tty, the wizard now prompts for explicit rotate confirmation; non-tty (CI) skips the prompt since the flag was clearly deliberate. Before any overwrite we copy mappings.json to a sibling mappings.json.rotated-<YYYYMMDDTHHmmss> and print the path so manual recovery is possible. When --rotate-tokens is passed with no existing mappings, we print a Note: ... no-op dim message so the operator sees feedback.


def _reset_for_tests() -> None:
"""No-op today — kept symmetric with bitwarden._reset_cache_for_tests."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3 (test-isolation contract gap)

The helper is currently:

def _reset_for_tests() -> None:
    """No-op today - kept symmetric with bitwarden._reset_cache_for_tests."""
    return None

But this module now owns two mutable module-level globals introduced by the fix commit:

  • _VERSION_CACHE (line ~210) - caches subprocess output keyed by binary path.
  • _proxy_nonce (line ~1057, set in start_proxy) - a strong-proof token for _pid_alive.

Today the repo's tests run each file in its own subprocess (per AGENTS.md), so leakage is bounded. But any in-process caller (a future ipython notebook test, a pytest -p no:xdist invocation, anyone importing this module from a non-pytest script) will see whichever cached version was probed first, regardless of subsequent install_iron_proxy(force=True) calls in the same process.

Fix: populate the helper to live up to its name:

def _reset_for_tests() -> None:
    """Clear module-level caches so tests get a fresh start."""
    global _proxy_nonce
    _VERSION_CACHE.clear()
    _proxy_nonce = None

Either that, or delete the helper and its symmetric-with-bitwarden docstring, since the symmetry it claims is currently a lie.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. _reset_for_tests() now actually clears _VERSION_CACHE and _proxy_nonce. Test test_reset_for_tests_clears_version_cache_and_nonce regression-guards by populating both and asserting they're empty after the call.

Comment thread tests/test_iron_proxy_cli.py Outdated


def test_cmd_install_success_returns_0(hermes_home, monkeypatch):
monkeypatch.setattr(ip, "install_iron_proxy", lambda **kw: Path("/tmp/iron-proxy"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3 (project-standards regression)

The fix commit just replaced hardcoded Path('/tmp/ca.*') paths in test_iron_proxy.py with tmp_path-derived fixtures (closing prior finding #24). The brand-new sibling file tests/test_iron_proxy_cli.py added in the same commit reintroduces the pattern:

  • tests/test_iron_proxy_cli.py:60 - Path("/tmp/iron-proxy") (binary_path)
  • tests/test_iron_proxy_cli.py:91 - Path("/tmp/proxy.yaml")
  • tests/test_iron_proxy_cli.py:121 - Path("/tmp/iron-proxy")
  • tests/test_iron_proxy_cli.py:147 - Path("/tmp/iron-proxy") and Path("/tmp/proxy.yaml")
  • tests/test_iron_proxy_cli.py:170 - Path("/tmp/proxy.yaml")
  • tests/test_iron_proxy_cli.py:338 - Path("/tmp/iron-proxy")

These are mostly mock return values rather than files actually written, but the standards reason for the prior fix (hermeticity, cross-host portability, Windows / CI variants) applies equally - and the regression risk is the next test author copy-pastes from this file and writes a real Path("/tmp/foo") they DO open.

Fix: each test already takes the hermes_home fixture (tmp_path-backed) or can take tmp_path directly. Replace each Path("/tmp/iron-proxy") with tmp_path / "iron-proxy" and Path("/tmp/proxy.yaml") with tmp_path / "proxy.yaml". Pure mechanical change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. All Path("/tmp/iron-proxy") and Path("/tmp/proxy.yaml") literals in tests/test_iron_proxy_cli.py replaced with hermes_home / "..." (the existing tmp_path-backed fixture). Mechanical sed pass; no behavioral change.


def _build_proxy_subprocess_env(
*,
extra_env: Optional[Dict[str, str]] = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3 (dead code; cleanup)

_build_proxy_subprocess_env accepts extra_env: Optional[Dict[str, str]] = None (line 1329), receives it from start_proxy(..., extra_env: Optional[Dict[str, str]] = None, ...) (line 1168), threaded through at line 1210, but there is no caller in the repo passing a non-None value. Grep confirms.

Two failure modes:

  • A future caller starts passing extra_env expecting the override to win - but _PROXY_SUBPROCESS_ENV_STRIP applies AFTER the merge, so extra_env={"HTTPS_PROXY": "..."} would be silently dropped. The contract is ambiguous because there's no test that pins it.
  • API surface bloat. The kwarg appears in __all__ indirectly via start_proxy. New caller has to read the implementation to figure out what it does.

Fix: remove the parameter from both signatures, the call site, and the docstrings. If a real use case appears later, add it back with a test that pins precedence semantics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged but kept in fa4e87b. Removing extra_env is a breaking change for any out-of-tree caller, and the kwarg does work correctly (the precedence — caller-overrides win, then strip — is documented in the docstring). If we see a concrete bug from the ambiguity, we'll add a test that pins the contract. Filed as a separate cleanup so the BC implications get a proper discussion.

# Capture starttime BEFORE signalling so we can compare after the
# grace window — if the pid got recycled mid-wait, the starttime
# changes and we abort the SIGKILL.
starttime_before = _pid_proc_starttime(pid)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity: P3 (test coverage gap on a security-relevant fix)

Prior finding #16 was upgraded to "stop_proxy now captures /proc/<pid>/stat[21] before SIGTERM and re-verifies after the grace window; SIGKILL is suppressed if starttime changed." That fix is correct, but:

  • _pid_proc_starttime (the helper that parses /proc/<pid>/stat field 22 - 1-indexed - past the parenthesised comm) has no direct unit test. The field-index math is the kind of subtle thing that breaks silently when someone refactors. A comm containing literal ) characters (perfectly legal - try gcc -DTEST -o ")) tail" main.c ; or a kernel thread [migration/0] with brackets) shifts the field count.
  • The stop_proxy SIGTERM → starttime-recompare → suppress-SIGKILL path has no test. The PID-recycle scenario is hard to reproduce but easy to fake-test: monkeypatch _pid_proc_starttime to return different values on call N vs N+1 and assert SIGKILL is suppressed.

Fix: add the missing tests. Two focused tests:

def test_pid_proc_starttime_parses_comm_with_parens(tmp_path, monkeypatch):
    # Simulate /proc/<pid>/stat with a comm containing ')' - e.g. "((bad))"
    # and assert _pid_proc_starttime returns the field after the LAST ')'.
    ...

def test_stop_proxy_suppresses_sigkill_on_pid_recycle(monkeypatch):
    # Make _pid_proc_starttime return X first, then Y after SIGTERM.
    # Assert SIGKILL is NOT issued.
    ...

These guard the security property of fix #16 across refactors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa4e87b. Two new tests:

  1. test_pid_proc_starttime_parses_comm_with_parens — synthetic /proc/<pid>/stat with a comm containing ) and space characters; asserts the parser returns the correct starttime via the rfind(')') path.
  2. test_stop_proxy_suppresses_sigkill_on_pid_recycle — mocks _pid_proc_starttime to return different values before/after the SIGTERM grace, asserts SIGTERM is sent but SIGKILL is NOT (recycled detection).

…ings

GodsBoy 2nd-round P1 (all 4 addressed):
- _detect_docker_bridge_ip: replace `ip.count('.') == 3` heuristic with
  ipaddress.IPv4Address validation + reject unspecified/loopback/multicast/
  reserved/link-local/global addresses.  Hostile `ip` shim on PATH used to
  be able to inject 0.0.0.0 here and re-open INADDR_ANY binding.
- cmd_setup credential_source preservation: re-running `hermes egress
  setup` without --from-bitwarden no longer silently downgrades a previous
  bitwarden config back to env.  Require --no-bitwarden to switch
  explicitly; otherwise preserve the existing mode and surface the
  decision.
- fail_on_uncovered_providers docstring/default mismatch: docstring used
  to claim default=True; behavior was default=False.  Resolved by
  truth-in-advertising — docstring now correctly states default=False —
  AND splitting providers into a strict LLM-specific tier
  (_LLM_SPECIFIC_NON_BEARER_PROVIDERS, used by start blocking) and a
  generic uncovered tier (used by wizard warnings).  Generic cloud creds
  (AWS_*, GOOGLE_APPLICATION_CREDENTIALS) no longer trip refuse-start
  for operators using terraform/gcloud alongside Hermes.  New
  discover_blocked_providers() returns the strict subset.
- start_proxy poll-loop must verify listening before pidfile:
  previously fell through deadline-expired as success and wrote a
  pidfile for a non-listening daemon.  Refactored into a do-while
  shape, require `listening=True` for success, kill the child + unlink
  the pidfile on failure paths.

GodsBoy 2nd-round P2 (the worth-keeping subset):
- O_NOFOLLOW + 0o600 + st_uid check on iron-proxy.log open (symmetric
  with the pidfile and audit-log paths the same PR hardens).
- pidfile O_EXCL: refactored pidfile-write into _write_pidfile_safely
  which uses O_EXCL to detect concurrent starts.  EEXIST with a live
  pid means "another start in progress" — refuse with actionable
  message; EEXIST with a dead pid means "stale crash" — unlink and
  retry once.  Discriminates rather than racing.
- _VERSION_CACHE: invalidate on install_iron_proxy success;
  don't cache empty stdout (would poison `hermes egress status` for
  the lifetime of the process if first probe hit a corrupt binary).
- ensure_audit_log now RAISES on OSError instead of swallowing it as
  a warning.  Previous behavior let the daemon create the file under
  the default umask, exactly the world-readable scenario the helper
  was built to prevent.  cmd_setup catches the new RuntimeError and
  surfaces "✗" with the actionable message.
- SIGINT/SIGTERM handler scoped around the start_proxy poll loop:
  Ctrl-C while waiting for `hermes egress start` no longer leaks an
  orphan daemon with the port bound.  Handler kills the child +
  unlinks the pidfile before re-raising.
- pidfile written IMMEDIATELY after Popen, BEFORE the listening
  verification.  Parent dying during the poll loop now leaves a
  pidfile pointing at the orphan so the next `hermes egress stop` can
  clean up.  Failure paths in the poll loop explicitly unlink.
- _DEFAULT_UPSTREAM_DENY_CIDRS: add ::ffff:0:0/96 (IPv4-mapped IPv6 —
  closes the v6-resolved IMDS bypass), 100.64.0.0/10 (CGNAT / cloud
  overlays / K8s pod networks), 198.18.0.0/15 (RFC2544 benchmark).
- _NON_BEARER_PROVIDERS split into LLM-specific (Anthropic / Azure /
  Gemini — block when strict) vs generic-cloud (AWS_*, GCP appdefault
  — warn-only).
- docker.py except narrowing: load_config can raise yaml.YAMLError on
  a malformed config.yaml, not just ImportError.  Two callsites
  (collision check + precedence resolution) now catch yaml.YAMLError
  via a sentinel `import yaml` and fail-safe to enforced mode.

GodsBoy 2nd-round P3:
- _reset_for_tests: was a no-op claiming symmetry with bitwarden;
  now actually clears _VERSION_CACHE and _proxy_nonce so in-process
  callers (notebooks, pytest -p no:xdist) don't see state leakage.
- tests/test_iron_proxy_cli.py: replaced hardcoded Path("/tmp/...")
  with hermes_home/-derived fixtures.  Matches the same cleanup we
  did for test_iron_proxy.py in the previous round.
- --rotate-tokens confirmation gate: when there are existing tokens,
  prompt for "rotate" confirmation (skipped when stdin isn't a tty
  so CI/scripted use still works) AND back up the mappings to a
  timestamped sibling before overwriting.  Surface a no-op note when
  rotate is requested with no existing tokens.

stephenschoettler (runtime-boundary review):
- #1 BWS silent degrade at proxy start: when credential_source=bitwarden
  but the BWS access token or project_id is missing OR the fetch
  returns no values for mapped providers, raise instead of silently
  falling back to host env.  cmd_start also pre-checks at the wizard
  layer for actionable error messages.  Opt-in escape hatch via new
  `proxy.allow_env_fallback: true` config for migration scenarios.
- #2 docker_env collision detection extended: `docker_env:
  {OPENROUTER_API_KEY: sk-real}` in config.yaml with enforce_on_docker:
  true now raises just like an HTTPS_PROXY collision would.  The
  collision check pulls mapped provider names from load_mappings() at
  call time.
- #3 PID nonce persisted to disk: cross-CLI-invocation stale-pidfile
  defense now works.  start_proxy writes the nonce next to the pidfile
  (sibling 0o600), stop_proxy reads it back via _read_persisted_nonce()
  and uses it as a _pid_alive signal in the new process.  Falls back
  to argv0 basename matching when the file is missing (legacy install).

arshkumarsingh:
- #1 NODE_OPTIONS append-merge: egress dict no longer sets NODE_OPTIONS
  directly (would clobber the operator's --max-old-space-size etc.).
  Carry the egress flag in a sentinel key
  _HERMES_EGRESS_NODE_OPTIONS_APPEND; DockerEnvironment merges into the
  existing NODE_OPTIONS in env_args computation with de-duplication.
- #2 docs: structured per-request audit log is at audit.log, not
  iron-proxy.log (the latter is daemon stdout/stderr).  Diagram and
  step-7 text corrected; both file roles are now documented separately.

Tests
- Added 12 new tests in test_iron_proxy.py covering bridge-IP rejection
  (parametrized over 8 dangerous inputs), default deny-list adjacency
  (IPv4-mapped-v6 + CGNAT), blocked-providers strict-subset property,
  _pid_proc_starttime parser with paren-containing comm,
  stop_proxy SIGKILL suppression on starttime drift, _reset_for_tests
  clear behavior, iron_proxy_version don't-cache-empty, NODE_OPTIONS
  sentinel verification, ensure_audit_log raise-on-OSError, and
  persisted-nonce roundtrip.
- Added 1 new test in test_iron_proxy_cli.py covering cmd_start
  BWS-token-missing fail-loud.
- All 100 tests in test_iron_proxy + test_iron_proxy_cli pass; all 78
  tests in test_docker_environment + test_config still pass.

Acknowledged but not addressed:
- GodsBoy P3 dead-code `extra_env` kwarg: kept (removing is a breaking
  change for any out-of-tree caller; the kwarg is documented and works).
- Residual risks GodsBoy called out: iron-proxy in-memory secret
  zeroisation (Go-binary territory, out of scope); _PROXY_SUBPROCESS_ENV
  _ALLOWLIST cosmetic gaps (RUST_LOG, GOMAXPROCS); follow-up.
@teknium1

Copy link
Copy Markdown
Contributor Author

Thanks @GodsBoy @stephenschoettler @arshkumarsingh — all three rounds of feedback addressed in fa4e87b25.

GodsBoy P1 — all 4 addressed

# Finding Fix
1 _detect_docker_bridge_ip parser regression risk Replaced count(".") == 3 heuristic with ipaddress.IPv4Address validation + reject unspecified / loopback / multicast / reserved / link_local / global. Hostile ip shim on PATH can no longer inject 0.0.0.0. 9 new tests cover the rejection matrix.
2 cmd_setup overwrites credential_source on re-run Restored preservation semantics: re-running hermes egress setup without --from-bitwarden no longer silently downgrades back to env. New --no-bitwarden flag to switch back explicitly; otherwise keep the existing mode and surface the decision.
3 fail_on_uncovered_providers docstring/default contradicts Resolved by truth-in-advertising: docstring now correctly states default=False, AND split providers into a strict LLM-specific tier (_LLM_SPECIFIC_NON_BEARER_PROVIDERSdiscover_blocked_providers(), used by start blocking) vs a generic uncovered tier (used by wizard warnings). Generic cloud creds (AWS_*, GCP appdefault) no longer trip refuse-start for operators running terraform/gcloud alongside Hermes.
4 start_proxy poll loop falls through deadline as success Refactored into a do-while; require listening=True for success; kill the child + unlink the pidfile on failure paths. No more "pidfile written for a non-listening daemon".

GodsBoy P2 — all 9 addressed

  • O_NOFOLLOW + 0o600 + st_uid check on iron-proxy.log open (symmetric with pidfile + audit-log hardenings).
  • Pidfile O_EXCL via new _write_pidfile_safely() helper that discriminates concurrent-start (EEXIST + live pid → refuse with actionable msg) from stale-crash (EEXIST + dead pid → unlink + retry once).
  • _VERSION_CACHE invalidates on install_iron_proxy success; no longer caches empty stdout.
  • ensure_audit_log now raises on OSError instead of swallowing. The previous swallow let the daemon create the file under default umask — exactly the world-readable scenario the helper exists to prevent. cmd_setup catches the new RuntimeError and surfaces .
  • SIGINT/SIGTERM handler scoped around the poll loop in start_proxy. Ctrl-C while waiting for hermes egress start no longer leaks an orphan with the port bound.
  • Pidfile written immediately after Popen (BEFORE listening verification). Parent dying during the poll loop now leaves a pidfile pointing at the orphan so the next stop can clean up. Poll-loop failure paths explicitly unlink.
  • _DEFAULT_UPSTREAM_DENY_CIDRS extended: ::ffff:0:0/96 (IPv4-mapped-v6 — closes the v6-resolved IMDS bypass), 100.64.0.0/10 (CGNAT / K8s pod networks), 198.18.0.0/15 (RFC2544 benchmark).
  • _NON_BEARER_PROVIDERS tier split — see P1 Architecture planning #3 above.
  • docker.py except narrowing: load_config can raise yaml.YAMLError on a malformed config.yaml, not just ImportError. Both call sites now catch yaml.YAMLError and fail-safe to enforced mode.

GodsBoy P3 — 3 of 5 addressed

  • _reset_for_tests no longer a lie — actually clears _VERSION_CACHE + _proxy_nonce.
  • tests/test_iron_proxy_cli.py /tmp/... paths replaced with tmp_path/hermes_home-derived fixtures.
  • --rotate-tokens confirmation gate + backup: prompt for rotate confirmation when there are existing tokens (skipped under non-tty for CI), back up the current mappings.json to a timestamped sibling before overwriting, surface a no-op note when rotate is requested with no existing tokens.
  • Two P3s deferred: the dead-code extra_env kwarg on start_proxy (removing is a breaking change for any out-of-tree caller; the kwarg works correctly and is documented). New tests added for _pid_proc_starttime parser + stop_proxy SIGKILL-suppress path satisfy the test-coverage gap.

@stephenschoettler — all 3 findings addressed

  1. BWS silent degrade at proxy start. When credential_source=bitwarden but the BWS access token / project_id is missing OR the fetch returns no values for mapped providers, _build_proxy_subprocess_env now raises instead of falling back to host env. cmd_start also pre-checks at the wizard layer for actionable error messages. Opt-in escape hatch via new proxy.allow_env_fallback: true config for migration scenarios.
  2. docker_env mapped-provider collision. Collision detection extended: docker_env: {OPENROUTER_API_KEY: sk-real} with enforce_on_docker: true now raises just like an HTTPS_PROXY collision. The check pulls mapped provider names from load_mappings() at call time, so it stays in sync.
  3. PID nonce cross-CLI invocation. start_proxy now persists the nonce to disk (sibling 0o600 file iron-proxy.nonce); _pid_alive in a later process reads it via _read_persisted_nonce() and uses it as the strongest match signal before falling back to argv0 basename. Cross-process stale-pidfile defense now holds.

@arshkumarsingh — both addressed

  1. NODE_OPTIONS append-merge. Egress dict no longer sets NODE_OPTIONS directly. Carries the flag in a sentinel key _HERMES_EGRESS_NODE_OPTIONS_APPEND; DockerEnvironment merges into the operator's existing NODE_OPTIONS in env_args computation with de-duplication. User's --max-old-space-size=8192 etc. preserved.
  2. Docs audit log path. Diagram and step-7 fixed: structured per-request audit log is at ~/.hermes/proxy/audit.log; iron-proxy.log is the separate daemon stdout/stderr. Both file roles now documented.

Tests

12 new tests in test_iron_proxy.py + 1 new test in test_iron_proxy_cli.py. All 100/100 in those files pass locally; 78/78 in test_docker_environment + test_config still pass.

Residual risks acknowledged (not addressed in this round)

  • iron-proxy in-memory secret zeroisation — Go-binary territory, out of scope.
  • _PROXY_SUBPROCESS_ENV_ALLOWLIST cosmetic gaps (RUST_LOG, GOMAXPROCS, XDG_*) — debug-cost only, follow-up.
  • Token-rotation auto-restart (running iron-proxy still uses the old YAML config; manual stop+start required after setup --rotate-tokens) — docs note, follow-up.
  • Dead extra_env kwarg — see above; kept for back-compat.

Threads queued under the individual findings.

internals reference

Pre-v3 the egress docs were 175 lines covering the basics: quick start,
slash commands, security model, failure modes.  After three rounds of
PR review we added a half-dozen new config knobs, two new flags, a
strict/warn tier split for uncovered providers, persisted-nonce
cross-process defense, audit-log + log-file separation, NODE_OPTIONS
append-merge, docker_env collision detection, etc. — none of which
the user-facing doc reflected.

This commit closes that gap end-to-end:

website/docs/user-guide/egress/iron-proxy.md (175 → 567 lines)
- Configuration section expanded with every new knob:
  fail_on_uncovered_providers, allow_env_fallback, upstream_deny_cidrs.
- Tables for default allowed hosts + default deny CIDRs.
- Bind policy section (loopback + docker bridge, NOT 0.0.0.0) with the
  operator-facing "why can't I hit the proxy from my LAN" answer.
- Uncovered providers section with the strict tier (Anthropic / Azure
  / Gemini — block when fail_on_uncovered_providers=true) vs warn tier
  (AWS, GCP appdefault — present on every dev laptop, never block).
- Bitwarden integration expanded: rotation semantics, fail-loud at
  start, the allow_env_fallback escape hatch, --no-bitwarden flag, the
  preserve-existing-source rule on plain re-setup.
- Slash commands section with --no-bitwarden, --rotate-tokens, and the
  token-rotation operator playbook (confirmation gate, backup file
  naming, restart-required caveat).
- State directory layout table covering all 9 files we create + their
  modes.
- Audit log vs daemon log distinction (the arshkumarsingh #2 fix that
  motivated the corrected diagram).
- CA distribution into the sandbox: full table of injected env vars,
  the Python/curl REPLACE vs Node ADD asymmetry caveat with the
  NODE_OPTIONS=--use-openssl-ca mitigation.
- docker_env collision detection: what gets blocked, what gets warned,
  the migration escape hatch.
- PID + nonce defense section explaining how iron-proxy.nonce works
  cross-CLI and the SIGKILL-suppress-on-recycle path.
- Security model expanded with the new defenses
  (IPv4-mapped-v6 IMDS bypass closure, env-var leakage prevention,
  LAN-peer-with-token-leak coverage).
- Failure modes extended for every new refuse-start path.
- Troubleshooting section (180 new lines) with grep-friendly error
  matchers for each common failure: BWS token missing, uncovered
  provider refused, port collision, slow bind, 403 from proxy, SSL
  verification errors inside the sandbox, 401 from upstreams, address-
  in-use orphan recovery, per-request audit log inspection.

website/docs/getting-started/quickstart.md
- One-paragraph mention of the egress proxy under "Sandboxed terminal"
  so operators discover the feature when they enable Docker isolation.

website/docs/reference/cli-commands.md
- Top-level command table now lists `hermes egress` alongside `hermes
  proxy` (different purpose, different direction — call it out).
- New `## hermes egress` section with full subcommand syntax, common
  flows (first-time setup, switching credential source, rotating
  tokens, adding upstream), and diagnostic shortcuts.

website/docs/reference/environment-variables.md
- New "Egress proxy (sandbox-injected)" section documenting every env
  var the Docker backend injects: HERMES_EGRESS_PROXY,
  HERMES_PROXY_TOKEN_<NAME>, HTTPS_PROXY/HTTP_PROXY/NO_PROXY,
  REQUESTS_CA_BUNDLE/SSL_CERT_FILE/CURL_CA_BUNDLE/NODE_EXTRA_CA_CERTS,
  NODE_OPTIONS append-merge, HERMES_IRON_PROXY_NONCE.
- Also fixes a stale layout issue with the Persistent Shell table that
  had two trailing rows getting orphaned in the v3 commit.

website/docs/developer-guide/egress-internals.md (NEW, 363 lines)
- Module layout map (which file owns what).
- Full lifecycle walkthrough for install / setup / start / stop with
  the actual function calls in order.
- "Security invariants" section enumerating every load-bearing property
  with the regression test name that guards it.  These are the rules
  contributors must preserve when touching the module:
  - filesystem perms (0o700 dir, 0o600 secrets, O_NOFOLLOW everywhere)
  - subprocess env minimisation (no os.environ.copy)
  - bind policy (loopback + docker bridge, never 0.0.0.0)
  - default deny CIDR coverage
  - audit log fail-loud
  - bitwarden fail-loud
  - docker_env collision detection
  - PID recycling defense
  - token preservation on re-setup
  - credential_source preservation
- Extension points: adding a bearer-token provider, adding a
  non-bearer provider, wiring iron-proxy into a non-Docker backend,
  subscribing to per-request audit events.
- Testing recipe (hermetic + E2E + CLI smoke).

website/sidebars.ts
- New `developer-guide/egress-internals` entry under Developer Guide
  → Internals (alongside acp-internals, cron-internals,
  trajectory-format).

Build verification
- `cd website && npm install && npx docusaurus build` succeeds locally.
- All three new pages render to static HTML in all three locales
  (en + zh-Hans + ko).
- No new broken links or broken anchors introduced (pre-existing
  warnings on translation stubs are unrelated).
@teknium1

Copy link
Copy Markdown
Contributor Author

Pushed 906b1da57 — comprehensive docs expansion to match what the code actually does after the three review rounds.

What landed

website/docs/user-guide/egress/iron-proxy.md — expanded from 175 → 567 lines, covering every operator-facing surface introduced across the three fix rounds:

  • All new config knobs (fail_on_uncovered_providers, allow_env_fallback, upstream_deny_cidrs, plus tables for the default allowed hosts + default deny CIDRs)
  • Bind policy section explaining loopback + docker bridge (NOT 0.0.0.0) and the hostile-ip-shim defense
  • Uncovered-providers section with the strict (Anthropic / Azure / Gemini → block) vs warn (AWS / GCP appdefault → never block) tier split, including the "why isn't AWS_PROFILE blocking my start?" answer
  • Bitwarden rotation semantics: when refresh fires, fail-loud paths, the allow_env_fallback escape hatch, --no-bitwarden flag, and the preserve-existing-source rule on plain re-setup
  • Full slash command reference with --no-bitwarden, --rotate-tokens, and the token-rotation operator playbook (confirmation gate, backup file naming, restart caveat)
  • State directory layout table covering all 9 files we create + their modes
  • Audit log vs daemon log distinction (with the corrected diagram from @arshkumarsingh's earlier finding)
  • CA distribution table for the seven injected env vars, plus the Python-replaces vs Node-adds asymmetry caveat with NODE_OPTIONS=--use-openssl-ca mitigation
  • docker_env collision detection: what blocks, what warns, the migration escape hatch
  • PID + nonce defense: in-memory + on-disk dual-source, SIGKILL-suppress-on-recycle path
  • Security model expanded with the new defenses (IPv4-mapped-v6 IMDS bypass, env-var leakage, LAN-peer-with-token-leak)
  • Troubleshooting section (180 new lines): grep-friendly error matchers for BWS token missing, uncovered-provider refuse-start, port collision, slow bind, 403 from proxy, SSL verification inside the sandbox, 401 from upstreams, address-in-use orphan recovery, audit log inspection

website/docs/developer-guide/egress-internals.md (new, 363 lines) — for plugin / contributor devs:

  • Module layout map
  • Full lifecycle walkthrough (install / setup / start / stop) with actual function calls in order
  • Security invariants section enumerating the load-bearing properties with the regression test name guarding each one — filesystem perms, subprocess env minimization, bind policy, default deny CIDRs, audit log fail-loud, Bitwarden fail-loud, docker_env collision detection, PID recycling defense, token preservation, credential_source preservation. This is what future contributors must preserve when touching the module.
  • Extension points: adding bearer + non-bearer providers, wiring iron-proxy into Modal / Daytona / SSH, subscribing to per-request audit events
  • Testing recipe (hermetic + E2E + CLI smoke)

Cross-references updated:

  • getting-started/quickstart.md — one-paragraph mention under "Sandboxed terminal" so operators discover the feature when they enable Docker isolation
  • reference/cli-commands.mdhermes egress row in the top-level table + a full subcommand section with common flows + diagnostic shortcuts
  • reference/environment-variables.md — new "Egress proxy (sandbox-injected)" section documenting every injected var (HERMES_EGRESS_PROXY, HERMES_PROXY_TOKEN_<NAME>, HTTPS_PROXY/etc., the four CA-bundle vars, NODE_OPTIONS append-merge, HERMES_IRON_PROXY_NONCE)
  • sidebars.ts — new developer-guide/egress-internals entry under Developer Guide → Internals

Build verification

cd website && npm install && npx docusaurus build
[SUCCESS] Generated static files in "build".

All three new pages render to static HTML in all three locales (en + zh-Hans + ko). No new broken links or broken anchors introduced (the pre-existing warnings on translation stubs are unrelated to these changes).

Coverage status

Topic Doc location
Quick start iron-proxy.md → Quick start
Full config schema with comments iron-proxy.md → Configuration
Default allowed hosts + deny CIDRs (tables) iron-proxy.md → Configuration
Bind policy + LAN exposure rationale iron-proxy.md → Configuration → Bind policy
Uncovered providers (strict vs warn tier) iron-proxy.md → Uncovered providers
Bitwarden setup + rotation + fail-loud semantics iron-proxy.md → Bitwarden integration
All 7 slash commands (incl. new --no-bitwarden, --rotate-tokens) iron-proxy.md → Slash commands
State directory layout + file modes iron-proxy.md → State directory layout
Audit log vs daemon log roles iron-proxy.md → State directory layout → Audit log vs daemon log
Data flow diagram + step-by-step request walk iron-proxy.md → How it works
CA distribution into sandbox + Node.js CA asymmetry iron-proxy.md → How it works → CA distribution / Node.js asymmetric CA caveat
docker_env collisions iron-proxy.md → How it works → docker_env collisions
PID + nonce defense iron-proxy.md → PID and nonce defense
Security model (what it protects, what it doesn't) iron-proxy.md → Security model
11 named failure modes iron-proxy.md → Failure modes
Troubleshooting (9 grep-friendly scenarios) iron-proxy.md → Troubleshooting
Limitations + v1 follow-ups iron-proxy.md → Limitations
Module + file layout for contributors egress-internals.md → Module layout
Lifecycle with actual function calls egress-internals.md → Lifecycle
Security invariants with regression test names egress-internals.md → Security invariants
Adding a new bearer / non-bearer provider egress-internals.md → Extension points
Wiring iron-proxy into a non-Docker backend egress-internals.md → Extension points
hermes egress reference + common flows reference/cli-commands.md
Sandbox-injected env vars reference/environment-variables.md
First-time discovery from quickstart getting-started/quickstart.md

Everything ships in the same PR so the docs land with the code.

teknium1 added 2 commits May 25, 2026 18:37
Single content conflict in hermes_cli/config.py — kept BOTH the
paste_collapse_threshold knobs from main and the proxy section from
this branch (they're independent additions to DEFAULT_CONFIG).

All 187 tests in test_iron_proxy.py + test_iron_proxy_cli.py +
test_config.py pass post-merge.
propagate handler exit codes

Live testing the full wizard against the real v0.39.0 binary
(downloaded + extracted via our own install_iron_proxy()) surfaced
three real bugs that the unit tests couldn't catch:

1. `proxy.http_listens` (plural) — NOT a field in v0.39's config struct.
   Our code emitted both `http_listen` (string) and `http_listens`
   (list) believing v0.39 accepts both forms.  The binary actually
   rejects with "field http_listens not found in type config.Proxy"
   at YAML unmarshal time, so the daemon fails to start.  Confirmed
   via strings(1) audit of the v0.39 binary — only `http_listen` is
   tagged.

2. `log.audit_path` — NOT a field in v0.39's config.Log struct.  Same
   class of error: "field audit_path not found in type config.Log".
   Per-request audit-log records are not separable from server-level
   logs at this binary version.

3. `metrics.listen` defaults to ":9090" — which is the SAME port as
   our default `tunnel_port: 9090`.  Result: every operator who runs
   `hermes egress setup` followed by `hermes egress start` gets
   "bind: address already in use" because the proxy listener and the
   metrics listener fight for port 9090.  We now explicitly pin
   `metrics.listen: 127.0.0.1:0` to give it an ephemeral loopback
   port that can never collide with tunnel_port regardless of what
   operator sets.

Plus a fourth bug — pre-existing but surfaced by the egress live
test — that affects every Hermes subcommand:

4. `hermes_cli/main.py` calls `args.func(args)` at the bottom of
   main() but discards the return value.  Every subcommand handler
   that returns a non-zero exit code (cmd_start refusing because
   `fail_on_uncovered_providers=true`, cmd_setup refusing because
   --from-bitwarden but BWS unreachable, etc.) was silently exiting 0.
   Fix: capture the handler's return value and `sys.exit(rc)` when
   it's a non-zero int.  Other subcommands' contracts unchanged
   because they either return 0/None or don't return at all.

Validation:
- 188/188 in test_iron_proxy.py + test_iron_proxy_cli.py +
  test_config.py pass post-fix.
- 5333/5337 in tests/hermes_cli/ pass; the 4 unrelated failures
  (test_managed_installs.py + test_update_hangup_protection.py)
  are pre-existing on main, not touched by this PR.
- Manual wizard run end-to-end with the v0.39.0 binary in an
  isolated HERMES_HOME:
    * `egress install` — downloads + SHA-256 verifies + extracts
    * `egress setup` — generates CA, mints tokens, writes
      proxy.yaml that the binary now accepts (no http_listens,
      no audit_path, metrics pinned to 127.0.0.1:0)
    * `egress start` — daemon binds 127.0.0.1:9090, listens=yes
    * `egress status` — shows pid + listening + mappings
    * `egress stop` — clean shutdown, pidfile + nonce removed
    * Idempotent re-start returns the running pid without spawning
    * curl through the proxy with the openrouter token gets
      forwarded; an attacker host gets HTTP 403 (allowlist works);
      169.254.169.254 gets HTTP 403 (deny CIDR works)
    * Refuse-start paths exit 1 with actionable messages:
      - `fail_on_uncovered_providers=true` + ANTHROPIC_API_KEY set
      - `credential_source=bitwarden` + BWS_ACCESS_TOKEN unset
    * `--rotate-tokens` confirmation gate fires via pty:
      typing 'cancel' aborts; typing 'rotate' proceeds and
      creates a mappings.json.rotated-<timestamp> backup

Test updates:
- `test_default_bind_is_loopback_not_zero_zero` — asserts the
  singular `http_listen` is loopback AND asserts `http_listens`
  (plural) is NOT in the rendered yaml.
- `test_default_bind_uses_loopback_on_linux` — replaces
  `test_default_bind_includes_docker_bridge_on_linux`.  v0.39
  only supports one bind per daemon process, so the docker bridge
  augmentation is dropped from the rendered config; sandboxes
  reach the daemon via host.docker.internal -> host-gateway
  mapping, so loopback-only is functional.
- `test_metrics_listener_pinned_to_loopback_ephemeral` — new
  regression test asserting `metrics.listen == "127.0.0.1:0"`.
- `test_audit_log_kwarg_does_not_inject_audit_path_v039` —
  replaces `test_audit_log_path_lands_in_yaml`.  audit_log kwarg
  is still accepted for forward compatibility but does NOT emit
  log.audit_path until upstream supports it.
@teknium1

Copy link
Copy Markdown
Contributor Author

Live testing surfaced three real bugs — fixed in 905ce58a1

Manual end-to-end run of the wizard against the real iron-proxy v0.39.0 binary (downloaded via our own install_iron_proxy() path) caught three schema-mismatch bugs that the unit tests couldn't see, plus a pre-existing handler-exit-code bug that affected the refuse-start contract.

What the unit tests missed

The unit tests assert what shape the rendered yaml has, but they never spawned the binary to confirm it accepts that shape. Spawning the daemon for real produced:

error: parsing config: yaml: unmarshal errors:
  line 6: field http_listens not found in type config.Proxy
  line 85: field audit_path not found in type config.Log

And later, after fixing those:

ERROR fatal error proxy: http listen: listen tcp 127.0.0.1:9090: bind: address already in use

A strings(1) audit of the v0.39 binary's struct tags confirmed: only the singular http_listen is present, no http_listens; no audit_path in the Log struct; the metrics server has its own :9090 default that fights the proxy listener.

Fixes

1. Drop http_listens (plural) from the rendered yaml. The earlier comment claimed "iron-proxy v0.39 accepts both; v0.40+ is listen-list-only" — empirically wrong. The binary rejects the plural form. We now emit only the singular http_listen (loopback). Sandboxes reach the daemon via host.docker.internal -> host-gateway mapping which Docker handles on its own, so loopback-only is functional. The _default_http_listen() helper still computes both addresses but only the first one is wired into the rendered config.

2. Drop log.audit_path from the rendered yaml. Same class of bug — not a field in v0.39's config.Log struct. The audit_log kwarg is kept on build_proxy_config() for forward compatibility (and ensure_audit_log() still pre-creates the file at 0o600 as a logrotate/monitoring sentinel), but we don't try to push it through the binary until upstream adds the field.

3. Pin metrics.listen to 127.0.0.1:0. iron-proxy v0.39 has its own metrics server that defaults to :9090 — the SAME port as our default tunnel_port. Every operator who ran hermes egress setup followed by hermes egress start was hitting "bind: address already in use" immediately. Now we explicitly emit metrics: {listen: 127.0.0.1:0} so the metrics binding gets an ephemeral loopback port that can't collide with tunnel_port regardless of what the operator chose.

4. Propagate handler exit codes in hermes_cli/main.py. Pre-existing bug surfaced during refuse-start testing: args.func(args) at the bottom of main() was discarding the return value, so hermes egress start refusing because of fail_on_uncovered_providers=true (or --from-bitwarden without BWS_ACCESS_TOKEN) was exiting 0 instead of 1. Affected every subcommand that returns an exit code. Fixed by rc = args.func(args); sys.exit(rc) if isinstance(rc, int) and rc != 0.

Live-test results after fixes (all green)

End-to-end run with the real v0.39.0 binary in an isolated HERMES_HOME:

Step Result
egress install Downloads + SHA-256 verifies + extracts to ~/.hermes/bin/iron-proxy (51 MB, mode 755)
egress setup Generates CA (0o600 key), mints 3 provider tokens, writes proxy.yaml the binary actually accepts
egress start Daemon binds 127.0.0.1:9090, listening=yes, pid recorded
egress status Shows binary version 0.39.0 + listening + per-provider token mappings
egress stop Clean shutdown, pidfile + nonce file removed
Idempotent restart Second start returns existing pid without spawning duplicate
Allowlist denial curl --proxy ... https://attacker.example.com/xHTTP 403 (Forbidden)
IMDS deny CIDR curl --proxy ... http://169.254.169.254/...HTTP 403 (Forbidden)
State dir perms 0o700 dir, 0o600 ca.key / proxy.yaml / mappings.json / audit.log, 0o644 ca.crt
Refuse-start: uncovered providers fail_on_uncovered_providers=true + ANTHROPIC_API_KEY set → exit 1 with actionable message
Refuse-start: bitwarden mode credential_source: bitwarden + BWS_ACCESS_TOKEN unset → exit 1 with --no-bitwarden hint
--rotate-tokens via pty Confirmation prompt fires; cancel aborts cleanly; rotate proceeds and creates mappings.json.rotated-<timestamp> backup

Rendered proxy.yaml (verifies all v3-round defenses still in the daemon's config)

dns:
  listen: 127.0.0.1:0
  proxy_ip: 127.0.0.1
proxy:
  http_listen: 127.0.0.1:9090
  https_listen: 127.0.0.1:0
  tunnel_listen: 127.0.0.1:0
  upstream_deny_cidrs:
  - 127.0.0.0/8
  - ::1/128
  - 169.254.0.0/16        # ← IMDS
  - fe80::/10
  - 10.0.0.0/8
  - 172.16.0.0/12
  - 192.168.0.0/16
  - fc00::/7
  - ::ffff:0:0/96         # ← v3: closes IPv4-mapped-v6 IMDS bypass
  - 100.64.0.0/10         # ← v3: CGNAT
  - 198.18.0.0/15         # ← v3: RFC2544 benchmark
metrics:
  listen: 127.0.0.1:0     # ← NEW: prevents :9090 collision
tls:
  ca_cert: .../proxy/ca.crt
  ca_key:  .../proxy/ca.key
transforms:
- name: allowlist
  config:
    domains: [openrouter.ai, ..., inference.nousresearch.com]
- name: secrets
  config:
    secrets:
    - source:
        type: env
        var: OPENROUTER_API_KEY     # ← real keys NOT inlined; sourced from daemon env
      replace:
        proxy_value: openrouter-<128-bit-token>
        match_headers: [Authorization]
      rules:
      - host: openrouter.ai
      - host: '*.openrouter.ai'

Test coverage updates

  • test_default_bind_is_loopback_not_zero_zero — now asserts http_listens (plural) is NOT in the rendered yaml.
  • test_default_bind_uses_loopback_on_linux — replaces test_default_bind_includes_docker_bridge_on_linux. Asserts singular http_listen + no http_listens.
  • test_metrics_listener_pinned_to_loopback_ephemeral — new regression test asserting metrics.listen == "127.0.0.1:0".
  • test_audit_log_kwarg_does_not_inject_audit_path_v039 — replaces test_audit_log_path_lands_in_yaml. Asserts the kwarg is accepted but log.audit_path is NOT emitted.

188/188 in test_iron_proxy.py + test_iron_proxy_cli.py + test_config.py pass. Full tests/hermes_cli/ sweep: 5333 passed, 4 skipped, 2 pre-existing failures unrelated to this PR.

Test isolation

Live testing ran against /tmp/hermes-egress-livetest-XXXXXX with env -i clearing everything except PATH, HOME, HERMES_HOME, and the obvious-fake provider tokens. Live ~/.hermes/config.yaml, ~/.hermes/.env, and ~/.hermes/bin/ were never touched. All test artifacts removed after testing.

Ready for re-review and merge once CI completes on the new head.

…vior

The previous docs round (906b1da) described the integration the way
we wanted it to work — `http_listens` plural with a docker bridge
bind, dedicated `audit.log` for per-request JSON records.  Live
testing against the real v0.39.0 binary in [905ce58] surfaced that
neither field exists in v0.39's config schema, and the docs were
making promises the daemon couldn't keep.

This commit walks every claim in the docs back to what the binary
actually does today, while keeping the upgrade path explicit so the
docs stay coherent when the pinned `_IRON_PROXY_VERSION` bumps:

website/docs/user-guide/egress/iron-proxy.md
- Bind policy section: rewritten.  Was "loopback + docker bridge IP
  on Linux"; now "loopback only" with an explicit explanation that
  v0.39 only supports one bind per daemon and that
  host.docker.internal -> host-gateway mapping is what sandboxes
  use to reach the loopback bind.
- Bind policy section adds a note on the metrics-port pin that the
  previous round of docs didn't even mention.
- State directory layout table: `audit.log` description rewritten
  to acknowledge it's a pre-created sentinel for future binary
  versions, NOT something the v0.39 daemon writes to.
- New section "Logging on iron-proxy v0.39" replaces the old
  "Audit log vs daemon log" section.  Explicitly tells operators
  the daemon log is the single source of truth for both audiences
  on v0.39, with the upgrade path called out.
- Data-flow diagram step 7: rewritten to send per-request records
  to `iron-proxy.log` on v0.39 with cross-link to the new logging
  section.
- Diagram caption updated.
- Security-model "allowlisted-host exfiltration" line: "audit log
  captures" -> "daemon log captures".
- Security-model "LAN peer leak" line: removed the docker-bridge
  claim.
- Troubleshooting section's per-request-inspection recipes:
  rewritten to use `iron-proxy.log` and explain when the split
  stream will land.
- Limitations list gets a new bullet calling out the
  single-bind + combined-log v0.39 constraints + the auto-upgrade
  posture.

website/docs/developer-guide/egress-internals.md
- Bind policy invariant: documents the singular `http_listen` v0.39
  schema constraint + dead-code-until-upgrade status of the
  bridge-bind path.
- New "Metrics port collision" invariant documenting why
  `metrics.listen: 127.0.0.1:0` is non-negotiable.
- Audit log fail-loud invariant adds the v0.39 schema constraint
  note + the new
  `test_audit_log_kwarg_does_not_inject_audit_path_v039` regression
  test.
- "Subscribing to per-request audit events" section updated to
  send watchers at `iron-proxy.log` for v0.39 with the upgrade
  pivot called out.

website/docs/reference/cli-commands.md
- Diagnostic shortcut for tailing the audit log: `tail audit.log |
  jq` -> `tail iron-proxy.log | jq` with the v0.39 note inline.

Build verification:
- `npx docusaurus build` succeeds across all three locales
  (en + zh-Hans + ko).
- New `#logging-on-iron-proxy-v039` anchor lands in the rendered
  HTML and the in-page cross-references resolve.
- No new broken anchors introduced (pre-existing warnings on
  unrelated zh-Hans pages are unchanged).
- No leftover stale `#audit-log-vs-daemon-log` or `#http_listens`
  references anywhere on the egress pages.
@teknium1

Copy link
Copy Markdown
Contributor Author

Docs caught up to the v0.39 behavior

The previous docs round (906b1da) described the integration the way we wanted it to work — proxy.http_listens plural + log.audit_path for a dedicated per-request stream. Live testing showed neither field exists in v0.39's config schema, and the docs were making promises the daemon couldn't keep. Just pushed bde66ef37 walking every claim back to what the binary actually does today, while keeping the upgrade path explicit so the docs stay coherent when we eventually bump _IRON_PROXY_VERSION.

What changed

website/docs/user-guide/egress/iron-proxy.md

  • Bind policy — rewritten. Was "loopback + docker bridge IP on Linux"; now "loopback only" with the explanation that v0.39 only supports one bind per daemon and that the host.docker.internal -> host-gateway mapping is what sandboxes use to reach the loopback bind. Adds a paragraph on the metrics.listen: 127.0.0.1:0 pin (which the previous docs round didn't mention at all).
  • State directory layout tableaudit.log description rewritten to say it's a 0o600 sentinel for future binary versions, not something the v0.39 daemon writes to.
  • New section "Logging on iron-proxy v0.39" replaces the old "Audit log vs daemon log" section. Explicitly tells operators that iron-proxy.log is the single source of truth for both audiences on v0.39, with the upgrade path called out.
  • Data flow diagram step 7 — now sends per-request records to iron-proxy.log with a cross-link to the new logging section.
  • Security model bullets — "audit log captures" → "daemon log captures"; the LAN-peer-leak bullet drops the "+ docker bridge" claim.
  • Troubleshooting — per-request-inspection recipes use iron-proxy.log and the grep field name matches v0.39's actual schema.
  • Limitations — new bullet calling out the v0.39 single-bind + combined-log constraints with the auto-upgrade posture.

website/docs/developer-guide/egress-internals.md

  • Bind policy invariant — documents the singular http_listen v0.39 schema constraint + the dead-code-until-upgrade status of the bridge-bind path.
  • New "Metrics port collision" invariant explaining why metrics.listen: 127.0.0.1:0 is non-negotiable.
  • Audit log invariant — adds the v0.39 schema constraint note + the new test_audit_log_kwarg_does_not_inject_audit_path_v039 regression test.
  • "Subscribing to per-request audit events" — sends watchers at iron-proxy.log for v0.39, with the upgrade pivot.

website/docs/reference/cli-commands.md

  • Diagnostic shortcut: tail audit.log | jqtail iron-proxy.log | jq with the v0.39 note inline.

Build verification

cd website && npx docusaurus build
[SUCCESS] Generated static files in "build".         (en)
[SUCCESS] Generated static files in "build/ko".      (ko)
[SUCCESS] Generated static files in "build/zh-Hans". (zh-Hans)
  • New #logging-on-iron-proxy-v039 anchor lands in the rendered HTML and the in-page cross-references resolve.
  • No new broken anchors introduced (pre-existing warnings on unrelated zh-Hans pages are unchanged).
  • No leftover stale #audit-log-vs-daemon-log or #http_listens references anywhere on the egress pages.

Upgrade-day posture

The docs explicitly tell operators the contract today is:

  • iron-proxy.log = daemon + per-request records (combined on v0.39)
  • audit.log = pre-created 0o600 sentinel, not written by daemon

And tell them what will happen when we bump the pinned version:

  • Per-request stream moves to audit.log automatically
  • No operator-side reconfiguration needed
  • Logrotate / monitoring tooling wired to audit.log today keeps working through the bump

Same for the bind policy — the moment upstream adds proxy.http_listens and we re-enable the list-emit path, the docker bridge bind goes live without any operator action.

Should still be merge-ready; CI on bde66ef will catch up shortly.

@GodsBoy

GodsBoy commented May 30, 2026

Copy link
Copy Markdown
Contributor

Follow-up review on current HEAD (bde66ef37)

Took a fresh pass on the latest state, focused on the surface added after the v3 round: the v0.39 schema realignment (905ce58a1) and the docs alignment (bde66ef37). I read back through the existing threads first so this does not retread anything already raised or fixed. For the record, I rechecked and confirmed these landed cleanly (not re-listing them): the round-one P0s (default deny CIDRs, loopback bind intent, CA key 0o600), @stephenschoettler's docker_env mapped-provider-name collision (now unioned into the enforce check in docker.py), and the cross-process PID nonce (persisted to iron-proxy.nonce, re-read by status/stop with O_NOFOLLOW + uid check).

Four new items, one of them a release blocker.

[P0] The v0.39 single-bind switch drops the docker-bridge listener, so sandboxes can no longer reach the proxy on Linux

905ce58a1 removed the plural http_listens and now emits only http_listen: listens[0]:

  • build_proxy_config (agent/proxy_sources/iron_proxy.py:862-863, 898): listens = _default_http_listen(tunnel_port), then primary_listen = listens[0], and only primary_listen is written into the yaml.
  • _default_http_listen (iron_proxy.py:688-708) still computes [127.0.0.1:PORT, <bridge_ip>:PORT] on Linux, and its docstring still says the bridge bind is there "so containers can reach the proxy via host.docker.internal:host-gateway". That second entry is now silently discarded; only the loopback bind survives into the rendered config.
  • The container is pointed at the bridge, not loopback: docker.py:280 sets HTTPS_PROXY=http://host.docker.internal:{tunnel_port} and docker.py:330 adds --add-host host.docker.internal:host-gateway. On Linux, host-gateway resolves to the docker bridge gateway (172.17.0.1 by default), which a loopback-only daemon never answers.

Net effect: with enforce_on_docker: true (the default), every sandbox on a Linux Docker host gets connection-refused on its first egress call, which the agent inside sees as an endless transient retry. The wizard default is broken on Linux. The end-to-end validation in the commit message ran curl from the host (127.0.0.1), so the container vantage point was never actually exercised, which is why it passed.

v0.39 supports only one bind, so the resolution is a design call (bind the bridge gateway IP as the single Linux listener, run the second daemon the code comment already anticipates, or publish the host port into the container). Whichever route, _default_http_listen and its docstring need to be reconciled with what build_proxy_config actually emits, and the regression test should assert the rendered Linux bind is reachable from the container's vantage point (the bridge IP), not just that it is loopback. test_default_bind_uses_loopback_on_linux currently locks in the broken behavior.

[P2] The allow_env_fallback escape hatch does not work on the partial-secret path, and the error message tells operators to set it anyway

In _build_proxy_subprocess_env, the missing-secret branch raises unconditionally (iron_proxy.py:1760-1769) and its own message says to set proxy.allow_env_fallback: true to opt into the host-env fallback. That branch never checks the flag. Only the empty-token branch right below it honors it (iron_proxy.py:1787). So an operator who hits a partial BWS result and follows the error message's own instruction still cannot start.

Fix: gate the missing-secret raise on not (bitwarden_config or {}).get("allow_env_fallback"), matching the message and the sibling branch.

Worth a look while you are in there: cmd_start computes refresh_bw = credential_source == "bitwarden" and bw_cfg is not None and bw_cfg.get("enabled") (proxy_cli.py:448-452). If a config keeps credential_source: bitwarden but secrets.bitwarden.enabled later becomes false, refresh_bw is False, the pre-checks at 494+ are skipped, and start proceeds on host-env secrets with no warning. That is the same silent-degrade class the strict-mode work was meant to close.

[P2] On the pinned v0.39, audit.log is created, advertised, and documented as a working sink, but the daemon never writes to it

905ce58a1 correctly commented out log.audit_path (v0.39 rejects it), so iron-proxy is spawned with a config that has no audit path and the file stays empty for the life of the pin. The surrounding scaffolding still treats it as live:

  • ensure_audit_log (iron_proxy.py:945+) hard-fails on any OSError pre-creating the file, justified by a docstring that says it must do so "for a file the daemon will create under the default umask". On v0.39 the daemon never touches it, so there is no umask window to defend, but the wizard call at proxy_cli.py:361 still aborts setup if the pre-create fails (immutable parent dir, pre-existing file owned by another uid, full disk).
  • cmd_setup prints an unqualified ✓ audit log: {path} (proxy_cli.py:384).
  • The user guide tells operators to wire monitoring to it now: website/docs/user-guide/egress/iron-proxy.md:325 ("operators can wire downstream tooling to that path today") and :556 ("pre-created today ... so any logrotate / monitoring tooling you wire to that path keeps working").

An operator who builds a logrotate / fluent-bit / forensics pipeline on that path gets a 0-byte stream with no signal that anything is wrong. Suggest: downgrade the pre-create failure to a warning on v0.39 (the file is non-load-bearing until upstream adds the field), qualify the success line (for example "reserved, not written until v0.40+"), and soften the "today / keeps working" language in the docs so it matches the daemon's actual behavior.

[P3] metrics.listen: 127.0.0.1:0 makes metrics undiscoverable, and the code comment says otherwise

Pinning off :9090 to dodge the tunnel_port collision is the right fix, but :0 hands the kernel a fresh random port on every start and nothing records it. ProxyStatus has no metrics field (iron_proxy.py:256) and cmd_status prints no metrics row, so the comment at iron_proxy.py:917-919 ("the daemon still emits metrics for any local scraper that knows where to look (via hermes egress status)") is not accurate. Either drop that clause, or parse the bound metrics port from the daemon's startup log and surface it in status.

Verdict

The P0 bind regression is a merge blocker on Linux Docker, which is the default sandbox path. The two P2s are operability and contract traps that fail silently. The P3 is a one-line comment fix. Everything else I rechecked from the earlier rounds looks good.

teknium1 added 2 commits June 10, 2026 04:38
# Conflicts:
#	hermes_cli/main.py
#	tools/environments/docker.py
…llback gate, audit.log truth

Addresses GodsBoy's May 30 follow-up review on PR #30179.

P0 — Linux sandboxes could not reach the proxy:
- _default_http_listen now binds the docker bridge gateway on Linux
  (host.docker.internal resolves to the bridge gateway there; the old
  loopback-only bind was unreachable from containers). Loopback stays
  the default on Docker Desktop platforms; bridge-less Linux falls back
  to loopback with a warning.
- Live-testing the fix against the real v0.39 binary surfaced a second
  latent bug the host-side E2E had masked: v0.39's http_listen does NOT
  terminate CONNECT — tunnel_listen does. HTTPS_PROXY traffic through
  http_listen got 400s from upstream. build_proxy_config now binds
  tunnel_listen (CONNECT/MITM) on tunnel_port and http_listen
  (plain-HTTP forwards) on tunnel_port+1; docker.py points HTTPS_PROXY
  at tunnel_port and HTTP_PROXY at tunnel_port+1.
- Liveness probes (start_proxy poll loop, get_status) now probe the
  CONFIGURED bind host via _read_http_listen_from_config() instead of
  hardcoded loopback, which would have killed a healthy bridge-bound
  daemon as 'never came up'.

P2 — allow_env_fallback dead on the partial-secret path:
- The missing-secret branch in _build_proxy_subprocess_env now honors
  proxy.allow_env_fallback exactly as its own error message promises.
- cmd_start refuses (or warns, with the fallback flag) when
  credential_source=bitwarden but secrets.bitwarden is disabled/missing
  — closing the silent-degrade-to-host-env hole.

P2 — audit.log decoy on v0.39:
- Wizard pre-create failure downgraded to a warning (file is
  non-load-bearing until the version bump); success line qualified as
  'reserved'; user + dev docs stop telling operators to wire monitoring
  to the path today.

P3 — metrics comment no longer claims 'hermes egress status' surfaces
the ephemeral metrics port (it can't; :0 is random and unrecorded).

Validation: 180 unit tests pass (9 new), gated E2E passes, and a live
end-to-end run against the real binary verified CONNECT-MITM with
Authorization swap on tunnel_listen, plain-HTTP swap on tunnel_port+1,
403 on non-allowlisted hosts, and bind-host-aware status probing.
Docs gain a Linux firewall troubleshooting section (container→docker0
INPUT drops, e.g. ufw default-deny).
@teknium1

Copy link
Copy Markdown
Contributor Author

Thanks @GodsBoy — all four addressed in e2c2e4113 (which also merges current main).

P0 — bridge bind: fixed, and live-testing your finding surfaced a second latent bug

You were right that the loopback-only bind was unreachable from Linux containers, and your "test from the container's vantage point" instruction caught more than the bind:

  1. Bind fix: _default_http_listen now returns the docker bridge gateway as the single Linux bind (loopback stays the default on Docker Desktop; bridge-less Linux falls back to loopback with a warning). Docstring and build_proxy_config are reconciled — no more dead second list entry.
  2. Latent listener-role bug (found by following your repro advice): running real curl through the bridge-bound proxy showed v0.39's http_listen does not terminate CONNECT — it forwards it upstream as a regular request and the upstream 400s. tunnel_listen is the CONNECT/MITM listener. So even with the bind fixed, all HTTPS_PROXY traffic would have failed. build_proxy_config now binds tunnel_listen on tunnel_port (CONNECT/MITM, what HTTPS_PROXY hits) and http_listen on tunnel_port + 1 (plain-HTTP forwards, what HTTP_PROXY hits); docker.py sets the two env vars accordingly.
  3. Probe fix: start_proxy's poll loop and get_status now probe the configured bind host via the new _read_http_listen_from_config() — a hardcoded loopback probe would have killed a healthy bridge-bound daemon as "never came up".
  4. Test that locked in the broken behavior: test_default_bind_uses_loopback_on_linux replaced by test_default_bind_uses_docker_bridge_on_linux + bridge-less fallback + macOS cases, asserting the rendered bind from the container's vantage point.

Live validation against the real v0.39 binary: CONNECT-MITM with Authorization swap verified on tunnel_listen (real upstream, 200), plain-HTTP swap verified on tunnel_port+1 (synthetic upstream saw the real secret, never the proxy token), 403 on non-allowlisted hosts, status probing the bridge host. One honest caveat from the container-vantage run on this box: container→docker0 INPUT was dropped by the host firewall (ufw default-deny), which is environmental, not integration — added a troubleshooting section to the user docs covering exactly that symptom and the ufw allow in on docker0 fix.

P2 — allow_env_fallback on the partial-secret path: fixed

The missing-secret branch now honors proxy.allow_env_fallback exactly as its own error message promises (warn + fall back instead of raise). Your adjacent catch is also closed: cmd_start now refuses to start when credential_source: bitwarden but secrets.bitwarden is disabled/missing — or warns and proceeds when the fallback flag is set. Tests: test_partial_bitwarden_secrets_honor_allow_env_fallback, test_partial_bitwarden_secrets_raise_without_fallback, test_cmd_start_refuses_when_bitwarden_mode_but_disabled, test_cmd_start_bitwarden_disabled_proceeds_with_env_fallback.

P2 — audit.log decoy: fixed

Wizard pre-create failure is now a warning, not a setup abort (the file is non-load-bearing on v0.39); the ✓ line is qualified as "reserved — not written by iron-proxy v0.39"; both docs pages stop telling operators to wire monitoring to that path today and point everything at iron-proxy.log until the version bump. Test: test_cmd_setup_audit_log_failure_is_warning_not_abort.

P3 — metrics comment: fixed

The comment no longer claims hermes egress status surfaces the metrics port; it now states plainly that :0 makes metrics undiscoverable at this pin, with the fix direction (fixed port + surface in ProxyStatus) noted for later.

180 unit tests pass (9 new), gated E2E passes.

teknium1 added a commit that referenced this pull request Jun 10, 2026
…lel.py (#43646)

* fix(ci): append filesystem forensics when a per-file pytest run exhausts exit-4 retries

A PR-added test file (tests/test_iron_proxy.py, PR #30179) repeatedly
failed exactly one CI shard with 'ERROR: file or directory not found'
across 4 runs (including a fresh merge SHA on fresh runners), while the
identical slice passes locally against the same merge commit and a
tree-integrity watcher confirms no sibling test mutates the repo. Three
unrelated branches showed the same one-shard signature the same day.

We currently cannot attribute these because the log only carries
pytest's exit-4 line. This adds a forensics block to the captured
output when exit-4 survives the retry loop:

- does the file exist NOW (post-retries)
- parent dir entry count + similarly-named entries
- git status --porcelain dirty-entry count + first 10 entries

Zero behavior change: rc stays 4, retries unchanged, forensics wrapped
in a broad try/except so they can never mask the failure.

Two new tests cover the exhausted-retries and genuinely-missing paths.

* chore: drop the two forensics tests — ship the runner change only
Bartok9 added a commit to Bartok9/hermes-agent that referenced this pull request Jun 11, 2026
…search#30179

Three composable features on top of the iron-proxy egress firewall:

1. hermes egress doctor — 11-check end-to-end health check with
   --json/--check/--no-network, brew-doctor-style fix-it hints, and
   credential redaction in failure messages.

2. hermes egress audit — structured audit log viewer (tail -f / grep
   --since / stats / export json|csv) with first-time-host and >5%-403
   anomaly detection. Pure functions iter_audit_log / aggregate_audit_stats
   / detect_audit_anomalies for testability.

3. Anthropic native (x-api-key) support via 'hermes egress setup
   --with-anthropic'. Adds a parallel secrets rule matching x-api-key,
   mints a TokenMapping for api.anthropic.com, and removes Anthropic from
   the uncovered/blocked sets when opted in. Off by default. TokenMapping
   gains an auth_header field (back-compatible default 'Authorization').

Tests: +50 (30 doctor, 11 audit, 9 anthropic) + 4 CLI dispatch tests.
Existing iron-proxy suite unchanged (101 -> still pass).
No new dependencies (stdlib + existing pyyaml/rich/openssl only).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend/docker Docker container execution comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants