feat(egress): CA rotation (hermes egress rotate-ca) - follow-up to #30179 (#35188)

Open
Bartok9 wants to merge 1 commit into NousResearch:feat/iron-proxy from Bartok9:bartok9/iron-proxy-ca-rotation

Bartok9 commented May 30, 2026

Contributor

Why this complements #30179

#30179 explicitly cut one corner in its Scope cuts (v1) section:

The CA is a 10-year self-signed cert. Rotation is manual for now.

This PR closes that gap: a first-class, audited rotation workflow so operators never have to hand-run openssl genrsa, and the recovery path is exercised in CI instead of being first attempted under pressure.

A companion host-hardening survey PR (hermes egress harden) is open separately against the same base.

What lands

hermes egress rotate-ca:

  1. Archives the live ca.crt to ~/.hermes/proxy/ca-archive/ca-<YYYYMMDD-HHMMSS>.crt before touching the live cert. Archive dir keeps the 5 most recent; older ones are pruned.
  2. Mints a fresh CA + key via ensure_ca_cert(force=True) — the same atomic 0o600-key write + os.replace path as first-boot generation.
  3. If --no-restart isn't set and the daemon is alive: stop_proxy() then start_proxy(refresh_secrets_from_bitwarden=..., bitwarden_config=...) so the live proxy picks up the new signing key and keeps honouring the operator's existing credential-source choice.
  4. Appends a structured entry to ~/.hermes/proxy/rotation-history.jsonl:
    {"ts": "2026-05-30T05:32:00Z", "old_fingerprint_sha256": "...", "new_fingerprint_sha256": "...", "reason": "annual rotation", "operator": "alice", "subject": "/CN=hermes iron-proxy CA", "valid_until": "2036-05-28T05:32:00Z"}
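
The archive and audit steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `archive_live_cert`, `append_history`, and the `keep` parameter are hypothetical names, and the use of UTC for the timestamp is an assumption. Only the file layout (ca-archive/ca-<YYYYMMDD-HHMMSS>.crt, five retained archives, one JSON object per history line) comes from the description.

```python
import json
import shutil
import time
from pathlib import Path


def archive_live_cert(ca_path: Path, archive_dir: Path, keep: int = 5) -> Path:
    """Copy the live cert into the archive, then prune to the newest `keep`.

    Hypothetical helper: the copy happens before anything touches the live
    cert, so a crash after this point still leaves the old CA recoverable.
    """
    archive_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Assumption: UTC timestamps. The PR only specifies the name pattern.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    target = archive_dir / f"ca-{stamp}.crt"
    shutil.copy2(ca_path, target)
    # Timestamped names sort chronologically, so a plain name sort is enough.
    archives = sorted(archive_dir.glob("ca-*.crt"))
    for stale in archives[:-keep]:
        stale.unlink()
    return target


def append_history(history_path: Path, entry: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
```

Opening the history file in append mode keeps the audit trail append-only from the writer's side; a reader still has to tolerate a torn final line if the process dies mid-write.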

Flags:

| Flag | Effect |
| --- | --- |
| `--dry-run` | Print the new subject, the archive path, and the sandboxes that would need restarting, without changing a single file. |
| `--reason TEXT` | Free-text reason recorded in rotation-history.jsonl (defaults to null). |
| `--no-restart` | Mint + archive but leave the daemon untouched. Staged rollouts. |
| `--force` | Skip the interactive "you have running sandboxes" confirmation. |

hermes egress rotate-ca lists containers labeled hermes.sandbox=true and prompts by default when any are running. Containers mount ca.crt at create time, so a sandbox started before the rotation keeps trusting the old CA until it is recreated.
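
Labeled-sandbox discovery of this kind could look like the sketch below. It assumes the lookup shells out to the docker CLI with a label filter; the function name `list_labeled_sandboxes`, the injectable `runner` parameter, and the 10-second timeout are hypothetical, and the PR's own list_hermes_sandboxes() may be implemented differently.

```python
import subprocess


def list_labeled_sandboxes(runner=subprocess.run) -> list[str]:
    """Return names of running containers labeled hermes.sandbox=true.

    Returns [] when docker is missing, exits non-zero, or times out,
    so callers never need a try/except around discovery.
    """
    cmd = [
        "docker", "ps",
        "--filter", "label=hermes.sandbox=true",
        "--format", "{{.Names}}",
    ]
    try:
        result = runner(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        # OSError covers a missing docker binary; SubprocessError covers timeouts.
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Passing the runner in as a parameter is only there so the function can be exercised without a docker daemon.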

New surfaces

| Surface | Kind | Notes |
| --- | --- | --- |
| `hermes egress rotate-ca` | CLI command | `--dry-run`, `--reason`, `--no-restart`, `--force` |
| `rotate_ca()` / `plan_ca_rotation()` | public API | exported in `__all__` |
| `RotationPlan` | dataclass | shared by dry-run preview and the real path |
| `last_rotation_entry()` / `list_hermes_sandboxes()` | public API | history read + labeled-sandbox discovery |
| `~/.hermes/proxy/rotation-history.jsonl` | on-disk artifact | append-only audit trail (0o644) |
| `~/.hermes/proxy/ca-archive/` | on-disk artifact | 0o700, last-5 retention |

Failure modes considered

| Mode | Handling |
| --- | --- |
| Daemon not running | Restart skipped silently; archive + mint still happen |
| No docker / no sandboxes | `list_hermes_sandboxes()` returns [], never raises |
| openssl fingerprint failure | Redacted first-8/last-8 via `_redact_fingerprint`; rotation still completes (records null fingerprint) |
| Crash mid-rotation | Old CA is archived before the live cert is replaced; mint is atomic (`os.replace`) |
| Running sandboxes trust the old CA | rotate-ca lists them and prompts by default (bypass with `--force`); console output tells the operator to restart them |
| `--dry-run` | `plan_ca_rotation()` is read-only; verified by stat-before/after in tests |
| `--no-restart` | Daemon left on the old key (staged rollout); not restarted |
| Corrupt/partial rotation-history.jsonl line | `last_rotation_entry()` scans from the end, skips unparseable lines |
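
The last row describes a tolerant history read. A minimal sketch of that scan, assuming one JSON object per line, is shown here; `last_valid_entry` is a hypothetical stand-in and may differ from how last_rotation_entry() is actually written.

```python
import json
from pathlib import Path
from typing import Optional


def last_valid_entry(history_path: Path) -> Optional[dict]:
    """Return the newest parseable JSON object in the file, or None."""
    try:
        text = history_path.read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None  # no history yet, or the file is unreadable
    for line in reversed(text.splitlines()):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # torn or corrupt line; keep scanning backwards
        if isinstance(entry, dict):
            return entry
    return None
```

Scanning from the end means a single torn final line (for example, from a crash during append) costs only that one entry rather than the whole read.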

Bugbot review fixes baked in

🟡 Medium severity — daemon restart silently degraded Bitwarden-mode proxies

The original implementation called start_proxy() bare in the rotation restart path. Since start_proxy defaults refresh_secrets_from_bitwarden=False and bitwarden_config=None, an operator on credential_source: bitwarden would have come back up with a proxy that's running but unable to inject credentials — the worst kind of silent degradation. Caught by Cursor Bugbot on the original cross-fork iteration.

Fix: rotate_ca() now accepts refresh_secrets_from_bitwarden and bitwarden_config kwargs and forwards them to start_proxy(). cmd_rotate_ca computes these from proxy_cfg identically to cmd_start (credential_source: bitwarden + secrets.bitwarden.enabled → propagate). Regression test added: test_rotate_passes_bitwarden_args_through_to_start_proxy.
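
The propagation rule can be illustrated with a small sketch. The config shape here is an assumption: it treats proxy_cfg as a dict with a credential_source key and a nested secrets.bitwarden mapping, and it passes that nested mapping through as bitwarden_config. The helper name `restart_kwargs` is hypothetical; only the two keyword names come from the PR.

```python
from typing import Any


def restart_kwargs(proxy_cfg: dict[str, Any]) -> dict[str, Any]:
    """Derive the start_proxy() keyword arguments from the proxy config.

    Assumed layout: {"credential_source": ..., "secrets": {"bitwarden": {...}}}.
    """
    bw_cfg = (proxy_cfg.get("secrets") or {}).get("bitwarden") or {}
    use_bitwarden = (
        proxy_cfg.get("credential_source") == "bitwarden"
        and bool(bw_cfg.get("enabled"))
    )
    return {
        "refresh_secrets_from_bitwarden": use_bitwarden,
        "bitwarden_config": bw_cfg if use_bitwarden else None,
    }
```

Computing the kwargs in one place and reusing the result for both the start and the rotation restart is one way to keep the two call sites from drifting apart again.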

🟢 Low severity — duplicated _ca_not_before / _ca_not_after

Bugbot flagged a ~25-line duplication between two openssl-shelling date helpers. Resolved by simple elision — _ca_not_before was only used by the companion doctor check that's deferred to a follow-up PR (it needs a DoctorCheck type from a separate stack). _ca_not_after stays as the single date helper in this PR.

Validation

$ pytest tests/test_iron_proxy*.py -q
109 passed, 1 skipped in 3.88s

The 1 skip is the existing E2E test gated behind HERMES_RUN_E2E=1 (unchanged).

| Metric | Before (this PR's base = feat/iron-proxy) | After |
| --- | --- | --- |
| iron-proxy suite | 82 passed, 1 skipped | 109 passed, 1 skipped |
| New tests | | 8 (tests/test_iron_proxy_rotate_ca.py) |

8/8 in the new test file:

  • rotate with no daemon: writes + archives, no stop_proxy / start_proxy called
  • rotate with running daemon: stop_proxy then start_proxy, in order
  • --dry-run: zero mtime changes (filesystem stat-before/after)
  • --reason recorded with valid 64-hex fingerprints (old ≠ new)
  • archive pruning: 7 archives in → only the 5 most recent retained
  • --no-restart: daemon left alone (no stop/start calls)
  • regression: bitwarden kwargs propagated through rotate_ca to start_proxy
  • fingerprint redaction: failures in fingerprint reads never echo a full SHA-256 in error messages — first-8/last-8 only
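
The first-8/last-8 redaction in the final bullet can be sketched as below. `redact_fingerprint_sketch` is a hypothetical stand-in for the PR's _redact_fingerprint, and the colon stripping assumes the colon-separated form that openssl prints for fingerprints.

```python
def redact_fingerprint_sketch(fingerprint: str) -> str:
    """Keep the first and last 8 hex characters and mask everything between."""
    hex_only = fingerprint.replace(":", "").strip().lower()
    if len(hex_only) <= 16:
        # Too short to redact meaningfully; reveal nothing at all.
        return "<redacted>"
    return f"{hex_only[:8]}...{hex_only[-8:]}"
```

Sixteen visible hex characters are enough for an operator to match a log line against a known fingerprint without the full value ever appearing in an error message.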

Manual end-to-end smoke (isolated HERMES_HOME): --dry-run changed nothing; real rotate-ca --reason "annual rotation" archived the old cert, minted a 10-year CA, and wrote a parseable history line.
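
A stat-before/after check like the one used to verify --dry-run can be approximated with a tree snapshot. This is a sketch of the general technique, not the PR's test code; `snapshot_tree` is a hypothetical helper.

```python
import os
from pathlib import Path


def snapshot_tree(root: Path) -> dict[str, tuple[int, int]]:
    """Map each path under root to (mtime_ns, size) for later comparison."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat so a dangling symlink does not raise
            state[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return state
```

A test would take one snapshot before calling the read-only planning path and one after, then assert the two dicts are equal; any created, deleted, or rewritten file shows up as a difference.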

Coverage gaps

  • No verify-trust subcommand (verifying live sandboxes still trust the new CA after restart). Deliberately deferred — documented as a known follow-up in the user guide.
  • Interactive confirmation prompt path is exercised manually, not unit-tested (the underlying rotate_ca engine is fully tested, the prompt is a thin CLI wrapper).
  • Companion ca-rotation doctor check that warns on rotation age depends on a DoctorCheck type that lives on a separate stacked branch (broader doctor / audit work). Lands once that work merges upstream.

Ambiguity flags

  1. Bitwarden propagation surface — added two kwargs (refresh_secrets_from_bitwarden, bitwarden_config) to rotate_ca() rather than introducing a RotationOptions dataclass. Two args, mirrors start_proxy signature, keeps the call site readable. If you want a single options object, that's a refactor we can do here or later.
  2. Archive retention = 5 — picked as a balance between "useful audit history" and "filesystem clutter." Controlled by _CA_ARCHIVE_KEEP constant.
  3. Running sandbox prompt is interactive — uses input(), which means non-TTY callers (CI, scripts) have to pass --force. Documented in the help text. Open to flipping to "default proceed, --prompt to ask" if that fits the project conventions better.
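
For item 3, one possible shape of the confirmation gate is sketched here. Everything in it is an assumption for discussion: the name `confirm_rotation`, the injectable `ask` and `is_tty` parameters, the prompt wording, and in particular the choice to refuse on a non-TTY without --force rather than block on input().

```python
import sys
from typing import Callable, Sequence


def confirm_rotation(
    sandboxes: Sequence[str],
    force: bool = False,
    ask: Callable[[str], str] = input,
    is_tty: Callable[[], bool] = lambda: bool(sys.stdin and sys.stdin.isatty()),
) -> bool:
    """Decide whether rotation may proceed while sandboxes are running."""
    if force or not sandboxes:
        return True
    if not is_tty():
        # Assumed behavior: a script without --force gets a refusal, not a hang.
        return False
    answer = ask(
        f"{len(sandboxes)} sandbox(es) still trust the old CA. Rotate anyway? [y/N] "
    )
    return answer.strip().lower() in {"y", "yes"}
```

Flipping to "default proceed, --prompt to ask" would mostly amount to inverting the `force` default, so the gate itself would not need to change much.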

Attribution

Opened by Bartok9 (Daniel Pike) at Daniel's request, in response to Teknium's invitation on X to @catalinmpit for a security review of #30179. Teknium's prompt was essentially "can we get a security review of how Hermes' egress proxy holds up in a real deployment?" — and this PR turns the manual openssl rotation step #30179 mentions into a first-class audited command.

Closes the v1 scope cut in PR NousResearch#30179 — "the CA is a 10-year self-signed
cert. Rotation is manual for now."

`hermes egress rotate-ca`:

  1. Archives the live ca.crt to ~/.hermes/proxy/ca-archive/ca-<stamp>.crt
     BEFORE replacing it, so a crash mid-rotation never leaves a missing
     live cert. Archive dir keeps the 5 most recent; older pruned.
  2. Mints a fresh CA + key via ensure_ca_cert(force=True) — same atomic
     0o600-key + os.replace path as first-boot generation.
  3. If the daemon is running and --no-restart isn't set: stop_proxy()
     then start_proxy(refresh_secrets_from_bitwarden=..., bitwarden_config=...)
     so the live proxy picks up the new signing key AND keeps honouring
     the operator's existing credential-source choice.
  4. Appends a structured record to ~/.hermes/proxy/rotation-history.jsonl
     (ts / old+new SHA-256 fingerprints / reason / operator / subject /
     valid_until).

Flags:
  --dry-run    preview (new subject, archive path, sandboxes to restart)
               with no file changes
  --reason     free-text reason recorded in rotation-history.jsonl
  --no-restart mint + archive but leave the daemon on the old key
               (staged rollouts)
  --force      skip the running-sandbox confirmation prompt

The companion `ca-rotation` doctor check that warns on rotation age and
a `verify-trust` subcommand that inspects sandbox CAs are documented as
planned follow-ups — they land once the broader `doctor` / `audit` work
(currently on a separate stacked branch as a different PR) merges
upstream.

Stdlib-only. No new dependencies.

Bugbot review fixes baked in (caught on the original fork PR, retargeted
to upstream here as one clean commit):

  * Daemon restart after rotation now propagates the operator's
    credential-source choice (refresh_secrets_from_bitwarden, bitwarden_config)
    from config.yaml through rotate_ca to start_proxy. A bare start_proxy()
    call would have silently degraded a Bitwarden-mode proxy to "running
    but unable to inject credentials" — caught by Cursor Bugbot (medium
    severity). Regression test added:
    test_rotate_passes_bitwarden_args_through_to_start_proxy.

  * The duplicated _ca_not_before / _ca_not_after pair that Bugbot
    flagged (low severity / refactor) is resolved by simple elision —
    _ca_not_before was only used by the companion doctor check that's
    deferred to the doctor-stack follow-up, so it doesn't ship here at
    all. _ca_not_after stays as the single openssl-shelling date helper.

Validation:
  109 passed, 1 skipped in 3.88 s (tests/test_iron_proxy*.py — full
  iron-proxy suite, the skip is the E2E test that needs HERMES_RUN_E2E=1)

  8/8 in tests/test_iron_proxy_rotate_ca.py specifically:
    - rotate with no daemon: writes + archives, no stop/start called
    - rotate with running daemon: stop_proxy then start_proxy in order
    - --dry-run: zero mtime changes (filesystem stat-before/after)
    - --reason recorded with valid 64-hex fingerprints (old != new)
    - archive pruning: 7 archives in → only the 5 most recent retained
    - --no-restart: daemon left alone (no stop/start calls)
    - bitwarden args propagated through rotate_ca to start_proxy
      (regression test for Bugbot finding)
    - fingerprint redaction: failures in fingerprint reads never echo
      a full SHA-256 in error messages — first-8/last-8 only

Author: Bartok9 (Daniel Pike), opened in response to Teknium's invitation
to @catalinmpit for a security review of NousResearch#30179. Complements NousResearch#30179 by
turning the manual openssl-by-hand rotation step into a first-class
audited command.
alt-glitch added the type/feature, P3, comp/cli, and area/docker labels May 30, 2026

Bartok9 commented Jun 8, 2026

Contributor Author

Polish — improved description bullets

• hermes egress rotate-ca: production-grade CA rotation with archive-before-replace safety, structured rotation history JSONL, and --dry-run / --no-restart / --reason flags
• Automatically restarts the daemon when needed, correctly propagating the operator’s Bitwarden credential-source choice (Bugbot regression test included)
• Archives the 5 most recent CAs; prunes older ones; never leaves a missing live cert even on mid-rotation crash
• Closes the manual-rotation scope cut from #30179; 109 iron-proxy tests + 8 dedicated rotation tests
• Fingerprint redaction in errors; full Bitwarden arg propagation verified; daemon restart sequencing hardened
