fix(gateway): enable faulthandler so C-extension crashes leave a traceback (#25666) by wesleysimplicio · Pull Request #25794 · NousResearch/hermes-agent

wesleysimplicio · 2026-05-14T15:08:52Z

What does this PR do?

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

Root cause

The detailed rationale from the original PR body is preserved below. This template update keeps the review structure consistent with #29640.

Fix

New helper hermes_cli.gateway._enable_faulthandler_for_gateway() enables faulthandler at gateway startup.
run_gateway() calls it right after the docker-root guard and sys.path setup, so any C-extension import that follows is covered.
Opt-out via HERMES_DISABLE_FAULTHANDLER=1 for the rare platforms where faulthandler itself is unstable (accepts 1, true, yes, case-insensitive).
Best-effort: any exception during enable() is silently swallowed so an environment without writable stderr can't break gateway startup.
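
For reviewers who want the shape of the helper at a glance, here is a minimal sketch of the behavior listed above. It is not the code on the branch: the function name `enable_faulthandler_sketch`, the `environ` parameter, and the boolean return value are assumptions made for illustration. Only the `HERMES_DISABLE_FAULTHANDLER` variable, its accepted values, `all_threads=True`, and the swallow-on-failure behavior come from this description.

```python
import faulthandler
import os
import sys

# Values that opt out, compared case-insensitively (per the PR description).
_TRUTHY = {"1", "true", "yes"}


def enable_faulthandler_sketch(environ=None):
    """Hypothetical stand-in for the gateway helper.

    Returns True if faulthandler was enabled, False if it was skipped
    (opt-out) or could not be enabled.
    """
    env = os.environ if environ is None else environ

    # Opt-out for platforms where faulthandler itself misbehaves.
    if env.get("HERMES_DISABLE_FAULTHANDLER", "").lower() in _TRUTHY:
        return False

    try:
        # all_threads=True dumps every thread's Python frames on a fatal signal.
        faulthandler.enable(file=sys.stderr, all_threads=True)
    except Exception:
        # Best-effort: stderr may be missing or have no usable file descriptor.
        return False
    return True
```

Passing a plain dict as `environ` keeps the opt-out logic checkable without touching the real process environment.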

Why this shape

This shape mirrors #29640 so reviewers can quickly compare scope, root cause, fix, tests, and related context without having to decode a custom PR description.

Tests

See the original description preserved below for validation details, tests, and verification notes.

Related PRs / issues

Fixes #25666

Original body

Summary

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

What Changed

Standardized this PR body to the current Hermes Turbo template.
Preserved the original detailed description below for reference.

Flow

The change still follows the original flow described in the preserved section below, without expanding the functional scope of this PR.

Overview

The standardization improves review, reduces noise, and prevents formatting drift across open PRs.

Test Plan

See the original description preserved below for validation details, tests, and verification notes.

Original body

What does this PR do?

Summary

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

This PR doesn't fix the SIGSEGV. It makes the next SIGSEGV self-document so it can be fixed.

Why a diagnostic-only PR

The root cause is almost certainly in a transitive C extension (httpx → h11/h2 → openssl/cryptography, or grpc, or PyO3-based crypto on aarch64). Without a stack trace, picking the right upstream package to file against is guessing. With faulthandler enabled, the next crash gives:

Python frame at the crash site (which call into the C lib triggered it)
C-level signal type (SEGV / BUS / FPE / etc.)
All-threads dump (all_threads=True) so a thread crashing while another holds the GIL is visible

That's enough to turn this issue from "Telegram gateway crashes silently" into a pinpoint actionable report.
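
To make that concrete, the following sketch shows roughly what a crash leaves behind once faulthandler is on. It assumes a POSIX system and simulates the native crash by sending SIGSEGV to a child interpreter; the `CHILD` script and `run_crashing_child` name are illustrative and not part of this PR.

```python
import subprocess
import sys

# Child program: enable faulthandler, then simulate a crash in native code
# by delivering SIGSEGV to itself (POSIX only).
CHILD = """
import faulthandler, os, signal
faulthandler.enable(all_threads=True)
os.kill(os.getpid(), signal.SIGSEGV)
"""


def run_crashing_child():
    """Run the child in a fresh interpreter and capture what it writes."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )


result = run_crashing_child()
# stderr names the signal and lists the Python frames of each thread.
print(result.returncode)
print(result.stderr)
```

Without the `faulthandler.enable(...)` line the child dies with the same signal but writes nothing to stderr, which matches the silent crash in the report.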

Fix

New helper hermes_cli.gateway._enable_faulthandler_for_gateway() enables faulthandler at gateway startup.
run_gateway() calls it right after the docker-root guard and sys.path setup, so any C-extension import that follows is covered.
Opt-out via HERMES_DISABLE_FAULTHANDLER=1 for the rare platforms where faulthandler itself is unstable (accepts 1, true, yes, case-insensitive).
Best-effort: any exception during enable() is silently swallowed so an environment without writable stderr can't break gateway startup.

Solution sketch

flowchart TD
    A[hermes -p P gateway run] --> B[run_gateway]
    B --> C[_enable_faulthandler_for_gateway]
    C --> D{HERMES_DISABLE_FAULTHANDLER set?}
    D -- yes --> E[skip]
    D -- no --> F[faulthandler.enable all_threads=True]
    F --> G[gateway boot continues]
    G --> H[Some C extension crashes later]
    H --> I[faulthandler dumps Python traceback to stderr]
    I --> J[journald captures it -> actionable bug report]

Tests

python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q
# 3 passed

3 new cases on the new TestEnableFaulthandler class:

test_enable_returns_true_when_module_available — happy path.
test_opt_out_skips_enable — env var disables it.
test_opt_out_accepts_common_truthy_values — 1, true, yes (case-insensitive) all work.

Risk

The faulthandler module is stdlib and stable on the platforms Hermes targets.
The call is guarded with try/except: pass so it can't break gateway startup even on a broken interpreter.
Behaviour with the opt-out env set is bit-for-bit identical to before.
The faulthandler output goes to stderr only on a fatal signal — no overhead in normal operation.

Duplicate check

gh pr list --state open --search "25666 in:body,title" → 0
gh pr list --state open --search "faulthandler" → 0
gh pr list --state open --search "SIGSEGV" → 0
gh pr list --search "faulthandler is:merged" → 0

Funnel discipline

Opened under doc 23. Diagnostic-only contributions for hard-to-reproduce bugs are explicitly endorsed in the doc as a legitimate funnel category — see the #22388 reference pattern in hermes/23-funnel-discipline.md.

Related Issue

Fixes #25666

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

preserved the existing technical rationale and validation notes inside the template body
scoped this PR description to the implementation already present on the branch
aligned the delivery format with .github/PULL_REQUEST_TEMPLATE.md

How to Test

Run python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q.
Confirm the scoped behavior described above still holds after the focused checks.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform:

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

N/A.

Generated by Hermes Turbo

@rab1dd0g

…eback (NousResearch#25666)

Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on Raspberry Pi aarch64, with the last visible output being a `gateway.platforms.telegram` httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV, which makes the bug essentially undiagnosable without more data.

The Python standard library has `faulthandler` for exactly this: when enabled, the next SIGSEGV / SIGFPE / SIGABRT / SIGBUS / SIGILL dumps the Python traceback (and, with `all_threads=True`, every thread's frames) to stderr before the process dies. journald captures stderr by default, so the next crash will leave actionable detail in the operator's logs.

Changes:
- New helper `hermes_cli.gateway._enable_faulthandler_for_gateway()` enables faulthandler once at gateway startup.
- `run_gateway()` calls it right after the docker-root guard and `sys.path` setup, so any C-extension import that follows is covered.
- Opt-out via `HERMES_DISABLE_FAULTHANDLER=1` for the rare platforms where faulthandler itself is unstable.
- Best-effort: any exception during `enable()` is silently swallowed so an environment without writable stderr can't break gateway startup.

This is a diagnostic fix, not a root-cause fix. The actual SIGSEGV can only be pinpointed with the traceback this PR captures; it is almost certainly in httpx/openssl/cryptography or a transitive C extension that's flaky on aarch64. Landing this gives the reporter (and any future user hitting the same crash) the trace they need to file an upstream issue against the right package.

Tests:
- `test_enable_returns_true_when_module_available`: happy path.
- `test_opt_out_skips_enable`: env var disables it.
- `test_opt_out_accepts_common_truthy_values`: `1`, `true`, `yes` (case-insensitive) all work.

`python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q` -> 3 passed.

Refs NousResearch#25666.

rab1dd0g · 2026-05-14T20:10:31Z

Thanks for putting this together.

Happy to test this branch on the architecture if that would be useful. Otherwise I’ll wait for it to merge, and capture the next gateway crash output with faulthandler enabled.

Copilot AI review requested due to automatic review settings May 14, 2026 15:08

alt-glitch added labels type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/gateway (Gateway runner, session dispatch, delivery) on May 14, 2026

Copilot AI reviewed May 14, 2026

wesleysimplicio wants to merge 1 commit into NousResearch:main from wesleysimplicio:codex/fix-gateway-faulthandler-segv-diag
