fix(gateway): enable faulthandler so C-extension crashes leave a traceback (#25666) #25794

Open
wesleysimplicio wants to merge 1 commit into NousResearch:main from wesleysimplicio:codex/fix-gateway-faulthandler-segv-diag

@wesleysimplicio wesleysimplicio commented May 14, 2026

Contributor

What does this PR do?

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

Root cause

The detailed rationale from the original PR body is preserved below. This template update keeps the review structure consistent with #29640.

Fix

  • New helper hermes_cli.gateway._enable_faulthandler_for_gateway() enables faulthandler at gateway startup.
  • run_gateway() calls it right after the docker-root guard and sys.path setup, so any C-extension import that follows is covered.
  • Opt-out via HERMES_DISABLE_FAULTHANDLER=1 for the rare platforms where faulthandler itself is unstable (accepts 1, true, yes, case-insensitive).
  • Best-effort: any exception during enable() is silently swallowed so an environment without writable stderr can't break gateway startup.

Why this shape

This shape mirrors #29640 so reviewers can quickly compare scope, root cause, fix, tests, and related context without having to decode a custom PR description.

Tests

  • See the original description preserved below for validation details, tests, and verification notes.
Original body

Related PRs / issues

Fixes #25666

Original body

Summary

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

What Changed

  • Standardized this PR body to the current Hermes Turbo template.
  • Preserved the original detailed description below for reference.

Flow

The change still follows the original flow described in the preserved section below, without widening the functional scope of this PR.

Overview

Standardizing the format makes review easier, reduces noise, and prevents formatting drift across open PRs.

Test Plan

  • See the original description preserved below for validation details, tests, and verification notes.
Original body

What does this PR do?

Summary

Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring status=11/SEGV (exit 139) on Raspberry Pi aarch64, with the last visible output being a gateway.platforms.telegram httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV — making the bug essentially un-diagnosable without more data.

This PR doesn't fix the SIGSEGV. It makes the next SIGSEGV self-document so it can be fixed.

Why a diagnostic-only PR

The root cause is almost certainly in a transitive C extension (httpx → h11/h2 → openssl/cryptography, or grpc, or PyO3-based crypto on aarch64). Without a stack trace, picking the right upstream package to file against is guessing. With faulthandler enabled, the next crash gives:

  • Python frame at the crash site (which call into the C lib triggered it)
  • C-level signal type (SEGV / BUS / FPE / etc.)
  • All-threads dump (all_threads=True) so a thread crashing while another holds the GIL is visible

That's enough to turn this issue from "Telegram gateway crashes silently" into a pinpoint actionable report.
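To see what such a dump looks like without waiting for a real crash, the following minimal sketch forces a segfault in a child interpreter and captures its stderr. It uses `ctypes.string_at(0)`, the NULL-pointer read shown in the faulthandler documentation. `crash_and_capture` and `CRASHING_PROGRAM` are hypothetical names used only for illustration and are not part of this PR.

```python
import subprocess
import sys

# Child program: enable faulthandler, then read from address 0 through ctypes.
# The crash happens in the child only, so the calling process is unaffected.
CRASHING_PROGRAM = (
    "import ctypes, faulthandler\n"
    "faulthandler.enable(all_threads=True)\n"
    "ctypes.string_at(0)\n"
)


def crash_and_capture() -> tuple[int, str]:
    """Run the crashing child and return (exit status, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CRASHING_PROGRAM],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    status, dump = crash_and_capture()
    print("exit status:", status)
    print(dump)
```

On Linux the captured text typically begins with a `Fatal Python error` line naming the signal, followed by the Python stack of each thread. That is the material a reporter would paste into the issue.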

Fix

  • New helper hermes_cli.gateway._enable_faulthandler_for_gateway() enables faulthandler at gateway startup.
  • run_gateway() calls it right after the docker-root guard and sys.path setup, so any C-extension import that follows is covered.
  • Opt-out via HERMES_DISABLE_FAULTHANDLER=1 for the rare platforms where faulthandler itself is unstable (accepts 1, true, yes, case-insensitive).
  • Best-effort: any exception during enable() is silently swallowed so an environment without writable stderr can't break gateway startup.

Solution sketch

flowchart TD
    A[hermes -p P gateway run] --> B[run_gateway]
    B --> C[_enable_faulthandler_for_gateway]
    C --> D{HERMES_DISABLE_FAULTHANDLER set?}
    D -- yes --> E[skip]
    D -- no --> F[faulthandler.enable all_threads=True]
    F --> G[gateway boot continues]
    G --> H[Some C extension crashes later]
    H --> I[faulthandler dumps Python traceback to stderr]
    I --> J[journald captures it -> actionable bug report]
Tests

python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q
# 3 passed

3 new cases on the new TestEnableFaulthandler class:

  • test_enable_returns_true_when_module_available: happy path.
  • test_opt_out_skips_enable: env var disables it.
  • test_opt_out_accepts_common_truthy_values: 1, true, yes (case-insensitive) all work.
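The three cases could be written roughly as below. This is a sketch, not the PR's test file: `enable_faulthandler` is a hypothetical stand-in for `_enable_faulthandler_for_gateway()`, and `faulthandler.enable` is mocked so the tests do not depend on the runner's stderr.

```python
import faulthandler
import os
import unittest
from unittest import mock

ENV_VAR = "HERMES_DISABLE_FAULTHANDLER"


def enable_faulthandler() -> bool:
    """Hypothetical stand-in for the gateway helper under test."""
    if os.environ.get(ENV_VAR, "").strip().lower() in {"1", "true", "yes"}:
        return False
    try:
        faulthandler.enable(all_threads=True)
    except Exception:
        return False
    return True


class TestEnableFaulthandler(unittest.TestCase):
    def test_enable_returns_true_when_module_available(self):
        # patch.dict restores os.environ on exit, including the popped key.
        with mock.patch.dict(os.environ), \
                mock.patch.object(faulthandler, "enable") as enable:
            os.environ.pop(ENV_VAR, None)
            self.assertTrue(enable_faulthandler())
            enable.assert_called_once_with(all_threads=True)

    def test_opt_out_skips_enable(self):
        with mock.patch.dict(os.environ, {ENV_VAR: "1"}), \
                mock.patch.object(faulthandler, "enable") as enable:
            self.assertFalse(enable_faulthandler())
            enable.assert_not_called()

    def test_opt_out_accepts_common_truthy_values(self):
        for value in ("1", "true", "TRUE", "Yes"):
            with self.subTest(value=value), \
                    mock.patch.dict(os.environ, {ENV_VAR: value}), \
                    mock.patch.object(faulthandler, "enable") as enable:
                self.assertFalse(enable_faulthandler())
                enable.assert_not_called()


if __name__ == "__main__":
    unittest.main()
```

Mocking `faulthandler.enable` keeps the happy-path assertion deterministic; the real test class may instead check `faulthandler.is_enabled()` or use pytest's `monkeypatch`.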

Risk

  • The faulthandler module is stdlib and stable on the platforms Hermes targets.
  • The call is guarded with try/except: pass so it can't break gateway startup even on a broken interpreter.
  • Behaviour with the opt-out env set is bit-for-bit identical to before.
  • The faulthandler output goes to stderr only on a fatal signal — no overhead in normal operation.

Duplicate check

  • gh pr list --state open --search "25666 in:body,title" → 0
  • gh pr list --state open --search "faulthandler" → 0
  • gh pr list --state open --search "SIGSEGV" → 0
  • gh pr list --search "faulthandler is:merged" → 0

Funnel discipline

Opened under doc 23. Diagnostic-only contributions for hard-to-reproduce bugs are explicitly endorsed in the doc as a legitimate funnel category — see the #22388 reference pattern in hermes/23-funnel-discipline.md.

Related Issue

Fixes #25666

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • preserved the existing technical rationale and validation notes inside the template body
  • scoped this PR description to the implementation already present on the branch
  • aligned the delivery format with .github/PULL_REQUEST_TEMPLATE.md

How to Test

  1. Run python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q.
  2. Confirm the scoped behavior described above still holds after the focused checks.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform:

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • N/A.

Generated by Hermes Turbo


…eback (NousResearch#25666)

Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on
Raspberry Pi aarch64, with the last visible output being a
`gateway.platforms.telegram` httpx.ReadError reconnect loop. No
Python traceback was captured before the crash, so neither the
reporter nor maintainers can tell which C-extension call frame
triggered the SIGSEGV — making the bug essentially un-diagnosable
without more data.

The Python standard library has `faulthandler` exactly for this:
when enabled, the next SIGSEGV / SIGFPE / SIGABRT / SIGBUS / SIGILL
dumps the Python traceback (and, with `all_threads=True`, every
thread's frames) to stderr before the process dies. Native C frames
are not guaranteed to appear, but the innermost Python frame
identifies the call that entered the C extension. journald
captures stderr by default, so the next crash will leave actionable
detail in the operator's logs.

Changes:

- New helper `hermes_cli.gateway._enable_faulthandler_for_gateway()`
  enables faulthandler once at gateway startup.
- `run_gateway()` calls it right after the docker-root guard and
  `sys.path` setup, so any C-extension import that follows is
  covered.
- Opt-out via `HERMES_DISABLE_FAULTHANDLER=1` for the rare
  platforms where faulthandler itself is unstable.
- Best-effort: any exception during `enable()` is silently
  swallowed so an environment without writable stderr can't break
  gateway startup.

This is a diagnostic fix, not a root-cause fix. The actual SIGSEGV
needs the traceback this PR captures to be pinpointed — almost
certainly in httpx/openssl/cryptography or a transitive C extension
that's flaky on aarch64. Landing this gives the reporter (and any
future user hitting the same crash) the trace they need to file an
upstream issue against the right package.

Tests:
- `test_enable_returns_true_when_module_available` — happy path.
- `test_opt_out_skips_enable` — env var disables it.
- `test_opt_out_accepts_common_truthy_values` — `1`, `true`,
  `yes` (case-insensitive) all work.

`python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q` -> 3 passed.

Refs NousResearch#25666.
Copilot AI review requested due to automatic review settings May 14, 2026 15:08
@alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/gateway (Gateway runner, session dispatch, delivery) labels May 14, 2026

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@rab1dd0g

Thanks for putting this together.

Happy to test this branch on the architecture if that would be useful. Otherwise I’ll wait for it to merge, and capture the next gateway crash output with faulthandler enabled.

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram gateway SIGSEGV / exit 139 during httpx.ReadError reconnect loop on Raspberry Pi aarch64

4 participants