fix(gateway): enable faulthandler so C-extension crashes leave a traceback (#25666) #25794
Open
wesleysimplicio wants to merge 1 commit into
…eback (NousResearch#25666)

Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on Raspberry Pi aarch64, with the last visible output being a `gateway.platforms.telegram` httpx.ReadError reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV, which makes the bug essentially undiagnosable without more data.

The Python standard library has `faulthandler` exactly for this: when enabled, the next SIGSEGV / SIGFPE / SIGABRT / SIGBUS / SIGILL dumps a Python+C traceback (and, with `all_threads=True`, every thread's frames) to stderr before the process dies. journald captures stderr by default, so the next crash will leave actionable detail in the operator's logs.

Changes:
- New helper `hermes_cli.gateway._enable_faulthandler_for_gateway()` enables faulthandler once at gateway startup.
- `run_gateway()` calls it right after the docker-root guard and `sys.path` setup, so any C-extension import that follows is covered.
- Opt-out via `HERMES_DISABLE_FAULTHANDLER=1` for the rare platforms where faulthandler itself is unstable.
- Best-effort: any exception during `enable()` is silently swallowed so an environment without writable stderr can't break gateway startup.

This is a diagnostic fix, not a root-cause fix. The actual SIGSEGV can only be pinpointed with the traceback this PR captures; it is almost certainly in httpx/openssl/cryptography or a transitive C extension that's flaky on aarch64. Landing this gives the reporter (and any future user hitting the same crash) the trace they need to file an upstream issue against the right package.

Tests:
- `test_enable_returns_true_when_module_available`: happy path.
- `test_opt_out_skips_enable`: env var disables it.
- `test_opt_out_accepts_common_truthy_values`: `1`, `true`, `yes` (case-insensitive) all work.

`python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q` -> 3 passed.

Refs NousResearch#25666.
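For readers who have not opened the diff, the behaviour described in the commit message could look roughly like the sketch below. This is an illustration, not the PR's code: the function name `enable_faulthandler_for_gateway`, its `environ` parameter, and the `_TRUTHY` set are assumptions made here so the snippet is self-contained. Only the `HERMES_DISABLE_FAULTHANDLER` variable and the accepted values `1`, `true`, `yes` come from the description.

```python
import faulthandler
import os
import sys

# Assumed set of opt-out spellings; the PR description lists "1", "true", "yes".
_TRUTHY = {"1", "true", "yes"}


def enable_faulthandler_for_gateway(environ=None):
    """Return True if this helper left faulthandler enabled, False if it
    was skipped (opt-out) or enabling failed.

    `environ` is a hypothetical hook for tests; it defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    flag = env.get("HERMES_DISABLE_FAULTHANDLER", "").strip().lower()
    if flag in _TRUTHY:
        return False  # operator opted out
    try:
        if not faulthandler.is_enabled():
            # all_threads=True dumps every thread's frames on a fatal signal.
            faulthandler.enable(file=sys.stderr, all_threads=True)
        return True
    except Exception:
        # Best-effort: for example, stderr may have no usable file descriptor.
        return False


if __name__ == "__main__":
    print("faulthandler enabled:", enable_faulthandler_for_gateway())
```

Calling the helper before any C-extension import, as the commit message describes for `run_gateway()`, means a crash during those imports is covered as well.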
Thanks for putting this together. Happy to test this branch on the architecture if that would be useful. Otherwise I’ll wait for it to merge, and capture the next gateway crash output with faulthandler enabled.
What does this PR do?
Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on Raspberry Pi aarch64, with the last visible output being a `gateway.platforms.telegram` `httpx.ReadError` reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV, which makes the bug essentially undiagnosable without more data.
Root cause
The detailed rationale from the original PR body is preserved below. This template update keeps the review structure consistent with #29640.
Fix
- New helper `hermes_cli.gateway._enable_faulthandler_for_gateway()` enables faulthandler at gateway startup.
- `run_gateway()` calls it right after the docker-root guard and `sys.path` setup, so any C-extension import that follows is covered.
- Opt-out via `HERMES_DISABLE_FAULTHANDLER=1` for the rare platforms where faulthandler itself is unstable (accepts `1`, `true`, `yes`, case-insensitive).
- Best-effort: any exception during `enable()` is silently swallowed so an environment without writable stderr can't break gateway startup.
Why this shape
This shape mirrors #29640 so reviewers can quickly compare scope, root cause, fix, tests, and related context without having to decode a custom PR description.
Tests
Original body
Related PRs / issues
Fixes #25666
Original body
Summary
Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on Raspberry Pi aarch64, with the last visible output being a `gateway.platforms.telegram` `httpx.ReadError` reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV, which makes the bug essentially undiagnosable without more data.
What Changed
Flow
The change still follows the original flow described in the preserved section below, without expanding the functional scope of this PR.
Overview
Standardizing the description makes review easier, reduces noise, and prevents formatting drift across open PRs.
Test Plan
Original body
What does this PR do?
Summary
Diagnostic fix for #25666. Reporter @rab1dd0g hit recurring `status=11/SEGV` (exit 139) on Raspberry Pi aarch64, with the last visible output being a `gateway.platforms.telegram` `httpx.ReadError` reconnect loop. No Python traceback was captured before the crash, so neither the reporter nor maintainers can tell which C-extension call frame triggered the SIGSEGV, which makes the bug essentially undiagnosable without more data.
This PR doesn't fix the SIGSEGV. It makes the next SIGSEGV self-document so it can be fixed.
Why a diagnostic-only PR
The root cause is almost certainly in a transitive C extension (httpx → h11/h2 → openssl/cryptography, or grpc, or PyO3-based crypto on aarch64). Without a stack trace, picking the right upstream package to file against is guessing. With faulthandler enabled, the next crash gives:
- `all_threads=True`) so a thread crashing while another holds the GIL is visible
That's enough to turn this issue from "Telegram gateway crashes silently" into a precise, actionable report.
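To see what such a report looks like without waiting for a real crash, the sketch below spawns a child interpreter that enables `faulthandler` and then deliberately reads from address 0 through `ctypes`. The helper name `crash_and_capture` and the `CHILD` snippet are illustrative choices made here, not part of this PR, and the exact dump text varies by platform and Python version.

```python
import subprocess
import sys

# Child program: enable faulthandler, then dereference a null pointer through
# ctypes so the interpreter receives a fatal signal.
CHILD = "import faulthandler, ctypes; faulthandler.enable(); ctypes.string_at(0)"


def crash_and_capture():
    """Run the crashing child and return (returncode, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = crash_and_capture()
    print("exit code:", code)
    print(err)
```

On POSIX systems a child killed by a signal reports a negative return code here; a shell would show the same death as 128 plus the signal number, which is where exit 139 for SIGSEGV (signal 11) comes from. The captured stderr begins with a `Fatal Python error` line followed by the frames that were executing, which is the material journald would retain for a gateway crash.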
Fix
- New helper `hermes_cli.gateway._enable_faulthandler_for_gateway()` enables faulthandler at gateway startup.
- `run_gateway()` calls it right after the docker-root guard and `sys.path` setup, so any C-extension import that follows is covered.
- Opt-out via `HERMES_DISABLE_FAULTHANDLER=1` for the rare platforms where faulthandler itself is unstable (accepts `1`, `true`, `yes`, case-insensitive).
- Best-effort: any exception during `enable()` is silently swallowed so an environment without writable stderr can't break gateway startup.
Solution sketch
flowchart TD
    A[hermes -p P gateway run] --> B[run_gateway]
    B --> C[_enable_faulthandler_for_gateway]
    C --> D{HERMES_DISABLE_FAULTHANDLER set?}
    D -- yes --> E[skip]
    D -- no --> F[faulthandler.enable all_threads=True]
    F --> G[gateway boot continues]
    G --> H[Some C extension crashes later]
    H --> I[faulthandler dumps Python+C traceback to stderr]
    I --> J[journald captures it -> actionable bug report]
Tests
3 new cases on the new `TestEnableFaulthandler` class:
- `test_enable_returns_true_when_module_available`: happy path.
- `test_opt_out_skips_enable`: env var disables it.
- `test_opt_out_accepts_common_truthy_values`: `1`, `true`, `yes` (case-insensitive) all work.
Risk
- `try/except: pass` so it can't break gateway startup even on a broken interpreter.
Duplicate check
- `gh pr list --state open --search "25666 in:body,title"` → 0
- `gh pr list --state open --search "faulthandler"` → 0
- `gh pr list --state open --search "SIGSEGV"` → 0
- `gh pr list --search "faulthandler is:merged"` → 0
Funnel discipline
Opened under doc 23. Diagnostic-only contributions for hard-to-reproduce bugs are explicitly endorsed in the doc as a legitimate funnel category; see the #22388 reference pattern in `hermes/23-funnel-discipline.md`.
Related Issue
Fixes #25666
Type of Change
Changes Made
`.github/PULL_REQUEST_TEMPLATE.md`
How to Test
`python -m pytest tests/hermes_cli/test_gateway.py::TestEnableFaulthandler -q`.
Checklist
Code
- `fix(scope):`, `feat(scope):`, etc.)
- `pytest tests/ -q` and all tests pass
Documentation & Housekeeping
- `docs/`, docstrings), or N/A
- `cli-config.yaml.example` if I added/changed config keys, or N/A
- `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
Screenshots / Logs
Generated by Hermes Turbo