fix(agent): retry TLS certificate transport errors #14374

Closed
sgaofen wants to merge 1 commit into NousResearch:main from sgaofen:codex/fix-14367-ssl-cert-retry

Conversation

sgaofen commented Apr 23, 2026

Contributor

Root cause

ssl.SSLCertVerificationError inherits from ValueError, so the inline local-validation guard treated TLS certificate failures as non-retryable request-shaping bugs. Those failures are provider transport errors and should stay on the normal retry/failover path.
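
The inheritance chain behind this can be checked against the standard library. A minimal sketch: `naive_guard` is a hypothetical stand-in for a guard that treats every ValueError/TypeError as a local bug, not the actual inline code in run_agent.py.

```python
import ssl

# SSLCertVerificationError derives from both SSLError (an OSError) and ValueError.
assert issubclass(ssl.SSLCertVerificationError, ssl.SSLError)
assert issubclass(ssl.SSLCertVerificationError, ValueError)
assert issubclass(ssl.SSLError, OSError)


def naive_guard(exc: BaseException) -> bool:
    """Hypothetical guard: any ValueError/TypeError counts as local validation."""
    return isinstance(exc, (ValueError, TypeError))


cert_error = ssl.SSLCertVerificationError("certificate verify failed")

# The TLS failure matches the ValueError check, so it would be treated as
# a non-retryable request-shaping bug.
misclassified = naive_guard(cert_error)
print(misclassified)
```

Because the match comes from the ValueError base class, any guard written as a plain `isinstance(exc, ValueError)` check needs an explicit exclusion for TLS errors.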

Closes #14367.

Fix

  • Add a small _is_local_validation_error() helper for the request-shaping guard.
  • Exclude ssl.SSLError from local-validation classification, which covers SSLCertVerificationError and related TLS transport failures.
  • Add regression coverage proving certificate errors are not treated as local validation while ordinary ValueError/TypeError still are.

Tests

  • uv run --frozen --python 3.11 --extra dev pytest -o addopts= tests/run_agent/test_run_agent.py::test_ssl_cert_verification_error_is_not_local_validation tests/run_agent/test_run_agent.py::test_request_shaping_errors_are_local_validation tests/run_agent/test_run_agent.py::test_unicode_encode_error_is_not_local_validation -q -> 4 passed
  • uv run --frozen --python 3.11 --extra dev pytest -o addopts= tests/agent/test_error_classifier.py -q -> 111 passed
  • uv run --frozen --python 3.11 --extra dev pytest -o addopts= tests/run_agent/test_run_agent.py -q -> 297 passed
  • git diff --check

alt-glitch added the type/bug, P2, and comp/agent labels on Apr 23, 2026
sgaofen force-pushed the codex/fix-14367-ssl-cert-retry branch from 088f4da to 556e104 on April 27, 2026 04:13
sgaofen commented Apr 27, 2026

Contributor Author

Rebased this PR onto current origin/main (859e09b7). During conflict resolution I kept the extracted _is_local_validation_error() helper from this PR and folded in the newer main-branch json.JSONDecodeError exclusion, so the rebase does not regress the provider malformed-response retry behavior.

Validation after rebase:

  • /Users/stephenyu/Documents/hermes-agent/.venv/bin/python -m pytest tests/run_agent/test_run_agent.py::test_ssl_cert_verification_error_is_not_local_validation tests/run_agent/test_run_agent.py::test_request_shaping_errors_are_local_validation tests/run_agent/test_run_agent.py::test_unicode_encode_error_is_not_local_validation tests/run_agent/test_run_agent.py::test_json_decode_error_is_not_local_validation -q --tb=short -> 5 passed
  • /Users/stephenyu/Documents/hermes-agent/.venv/bin/python -m pytest tests/agent/test_error_classifier.py -q --tb=short -> 118 passed
  • /Users/stephenyu/Documents/hermes-agent/.venv/bin/python -m pytest tests/run_agent/test_run_agent.py -q --tb=short -> 308 passed
  • git diff --check -> passed

sgaofen commented Apr 27, 2026

Contributor Author

CI follow-up: the refreshed full test job is red with the same repo-wide baseline cluster seen on sibling PRs, not with failures in this TLS/local-validation change. The failures are concentrated in Discord fixture guild attributes, npm install vs ci expectations, PTY/web-server/tui_gateway drift, WSL clipboard detection, and tool-arg coercion. The targeted validation for this PR remains green (test_run_agent.py 308 passed and test_error_classifier.py 118 passed).

@teknium1

Contributor

Thanks for the fix @sgaofen — the core change here landed on main independently before this PR could be merged.

Automated hermes-sweeper review.

  • The ssl.SSLError exclusion from is_local_validation_error is present at run_agent.py:11169-11176 via commit 4e27e498f1b438b2a380cd4be83dc37761fd1412 (fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort), which also closes #14367 (ssl.SSLCertVerificationError misclassified as local validation error).
  • The fix is not yet in a release tag (landed after v2026.4.23), but it is on main.
  • One thing worth noting: the landing commit did not include the three regression tests you authored (test_ssl_cert_verification_error_is_not_local_validation, etc.). If a maintainer wants to rescue the test additions from this PR, that +20-line patch to tests/run_agent/test_run_agent.py is still valuable coverage.

@teknium1 teknium1 closed this Apr 27, 2026

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • P2: Medium, degraded but workaround exists
  • type/bug: Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ssl.SSLCertVerificationError misclassified as local validation error

3 participants