Skip to content

fix: exclude json.JSONDecodeError from local validation error check#14366

Closed
aj-nt wants to merge 1 commit into
NousResearch:mainfrom
aj-nt:fix/json-decode-error-misclassification
Closed

fix: exclude json.JSONDecodeError from local validation error check#14366
aj-nt wants to merge 1 commit into
NousResearch:mainfrom
aj-nt:fix/json-decode-error-misclassification

Conversation

@aj-nt

@aj-nt aj-nt commented Apr 23, 2026

Copy link
Copy Markdown

Problem

Closes #14271

json.JSONDecodeError inherits from ValueError. When the OpenAI SDK fails to parse a provider response (empty body, truncated JSON, garbled bytes), it raises json.JSONDecodeError. The is_local_validation_error check in run_agent.py catches it as isinstance(api_error, ValueError) and triggers a non-retryable abort — but the error is transient (provider-side), not a programming bug.

The error classifier (agent/error_classifier.py) already correctly returns FailoverReason.unknown, retryable=True for these errors. The bug is the inline isinstance check that overrides the classifier.

Secondary bug found

UnicodeDecodeError (also a ValueError subclass) from garbled provider responses gets the same wrong treatment. The existing code only excluded UnicodeEncodeError but a garbled response body can also cause UnicodeDecodeError.

Fix

Updated is_local_validation_error predicate in run_agent.py:

Before (upstream had only):

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, UnicodeEncodeError)
)

After (this PR, merged with upstream ssl.SSLError fix from #14367):

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeError, json.JSONDecodeError))
    and not isinstance(api_error, ssl.SSLError)
)

Three exclusions from the ValueError catch:

  • UnicodeError (parent class) covers UnicodeEncodeError, UnicodeDecodeError, and UnicodeTranslateError — all indicate provider-side transport issues
  • json.JSONDecodeError — malformed/truncated provider response body
  • ssl.SSLError — TLS transport failure (inherits ValueError via Python MRO, cherry-picked from upstream ssl.SSLCertVerificationError misclassified as local validation error #14367 / commit 4e27e49)

Testing

9 tests in tests/test_json_decode_error_misclassification.py:

  • 5 RED tests (would fail on old code): json.JSONDecodeError, UnicodeDecodeError, UnicodeTranslateError must NOT be classified as local validation errors; classifier must return retryable=True
  • 4 GREEN tests (sanity): plain ValueError, TypeError, UnicodeEncodeError, and ssl.SSLError still correctly excluded

COUPLING note in both production and test file — update both if changing the predicate.

Rebase notes

Rebased onto origin/main (commit 4fade39). Conflict in run_agent.py resolved by merging our UnicodeError broadening with upstream's ssl.SSLError exclusion — both are complementary fixes for the same class of bug (ValueError subclasses from provider transport, not local code).

Impact

Low risk, high value. The change only affects the is_local_validation_error boolean — the classifier pipeline and all other recovery paths are untouched. Genuine ValueError/TypeError from coding bugs still trigger non-retryable abort. Only provider-originated ValueError subclasses now correctly fall through to the retryable recovery path.

@aj-nt aj-nt force-pushed the fix/json-decode-error-misclassification branch from 625726a to 6e0c7c3 Compare April 23, 2026 04:36
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of PR #14293 — same root cause (JSONDecodeError as ValueError subclass in is_local_validation_error), same fix for #14271.

…ion error check

json.JSONDecodeError inherits from ValueError. When the OpenAI SDK fails
to parse a provider response (empty body, truncated JSON), it raises
json.JSONDecodeError. The is_local_validation_error check in run_agent.py
catches it as isinstance(api_error, ValueError) and triggers a
non-retryable abort -- but the error is transient (provider-side), not a
programming bug.

Similarly, UnicodeDecodeError (also ValueError subclass) from garbled
provider responses was incorrectly flagged. The existing code only
excluded UnicodeEncodeError.

The error classifier already correctly returns retryable=True for these
errors; the bug was the isinstance check overriding the classifier.

Fix: Exclude json.JSONDecodeError and UnicodeError (parent of
Encode/Decode/Translate) from the is_local_validation_error check,
so they fall through to the classifier's retryable path.

Closes NousResearch#14271

Co-authored-by: AJ <aj@users.noreply.github.com>
@aj-nt aj-nt force-pushed the fix/json-decode-error-misclassification branch from 91e3dca to cc92a82 Compare April 25, 2026 03:57
@aj-nt aj-nt closed this Apr 25, 2026
@aj-nt aj-nt deleted the fix/json-decode-error-misclassification branch April 25, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JSONDecodeError misclassified as local validation error causes non-retryable abort (HTTP None)

2 participants