Skip to content

JSONDecodeError misclassified as local validation error causes non-retryable abort (HTTP None) #14271

@cat5edopeHA

Description

@cat5edopeHA

Bug Description

Transient provider JSONDecodeError failures are being treated as local validation/programming errors and immediately routed to the non-retryable client-error path.

Because json.JSONDecodeError subclasses ValueError, current logic in run_agent.py classifies it as is_local_validation_error, which bypasses normal retry handling. This causes premature aborts for recoverable upstream/provider parse failures (e.g., empty or non-JSON transient responses).

Steps to Reproduce

Deterministic (unit-style repro)

  1. Instantiate AIAgent.
  2. Patch _interruptible_api_call with side effects:
    • first call: json.JSONDecodeError("Expecting value", "", 0)
    • second call: valid completion response
  3. Call run_conversation("hello").

Real-world repro seen in production

  1. Configure custom provider to https://api.llmgateway.io/v1 (OpenAI-compatible chat completions).
  2. Use model kimi-k2.6.
  3. Send a normal prompt during a period where upstream intermittently returns empty/non-JSON response.
  4. Observe first parse failure aborting as non-retryable instead of retrying.

Expected Behavior

  • JSONDecodeError from provider response parsing should go through normal API error classification/retry/fallback flow.
  • If next attempt succeeds, conversation should complete normally.

Actual Behavior

The first JSONDecodeError is treated as non-retryable local validation (HTTP None) and aborts early.

Environment

  • OS: Proxmox LXC guest (Debian-based) on Proxmox VE
  • Hermes repo: branch main, commit 402d048e (observed before local fix)
  • Python: Python 3.11.15
  • API mode: chat_completions
  • Provider: custom (LLMGateway), model kimi-k2.6

Error Output

⚠️  API call failed (attempt 1/3): JSONDecodeError
   📝 Error: Expecting value: line 1 column 1 (char 0)
⚠️ Non-retryable error (HTTP None) — trying fallback...
❌ Non-retryable client error (HTTP None). Aborting.

Request dump captured:

  • /root/.hermes/sessions/request_dump_20260423_011633_775f73_20260423_011738_719481.json
  • "error": {"type": "JSONDecodeError", "message": "Expecting value: line 1 column 1 (char 0)"}

Suspected Root Cause

In run_agent.py, around the client-error classification block (line ~10669 in current working tree):

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, UnicodeEncodeError)
)

Since json.JSONDecodeError is a ValueError, it is incorrectly flagged as local validation.

Proposed Fix

Exclude json.JSONDecodeError from local-validation fast-fail classification so it can use normal retry logic:

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))
)

Regression Test Suggestion

Add test in tests/run_agent/test_run_agent.py asserting:

  • first API attempt raises JSONDecodeError
  • second attempt succeeds
  • run_conversation returns completed response (no immediate non-retryable abort path triggered)

Example test name:

  • test_jsondecode_error_is_retried_not_treated_as_local_validation

(Locally validated in working tree at tests/run_agent/test_run_agent.py line ~2822.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt buildersweeper:implemented-on-mainSweeper: behavior already present on current maintype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions