Bug Description
Transient provider JSONDecodeError failures are being treated as local validation/programming errors and immediately routed to the non-retryable client-error path.
Because json.JSONDecodeError subclasses ValueError, current logic in run_agent.py classifies it as is_local_validation_error, which bypasses normal retry handling. This causes premature aborts for recoverable upstream/provider parse failures (e.g., empty or non-JSON transient responses).
Steps to Reproduce
Deterministic (unit-style repro)
- Instantiate
AIAgent.
- Patch
_interruptible_api_call with side effects:
- first call:
json.JSONDecodeError("Expecting value", "", 0)
- second call: valid completion response
- Call
run_conversation("hello").
Real-world repro seen in production
- Configure custom provider to
https://api.llmgateway.io/v1 (OpenAI-compatible chat completions).
- Use model
kimi-k2.6.
- Send a normal prompt during a period where upstream intermittently returns empty/non-JSON response.
- Observe first parse failure aborting as non-retryable instead of retrying.
Expected Behavior
JSONDecodeError from provider response parsing should go through normal API error classification/retry/fallback flow.
- If next attempt succeeds, conversation should complete normally.
Actual Behavior
The first JSONDecodeError is treated as non-retryable local validation (HTTP None) and aborts early.
Environment
- OS:
Proxmox LXC guest (Debian-based) on Proxmox VE
- Hermes repo: branch
main, commit 402d048e (observed before local fix)
- Python:
Python 3.11.15
- API mode:
chat_completions
- Provider:
custom (LLMGateway), model kimi-k2.6
Error Output
⚠️ API call failed (attempt 1/3): JSONDecodeError
📝 Error: Expecting value: line 1 column 1 (char 0)
⚠️ Non-retryable error (HTTP None) — trying fallback...
❌ Non-retryable client error (HTTP None). Aborting.
Request dump captured:
/root/.hermes/sessions/request_dump_20260423_011633_775f73_20260423_011738_719481.json
"error": {"type": "JSONDecodeError", "message": "Expecting value: line 1 column 1 (char 0)"}
Suspected Root Cause
In run_agent.py, around the client-error classification block (line ~10669 in current working tree):
is_local_validation_error = (
isinstance(api_error, (ValueError, TypeError))
and not isinstance(api_error, UnicodeEncodeError)
)
Since json.JSONDecodeError is a ValueError, it is incorrectly flagged as local validation.
Proposed Fix
Exclude json.JSONDecodeError from local-validation fast-fail classification so it can use normal retry logic:
is_local_validation_error = (
isinstance(api_error, (ValueError, TypeError))
and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))
)
Regression Test Suggestion
Add test in tests/run_agent/test_run_agent.py asserting:
- first API attempt raises
JSONDecodeError
- second attempt succeeds
run_conversation returns completed response (no immediate non-retryable abort path triggered)
Example test name:
test_jsondecode_error_is_retried_not_treated_as_local_validation
(Locally validated in working tree at tests/run_agent/test_run_agent.py line ~2822.)
Bug Description
Transient provider
JSONDecodeErrorfailures are being treated as local validation/programming errors and immediately routed to the non-retryable client-error path.Because
json.JSONDecodeErrorsubclassesValueError, current logic inrun_agent.pyclassifies it asis_local_validation_error, which bypasses normal retry handling. This causes premature aborts for recoverable upstream/provider parse failures (e.g., empty or non-JSON transient responses).Steps to Reproduce
Deterministic (unit-style repro)
AIAgent._interruptible_api_callwith side effects:json.JSONDecodeError("Expecting value", "", 0)run_conversation("hello").Real-world repro seen in production
https://api.llmgateway.io/v1(OpenAI-compatible chat completions).kimi-k2.6.Expected Behavior
JSONDecodeErrorfrom provider response parsing should go through normal API error classification/retry/fallback flow.Actual Behavior
The first
JSONDecodeErroris treated as non-retryable local validation (HTTP None) and aborts early.Environment
Proxmox LXC guest (Debian-based) on Proxmox VEmain, commit402d048e(observed before local fix)Python 3.11.15chat_completionscustom(LLMGateway), modelkimi-k2.6Error Output
Request dump captured:
/root/.hermes/sessions/request_dump_20260423_011633_775f73_20260423_011738_719481.json"error": {"type": "JSONDecodeError", "message": "Expecting value: line 1 column 1 (char 0)"}Suspected Root Cause
In
run_agent.py, around the client-error classification block (line ~10669 in current working tree):Since
json.JSONDecodeErroris aValueError, it is incorrectly flagged as local validation.Proposed Fix
Exclude
json.JSONDecodeErrorfrom local-validation fast-fail classification so it can use normal retry logic:Regression Test Suggestion
Add test in
tests/run_agent/test_run_agent.pyasserting:JSONDecodeErrorrun_conversationreturns completed response (no immediate non-retryable abort path triggered)Example test name:
test_jsondecode_error_is_retried_not_treated_as_local_validation(Locally validated in working tree at
tests/run_agent/test_run_agent.pyline ~2822.)