Skip to content

fix(agents): recognize OpenAI/Codex server_error as retryable for model fallback#45761

Closed
ToneLoke wants to merge 2 commits intoopenclaw:mainfrom
ToneLoke:fix/server-error-retryable
Closed

fix(agents): recognize OpenAI/Codex server_error as retryable for model fallback#45761
ToneLoke wants to merge 2 commits intoopenclaw:mainfrom
ToneLoke:fix/server-error-retryable

Conversation

@ToneLoke
Copy link
Copy Markdown

Problem

The transient HTTP error detector (isJsonApiInternalServerError) only recognized Anthropic's api_error JSON format. OpenAI and Codex use a different envelope:

{"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}

These transient 500s were treated as fatal errors, causing the model fallback chain to skip retrying — resulting in unnecessary failures.

Fix

Extend isJsonApiInternalServerError() to also match the OpenAI/Codex server_error pattern, so these transient errors trigger proper fallback behavior.

Impact

  • 7-line change in src/agents/pi-embedded-helpers/errors.ts
  • No breaking changes
  • OpenAI/Codex transient 500s now correctly trigger model fallback instead of hard-failing

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS labels Mar 14, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 14, 2026

Greptile Summary

This PR extends isJsonApiInternalServerError() in src/agents/pi-embedded-helpers/errors.ts to recognise OpenAI/Codex transient 500 responses (which use {"type":"server_error","code":"server_error"}) as retryable errors, in addition to the existing Anthropic api_error format. The fix correctly routes these errors through the "timeout" classification in classifyFailoverReason(), triggering model fallback instead of a hard failure.

  • The logic change is minimal, well-scoped, and the dual-condition check ("type":"server_error" and "code":"server_error") makes false positives unlikely.
  • No test was added for the new code path, even though the existing Anthropic branch has a dedicated test case at line 847 of the test file.
  • The .includes() string-matching approach (shared with the Anthropic branch) will silently miss pretty-printed JSON responses where key-value pairs are separated by a space ("type": "server_error").

Confidence Score: 4/5

  • Safe to merge; the new branch correctly classifies OpenAI server_error responses as retryable, with no breaking changes.
  • The core logic is correct and well-targeted. Score is 4 rather than 5 due to the absence of a test case for the new code path (unlike the existing Anthropic branch which is tested) and the inherited fragility of substring matching on potentially pretty-printed JSON.
  • No files require special attention beyond the minor style concerns noted in src/agents/pi-embedded-helpers/errors.ts.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 859-863

Comment:
**Missing test coverage for new code path**

The existing test file (`src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts`, line 847) has a dedicated test case for the Anthropic `api_error` pattern:

```ts
it("classifies JSON api_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"api_error","message":"Internal server error"}}',
    ),
  ).toBe("timeout");
});
```

A parallel test for the new OpenAI/Codex `server_error` path should be added alongside it to ensure the new branch stays covered and to document the expected format:

```ts
it("classifies JSON server_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}',
    ),
  ).toBe("timeout");
});
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 861

Comment:
**Fragile string-matching may miss pretty-printed JSON**

The check uses `value.includes('"type":"server_error"')` (no space after the colon), which will silently fail to match if OpenAI ever sends a pretty-printed response like `"type": "server_error"`. The Anthropic branch has the same pre-existing fragility. Since you're touching this function, consider switching to a safe JSON parse + property access, or at minimum accepting an optional space:

```suggestion
  if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
```

A more robust alternative would be:
```ts
try {
  const parsed = JSON.parse(raw);
  const errType = parsed?.error?.type;
  const errCode = parsed?.error?.code;
  if (errType === "server_error" && errCode === "server_error") {
    return true;
  }
} catch {
  // not valid JSON, fall through
}
```

This is a pre-existing issue but worth addressing while the function is being modified.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 0b58993

Comment on lines +859 to +863
// OpenAI/Codex wraps transient 500s as:
// {"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}
if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
return true;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for new code path

The existing test file (src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts, line 847) has a dedicated test case for the Anthropic api_error pattern:

it("classifies JSON api_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"api_error","message":"Internal server error"}}',
    ),
  ).toBe("timeout");
});

A parallel test for the new OpenAI/Codex server_error path should be added alongside it to ensure the new branch stays covered and to document the expected format:

it("classifies JSON server_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}',
    ),
  ).toBe("timeout");
});
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 859-863

Comment:
**Missing test coverage for new code path**

The existing test file (`src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts`, line 847) has a dedicated test case for the Anthropic `api_error` pattern:

```ts
it("classifies JSON api_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"api_error","message":"Internal server error"}}',
    ),
  ).toBe("timeout");
});
```

A parallel test for the new OpenAI/Codex `server_error` path should be added alongside it to ensure the new branch stays covered and to document the expected format:

```ts
it("classifies JSON server_error internal server failures as timeout", () => {
  expect(
    classifyFailoverReason(
      '{"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}',
    ),
  ).toBe("timeout");
});
```

How can I resolve this? If you propose a fix, please make it concise.

}
// OpenAI/Codex wraps transient 500s as:
// {"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}
if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile string-matching may miss pretty-printed JSON

The check uses value.includes('"type":"server_error"') (no space after the colon), which will silently fail to match if OpenAI ever sends a pretty-printed response like "type": "server_error". The Anthropic branch has the same pre-existing fragility. Since you're touching this function, consider switching to a safe JSON parse + property access, or at minimum accepting an optional space:

Suggested change
if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {

A more robust alternative would be:

try {
  const parsed = JSON.parse(raw);
  const errType = parsed?.error?.type;
  const errCode = parsed?.error?.code;
  if (errType === "server_error" && errCode === "server_error") {
    return true;
  }
} catch {
  // not valid JSON, fall through
}

This is a pre-existing issue but worth addressing while the function is being modified.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/errors.ts
Line: 861

Comment:
**Fragile string-matching may miss pretty-printed JSON**

The check uses `value.includes('"type":"server_error"')` (no space after the colon), which will silently fail to match if OpenAI ever sends a pretty-printed response like `"type": "server_error"`. The Anthropic branch has the same pre-existing fragility. Since you're touching this function, consider switching to a safe JSON parse + property access, or at minimum accepting an optional space:

```suggestion
  if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
```

A more robust alternative would be:
```ts
try {
  const parsed = JSON.parse(raw);
  const errType = parsed?.error?.type;
  const errCode = parsed?.error?.code;
  if (errType === "server_error" && errCode === "server_error") {
    return true;
  }
} catch {
  // not valid JSON, fall through
}
```

This is a pre-existing issue but worth addressing while the function is being modified.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b589939f2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
// OpenAI/Codex wraps transient 500s as:
// {"type":"error","error":{"type":"server_error","code":"server_error","message":"An error occurred..."}}
if (value.includes('"type":"server_error"') && value.includes('"code":"server_error"')) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match server_error without requiring code field

The new OpenAI/Codex branch only returns true when both "type":"server_error" and "code":"server_error" are present, but we already handle payloads in this repo where server_error appears without a code field (for example src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts uses {"error":{"type":"server_error"...}}). In that common shape, this detector still returns false, so transient provider failures remain non-retryable and model fallback can still hard-fail instead of retrying.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84b7d324a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +866 to +867
if (errType === "server_error" && errCode === "server_error") {
return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match server_error even when code is missing

The new OpenAI/Codex branch only returns true when both error.type and error.code are exactly "server_error", but this repo already contains server_error payload shapes without a code field (for example src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts:73). In that common shape, classifyFailoverReason() still won’t classify the error as retryable (timeout), so transient provider failures can continue to bypass model fallback and hard-fail requests.

Useful? React with 👍 / 👎.

@dsantoreis

This comment was marked as spam.

…el fallback

The transient HTTP error detector only recognized Anthropic's api_error
format. OpenAI and Codex use a different envelope:
{"type":"server_error","code":"server_error","message":"..."}

These transient 500s were treated as fatal, skipping model fallback
entirely. Now they're correctly identified as retryable.
…error

- Add parallel test case for OpenAI/Codex server_error classification
- Replace fragile .includes() string matching with JSON.parse + property
  access, falling back to substring check for non-JSON payloads
- Handles pretty-printed JSON with spaces after colons
@ToneLoke ToneLoke force-pushed the fix/server-error-retryable branch from 84b7d32 to b7a7e55 Compare March 19, 2026 07:41
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7a7e55167

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const parsed = JSON.parse(raw);
const errType = parsed?.error?.type;
const errCode = parsed?.error?.code;
if (errType === "server_error" && errCode === "server_error") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Accept server_error even when code is absent

This check only classifies the payload as retryable when both error.type and error.code are exactly "server_error", but OpenAI/Codex error payloads commonly omit code, so the same transient 5xx still falls through without failover. Fresh evidence in this repo: src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts:79 uses a server_error payload without a code field, and parseApiErrorInfo already treats err.type as primary while err.code is optional (src/agents/pi-embedded-helpers/errors.ts:627-632).

Useful? React with 👍 / 👎.

@steipete
Copy link
Copy Markdown
Contributor

Closing this as implemented after Codex review.

Current main already classifies OpenAI/Codex server_error payloads as retryable timeout failures, so this PR’s requested behavior is implemented via the broader failover matcher path rather than the older isJsonApiInternalServerError() helper.

What I checked:

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review notes: reviewed against f03252aaf9cc; fix evidence: commit 37c37eecfbe4.

@steipete steipete closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants