
fix: fail fast on qa live auth errors #63333

Merged
shakkernerd merged 12 commits into main from fix/qa-live-auth-fail-fast on Apr 8, 2026

Conversation

@shakkernerd (Member) commented Apr 8, 2026

Summary

  • Problem: live QA scenario flows could hide real provider auth failures behind long generic timeouts, even when the underlying gateway had already produced an immediate actionable error.
  • Why it matters: this made live QA runs much slower to debug and much harder to trust, because provider setup failures looked like scenario timeouts instead of the real root cause.
  • What changed: QA now treats known fallback and error replies as terminal failures across both outbound-message waits and raw scenario condition waits, and external missing-API-key failures now preserve direct auth guidance instead of collapsing to the generic fallback reply.
  • What did NOT change (scope boundary): this does not change model auth itself; it fixes how QA surfaces those failures in scenario flows.

Change Type

  • Bug fix
  • Test

Root Cause

  • Root cause: the live QA scenario path accepted the channel fallback reply as ordinary outbound text, or kept polling success-specific conditions without checking whether a classified failure reply had already arrived.
  • Contributing context: the underlying gateway already knew the real error immediately, but the scenario flow surfaced it too late and too generically.

User-visible Changes

  • Live QA suite runs now fail fast when the channel reply is an obvious runtime failure reply.
  • That fail-fast wiring now covers both success-specific outbound waits and raw scenario `waitForCondition(...)` checks.
  • Missing API key failures now preserve direct provider guidance instead of only showing "Something went wrong while processing your request."
  • Repeated auth failure text like `x | x` is now collapsed into one clean message in the final QA report.
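
The report deduplication can be sketched as follows. This is an illustrative stand-in for the PR's `collapseRepeatedFailureDetail` helper; the `" | "` separator and trimming behavior here are assumptions, not the actual suite.ts code.

```typescript
// Collapse a detail string like "x | x | x" into "x" when every segment is
// the same failure text; distinct segments are left untouched.
function collapseRepeatedFailureDetail(detail: string): string {
  const parts = detail.split(" | ").map((p) => p.trim()).filter(Boolean);
  if (parts.length === 0) return detail;
  const first = parts[0];
  return parts.every((p) => p === first) ? first : detail;
}
```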

Why This Matters

This bug was first captured clearly through Kova.

Kova showed a sharp mismatch between two otherwise similar live runs:

  • openai/gpt-5.4 failed after a long timeout
  • openai-codex/gpt-5.4 passed quickly on the same machine

Example Kova evidence from the investigation:

candidate: openai/gpt-5.4
run: kova_20260408_183035268_cc484f
verdict: fail
duration: 380.9s
detail: timed out after 360000ms

candidate: openai-codex/gpt-5.4
run: kova_20260408_185441906_df9a53
verdict: pass
duration: 65.0s
judge score: 8.8

That made the bug obvious:

  • provider auth was not generically broken
  • the live QA scenario flow was hiding the real auth failure

Verification

Targeted tests

  • pnpm test extensions/qa-lab/src/reply-failure.test.ts extensions/qa-lab/src/suite.test.ts src/auto-reply/reply/agent-runner-execution.test.ts

Live repro before the fix

  • Direct manual lane already failed fast with the correct auth message:
    • pnpm openclaw qa manual --provider-mode live-frontier --model openai/gpt-5.4 --message 'Reply exactly: OPENAI_OK' --timeout-ms 30000
  • But the scenario lane could stall and surface a timeout instead of the auth failure.

Live repro after the fix

  • pnpm openclaw qa suite --provider-mode live-frontier --model openai/gpt-5.4 --alt-model openai/gpt-5.4 --scenario memory-recall --output-dir .artifacts/qa-e2e/repro-auth-timeout-memory-recall

Result:

  • suite finished in about 6.8s
  • scenario failed immediately with:
    • agent.wait returned error: FailoverError: No API key found for provider "openai". You are authenticated with OpenAI Codex OAuth. Use openai-codex/gpt-5.4 (OAuth) or set OPENAI_API_KEY to use openai/gpt-5.4.

Snippet from the generated QA report:

### Memory recall after context switch

- Status: fail
- Details: agent.wait returned error: FailoverError: No API key found for provider "openai". You are authenticated with OpenAI Codex OAuth. Use openai-codex/gpt-5.4 (OAuth) or set OPENAI_API_KEY to use openai/gpt-5.4.

Files

  • extensions/qa-lab/src/reply-failure.ts
  • extensions/qa-lab/src/reply-failure.test.ts
  • extensions/qa-lab/src/suite.ts
  • extensions/qa-lab/src/suite.test.ts
  • src/auto-reply/reply/agent-runner-execution.ts
  • src/auto-reply/reply/agent-runner-execution.test.ts

Risks and Mitigations

  • Risk: QA could now fail faster on replies that are intentional warning text.
  • Mitigation: the failure classifier is intentionally narrow and only matches known runtime fallback and error reply patterns.
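
The narrow-match claim can be illustrated with a sketch in the spirit of `extractQaFailureReplyText`. The specific patterns below are assumptions for illustration; the real classifier lives in extensions/qa-lab/src/reply-failure.ts.

```typescript
// Only replies matching a known runtime failure pattern are classified as
// terminal failures; ordinary assistant text (including warning-ish prose)
// passes through unclassified.
const FAILURE_REPLY_PATTERNS: RegExp[] = [
  /Something went wrong while processing your request/u,
  /No API key found for provider "[^"]+"/u,
  /^FailoverError:/u,
];

function extractQaFailureReplyText(text: string): string | null {
  const trimmed = text.trim();
  return FAILURE_REPLY_PATTERNS.some((re) => re.test(trimmed)) ? trimmed : null;
}
```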

@aisle-research-bot

aisle-research-bot Bot commented Apr 8, 2026

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

| # | Severity | Title |
| - | - | - |
| 1 | 🟡 Medium | Outbound assistant text propagated into thrown Errors and QA reports (potential secret leakage) |
| 2 | 🟡 Medium | User-facing replies disclose gateway auth/config state via missing-API-key errors |

1. 🟡 Outbound assistant text propagated into thrown Errors and QA reports (potential secret leakage)

| Property | Value |
| - | - |
| Severity | Medium |
| CWE | CWE-532 |
| Location | extensions/qa-lab/src/suite.ts:194-202 |

Description

waitForOutboundMessage / waitForCondition in extensions/qa-lab/src/suite.ts now treat certain assistant replies as failures and throw Error(...) using the full outbound message text.

This error is then:

  • converted to a string via formatErrorMessage(error) in runScenario() and stored as details
  • rendered verbatim into the generated Markdown report via renderQaMarkdownReport() (code fences preserve the full content)
  • optionally printed to stderr when OPENCLAW_QA_DEBUG=1

If an outbound assistant message contains sensitive data (e.g., tokens accidentally echoed by the model, tool outputs, internal URLs, credentials included in scenario prompts, etc.), that content will be persisted into CI artifacts/logs.

Vulnerable code:

```ts
if (failureMessage) {
  throw new Error(extractQaFailureReplyText(failureMessage.text) ?? failureMessage.text);
}
```

Recommendation

Avoid embedding full assistant text into exception messages that are later logged or persisted.

Options:

  1. Throw a structured error with a stable code, and redact/limit the text included in .message:

```ts
class QaFailureReplyError extends Error {
  constructor(public readonly failureKind: string) {
    super(`QA failure reply detected: ${failureKind}`);
    this.name = "QaFailureReplyError";
  }
}

// ...
const failureText = extractQaFailureReplyText(failureMessage.text);
if (failureText) {
  // Keep the full text only in memory if needed; do not log/persist by default.
  throw new QaFailureReplyError(failureText.slice(0, 80));
}
```

  2. If the full text is needed for debugging, store it in a separate field that is only written when an explicit debug flag is enabled, and/or redact obvious secret patterns before writing to reports.

  3. Add a maximum length cap to any logged/persisted model output (e.g., first 200 chars).

2. 🟡 User-facing replies disclose gateway auth/config state via missing-API-key errors
| Property | Value |
| - | - |
| Severity | Medium |
| CWE | CWE-209 |
| Location | src/auto-reply/reply/agent-runner-execution.ts:334-349 |

Description

buildExternalRunFailureText() converts upstream errors like No API key found for provider "..." into end-user facing reply text.

  • The reply explicitly reveals that the backend is using a gateway and that it lacks an API key for a specific provider.
  • For OpenAI it additionally advises setting OPENAI_API_KEY or using a specific OAuth model string, which leaks internal configuration details and can confuse/assist untrusted users.
  • These messages are returned on the normal external reply path when shouldSurfaceToControlUi is false (i.e., typical end-user channels), meaning an untrusted user may be able to trigger them (by selecting/forcing a provider) to infer provider availability and the gateway’s auth state.

Vulnerable code:

```ts
const SAFE_MISSING_API_KEY_PROVIDERS = new Set(["anthropic", "google", "openai", "openai-codex"]);

function buildMissingApiKeyFailureText(message: string): string | null {
  const providerMatch = normalizedMessage.match(/No API key found for provider "([^"]+)"/u);
  const provider = providerMatch?.[1]?.trim().toLowerCase();
  ...
  if (provider === "openai" && normalizedMessage.includes("OpenAI Codex OAuth")) {
    return "⚠️ Missing API key for OpenAI on the gateway. Use `openai-codex/gpt-5.4` for OAuth, or set `OPENAI_API_KEY`, then try again.";
  }
  if (SAFE_MISSING_API_KEY_PROVIDERS.has(provider)) {
    return `⚠️ Missing API key for provider "${provider}". Configure the gateway auth for that provider, then try again.`;
  }
}
```

Recommendation

Only surface detailed provider/auth configuration guidance in admin/control UI contexts.

  • For end-user channels, return a generic message (no provider names, no env var names), and log the detailed error server-side.
  • If you still want to help legitimate users, gate the detailed message behind an explicit admin/debug flag.

Example:

```ts
function buildMissingApiKeyFailureText(message: string, opts: { admin: boolean }): string | null {
  const provider = ...;
  if (!provider) return null;

  if (!opts.admin) {
    return "⚠️ Model configuration error. Please contact an administrator.";
  }

  return `⚠️ Gateway is missing credentials for provider "${provider}". Configure gateway auth and retry.`;
}
```

Analyzed PR: #63333 at commit 00bace5

Last updated on: 2026-04-08T21:02:26Z


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 342dde197f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/auto-reply/reply/agent-runner-execution.ts Outdated
Comment thread extensions/qa-lab/src/suite.ts
@greptile-apps
Contributor

greptile-apps Bot commented Apr 8, 2026

Greptile Summary

This PR adds fail-fast behavior to live QA scenario flows when the gateway returns a known auth/runtime error reply, and ensures that "No API key found for provider" errors preserve their direct user guidance instead of collapsing to the generic fallback message. The changes are narrow and well-targeted: a new `extractQaFailureReplyText` classifier in `extensions/qa-lab/src/reply-failure.ts`, a hook into `waitForOutboundMessage` in `suite.ts`, a `collapseRepeatedFailureDetail` deduplication helper, and a targeted override in `buildExternalRunFailureText` for the missing-API-key case.

Confidence Score: 5/5

Safe to merge — logic is correct, tests cover the key behaviors, and all remaining observations are P2.

No P0 or P1 issues. The classifier, deduplication helper, and buildExternalRunFailureText override are all correct. The only note is a design limitation (predicate-gated fail-fast) that is documented in the PR description and does not regress any existing behavior.

No files require special attention.

Vulnerabilities

No security concerns identified. The buildExternalRunFailureText change does forward a gateway-produced error string to the user reply, but provider names in that path are controlled by the system config, not user input. The failure classifier in reply-failure.ts is read-only and purely for error detection.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/qa-lab/src/suite.ts
Line: 184-187

Comment:
**Fail-fast is predicate-gated**

The failure check only fires when a message already matched `predicate`. A scenario step that waits for a specific reply text (e.g. `msg.text.includes("OPENAI_OK")`) won't short-circuit on an auth error — the error message won't satisfy that predicate, so the step still times out. This is a known design boundary (acknowledged in the PR), but worth a brief inline note so future scenario authors understand why their narrow-predicate steps can still time out on auth failures even after this fix.

How can I resolve this? If you propose a fix, please make it concise.
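
The design boundary described in that comment can be sketched as a minimal poll step. Names and shapes below are illustrative, not the actual suite.ts code.

```typescript
type OutboundMessage = { text: string };

// One poll of the wait loop: the failure check only runs on messages that
// already satisfy the caller's predicate.
function pollOnce(
  messages: OutboundMessage[],
  predicate: (msg: OutboundMessage) => boolean,
  isFailureReply: (text: string) => boolean,
): { kind: "match" | "failure"; msg: OutboundMessage } | null {
  for (const msg of messages) {
    if (!predicate(msg)) continue; // narrow predicates skip auth-error replies
    if (isFailureReply(msg.text)) return { kind: "failure", msg };
    return { kind: "match", msg };
  }
  return null; // no match yet: caller keeps polling until timeout
}
```

With a predicate like `(m) => m.text.includes("OPENAI_OK")`, an auth-error reply never reaches the failure check, so the step still times out; checking `isFailureReply` before the predicate would close that gap.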

Reviews (1): Last reviewed commit: "fix: fail fast on qa live auth errors" | Re-trigger Greptile

Comment thread extensions/qa-lab/src/suite.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97732e7469


Comment thread extensions/qa-lab/src/suite.ts Outdated
Comment thread extensions/qa-lab/src/reply-failure.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a719f9231


```ts
}

function createScenarioWaitForCondition(state: QaBusState) {
  const sinceIndex = state.getSnapshot().messages.length;
```

P1 Badge Recompute failure cursor after scenario bus resets

createScenarioWaitForCondition snapshots sinceIndex once from the pre-flow message length and reuses it for every poll. Several flows invoke reset() after API creation (for example qa/scenarios/model-switch-tool-continuity.md), which clears state.messages; with the stale pre-reset index, findFailureOutboundMessage slices past new outbound failures and the scenario can fall through to a timeout instead of failing fast on the auth/error reply.
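
One possible shape for the fix is to clamp the captured cursor against the current message count on every poll. The sketch below is illustrative and assumes a simplified snapshot shape, not the actual QaBusState API.

```typescript
type Snapshot = { messages: { text: string }[] };

// Scan for a failure reply at or after the captured cursor. If a reset()
// shrank the bus below that cursor, fall back to scanning from the start so
// fresh post-reset failures are not sliced away.
function findFailureSince(
  getSnapshot: () => Snapshot,
  capturedSinceIndex: number,
  isFailure: (text: string) => boolean,
): string | null {
  const { messages } = getSnapshot();
  const since = capturedSinceIndex > messages.length ? 0 : capturedSinceIndex;
  for (const msg of messages.slice(since)) {
    if (isFailure(msg.text)) return msg.text;
  }
  return null;
}
```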



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ce39b7ddf


```ts
}

function createScenarioWaitForCondition(state: QaBusState) {
  const sinceIndex = state.getSnapshot().messages.length;
```

P1 Badge Recompute failure cursor after bus resets

createScenarioWaitForCondition snapshots sinceIndex once from state.getSnapshot().messages.length, but scenario flows frequently call reset() after API creation; reset() clears messages, so this stale pre-reset index points past newly-added outbound errors. In that case findFailureOutboundMessage(..., { cursorSpace: "all" }) slices away fresh failure replies and the wait falls through to a timeout instead of failing fast on the real auth/error response.


@shakkernerd shakkernerd self-assigned this Apr 8, 2026
@shakkernerd shakkernerd force-pushed the fix/qa-live-auth-fail-fast branch from 5ce39b7 to 11fc6bf Compare April 8, 2026 20:50
@shakkernerd shakkernerd merged commit 540fcd4 into main Apr 8, 2026
9 checks passed
@shakkernerd shakkernerd deleted the fix/qa-live-auth-fail-fast branch April 8, 2026 20:55
@shakkernerd
Member Author

Landed via rebase.

  • Prepared head SHA: 00bace5
  • Merge commit: 540fcd4
  • Local verification: focused QA/auth Vitest targets passed, pnpm build passed, and pnpm tsgo still reproduced the unrelated baseline failure already present on latest main.

Thanks @shakkernerd!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00bace5815


```ts
}

function createScenarioWaitForCondition(state: QaBusState) {
  const sinceIndex = state.getSnapshot().messages.length;
```

P1 Badge Recompute scenario failure cursor after bus resets

createScenarioWaitForCondition snapshots sinceIndex once from state.getSnapshot().messages.length and reuses it for every poll, but many scenario flows call reset() after API creation (for example qa/scenarios/model-switch-tool-continuity.md). Because reset() clears messages, this stale pre-reset index can point past all newly added messages, so findFailureOutboundMessage(..., { cursorSpace: "all" }) skips fresh auth/runtime failure replies and the scenario falls through to a timeout instead of failing fast on the real error.

