Empty claude-cli subprocess responses misclassified as billing cooldown #83231

@neilosdenning

Version

OpenClaw 2026.5.12 (build f066dd2), Node claude-cli provider, Linux.

Summary

When the bundled Claude CLI subprocess returns a zero-token, no-text completion (no error, no abort, no timeout), the provider classifier records it as a billing failure in ~/.openclaw/agents/main/agent/auth-state.json. Three such responses trip a cooldown on the profile, after which every subsequent run on that profile aborts in ~300 ms with no model call and no trajectory file.

This is not a real billing/wallet condition — the user's Claude account is funded and other clients on the same account succeed.

Reproduction signature

After a run completes with status: "success" but an empty assistant reply, auth-state.json shows:

"usageStats": {
  "anthropic:claude-cli": {
    "errorCount": 1,
    "failureCounts": { "billing": 1 },
    "lastFailureAt": <ts>
  }
}

The corresponding trajectory's model.completed event has:

lastCallUsage: { input_tokens: 0, output_tokens: 0, total_tokens: 0 }
assistantTexts: []
aborted: false
timedOut: false
promptErrorSource: null

After three such events, the profile is cooled down:

"disabledUntil": <future-ts>,
"disabledReason": "billing",
"errorCount": 3,
"failureCounts": { "billing": 3 }

Subsequent runs fail in ~300 ms with:

FallbackSummaryError: All models failed (N):
  claude-cli/claude-sonnet-4-6: Provider claude-cli is in cooldown (suspending lanes) (billing)
  | anthropic/claude-haiku-4-5-...: Provider anthropic is in cooldown (suspending lanes) (billing)

and no <sessionId>.jsonl / .trajectory.jsonl is created — the run aborts before any model call.

Expected

A zero-token, no-text, no-error response from the Claude CLI subprocess should not be classified as a billing failure. Either:

  1. Treat empty completions as a distinct failure mode (e.g. empty-response) and apply a separate cooldown policy, or
  2. Don't increment any failure counter for empty responses (treat as a no-op / retry candidate), or
  3. Inspect the CLI exit code and stderr before classifying — an empty stdout with exit 0 is not the same as a billing rejection.

Actual

The classifier maps the empty response to billing, the cooldown trips after 3 consecutive empty responses (which happen organically during normal Discord/cron load), and all dependent jobs fail until the cooldown window elapses or the state file is manually edited.

Workaround attempted

Adding a systemd ExecStartPre that clears disabledUntil / disabledReason / errorCount / failureCounts from auth-state.json on gateway start works at startup, but is not sufficient — during a 16-minute window between gateway restart and a manual cron trigger, two ordinary Discord/cron sessions re-tripped failureCounts.billing to 3.

ExecStartPre snippet (for reference):

ExecStartPre=/usr/bin/python3 -c "import json,os;f=os.path.expanduser('~/.openclaw/agents/main/agent/auth-state.json');d=json.load(open(f));[(s.pop('disabledUntil',None),s.pop('disabledReason',None),s.update(errorCount=0,failureCounts={})) for s in d.get('usageStats',{}).values()];json.dump(d,open(f,'w'),indent=2)"
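
For readability, the reset that the snippet above is meant to perform can be written out as a standalone script. This is a sketch of the workaround only, assuming the auth-state.json layout shown earlier in this issue (`usageStats` keyed by profile). The function name `clear_cooldowns` and the temp-file-then-rename write are illustrative choices, not part of OpenClaw.

```python
import json
import os
import tempfile

# Keys taken from the auth-state.json excerpts in this issue.
COOLDOWN_KEYS = ("disabledUntil", "disabledReason")


def clear_cooldowns(path):
    """Remove cooldown markers and reset failure counters for every profile.

    Returns the number of profiles that had a cooldown marker removed.
    """
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)

    cleared = 0
    for stats in state.get("usageStats", {}).values():
        had_marker = any(key in stats for key in COOLDOWN_KEYS)
        # Pop every key unconditionally; chaining the pops with `or`
        # would stop at the first truthy value and skip the rest.
        for key in COOLDOWN_KEYS:
            stats.pop(key, None)
        stats["errorCount"] = 0
        stats["failureCounts"] = {}
        cleared += had_marker

    # Write to a temp file in the same directory, then rename, so a crash
    # mid-write cannot leave a truncated state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)
    return cleared
```

Calling `clear_cooldowns(os.path.expanduser('~/.openclaw/agents/main/agent/auth-state.json'))` from an ExecStartPre script would apply the same reset. It does not stop the counters from re-tripping afterwards.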

Suggested fix locations

The classifier path that maps subprocess result → failure category should distinguish between:

  • non-zero exit / stderr containing a billing-shaped signal → billing (current behaviour, correct)
  • exit 0 with empty stdout / usage.total_tokens == 0 and assistantTexts == [] → new bucket (or no-op)
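
The distinction above could be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual classifier: `SubprocessResult`, `classify_result`, the category names other than `billing`, and the `BILLING_HINTS` strings are all assumptions made for the example. Treating a non-zero exit without a billing hint as a generic `error` is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical substrings that would mark stderr as billing-shaped.
# The real signals emitted by the Claude CLI are not given in this issue.
BILLING_HINTS = ("billing", "credit balance", "payment required")


@dataclass
class SubprocessResult:
    """Hypothetical summary of one claude-cli subprocess run."""
    exit_code: int
    stderr: str = ""
    total_tokens: int = 0
    assistant_texts: List[str] = field(default_factory=list)
    aborted: bool = False
    timed_out: bool = False


def classify_result(result: SubprocessResult) -> Optional[str]:
    """Return a failure category, or None when the run looks successful."""
    stderr = result.stderr.lower()
    # Only an explicit billing-shaped signal maps to the billing bucket.
    if any(hint in stderr for hint in BILLING_HINTS):
        return "billing"
    if result.timed_out:
        return "timeout"
    if result.aborted:
        return "aborted"
    if result.exit_code != 0:
        return "error"
    # Exit 0 with no tokens and no text: the signature from this issue.
    has_text = any(text.strip() for text in result.assistant_texts)
    if result.total_tokens == 0 and not has_text:
        return "empty-response"
    return None
```

With a separate `empty-response` bucket, the cooldown policy can then decide independently whether such results count toward any threshold at all.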

Impact

Every cron job and Discord session that depends on anthropic:claude-cli becomes unreliable after ~3 empty-response coincidences. Operators see this as "claude-cli is dead" or "billing issue" with no actual billing problem. The cooldown is invisible from openclaw health (it shows Discord: configured and gateway healthy) and only visible by reading auth-state.json directly or seeing the FallbackSummaryError in openclaw cron runs.
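
Until the cooldown is surfaced somewhere more visible, it can be checked by reading the state file directly. A minimal sketch, assuming the auth-state.json layout shown in this issue and assuming `disabledUntil` is an epoch timestamp in milliseconds (the issue only shows `<future-ts>`, so the unit is a guess). `load_state` and `active_cooldowns` are hypothetical helper names.

```python
import json
import time


def load_state(path):
    """Read auth-state.json and return the parsed dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def active_cooldowns(state, now_ms=None):
    """Return {profile: (reason, remaining_seconds)} for profiles in cooldown.

    Assumes disabledUntil is epoch milliseconds; adjust if the file uses seconds.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    result = {}
    for profile, stats in state.get("usageStats", {}).items():
        until = stats.get("disabledUntil")
        if isinstance(until, (int, float)) and until > now_ms:
            reason = stats.get("disabledReason", "unknown")
            result[profile] = (reason, (until - now_ms) / 1000)
    return result
```

Running `active_cooldowns(load_state(path))` against the state file returns an empty dict when no profile is currently disabled.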

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:auth-provider: Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
