[Bug]: Provider cooldown reports 'rate_limit' regardless of actual failure reason (timeout, billing, etc.) #5240

@Alf-Bee

Summary

When a provider enters cooldown due to a timeout error, subsequent error messages report the reason as rate_limit instead of the actual cause (timeout). This makes infrastructure issues (such as an unreachable Ollama server or a dropped SSH tunnel) very hard to debug.

Observed Behavior

Error logs show:

21:04:54.568Z: ollama/qwen2.5:32b-instruct-q4_K_M: LLM request timed out. (unknown)
21:04:54.570Z: ollama/...Provider ollama is in cooldown (all profiles unavailable) (rate_limit)

The first message correctly identifies the timeout (although it is categorized as unknown). The second message, logged 2 ms later once the provider is in cooldown, reports rate_limit instead of timeout.

Root Cause

In dist/agents/model-fallback.js (line ~165), when a provider is skipped due to cooldown, the reason is hardcoded:

if (profileIds.length > 0 && !isAnyProfileAvailable) {
    attempts.push({
        provider: candidate.provider,
        model: candidate.model,
        error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
        reason: "rate_limit",  // ← HARDCODED - should use actual reason
    });
    continue;
}

The actual failure reason IS correctly stored in auth-profiles.json under usageStats[profileId].failureCounts:

{
  "ollama:local": {
    "failureCounts": {
      "timeout": 1  // ← Correct reason is stored here
    }
  }
}

But this information is not used when reporting why the provider is in cooldown.
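
Until the reported reason is fixed, the stored counts can be inspected directly. The following is a minimal sketch, assuming the parsed auth-profiles.json exposes usageStats[profileId].failureCounts as in the excerpt above; summarizeFailureCounts is a hypothetical helper name and not part of OpenClaw.

```javascript
// Hypothetical diagnostic helper (not part of OpenClaw).
// Assumes the parsed auth-profiles.json has the shape
// { usageStats: { [profileId]: { failureCounts: { [reason]: number } } } }.
function summarizeFailureCounts(store) {
    const lines = [];
    for (const [profileId, stats] of Object.entries(store?.usageStats ?? {})) {
        const counts = Object.entries(stats?.failureCounts ?? {});
        const summary = counts.length > 0
            ? counts.map(([reason, n]) => `${reason}=${n}`).join(", ")
            : "no recorded failures";
        lines.push(`${profileId}: ${summary}`);
    }
    return lines;
}

// Example input mirroring the excerpt above.
const example = { usageStats: { "ollama:local": { failureCounts: { timeout: 1 } } } };
console.log(summarizeFailureCounts(example).join("\n"));
// prints "ollama:local: timeout=1"
```

In practice the store argument would come from parsing the auth-profiles.json file with JSON.parse; the file's location depends on the installation and is not stated here.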

Expected Behavior

The error message should report the actual reason that caused the cooldown:

  • If cooldown was caused by timeout → report (timeout)
  • If cooldown was caused by rate_limit → report (rate_limit)
  • If cooldown was caused by billing → report (billing)

Impact

  1. Misleading error messages - Users see "rate_limit" when the actual issue is a timeout/connectivity problem
  2. Difficult debugging - Can't distinguish between API rate limits vs infrastructure issues (server down, SSH tunnel dropped, network issues)
  3. Incorrect assumptions - Operators might wait for "rate limit reset" when the actual fix is restarting a service

Environment

  • OpenClaw version: v0.0.929
  • Provider affected: ollama (but bug affects all providers)
  • Actual cause: Ollama server unreachable (SSH tunnel issues)

Suggested Fix

The cooldown skip logic should read the actual reason from failureCounts and report it:

if (profileIds.length > 0 && !isAnyProfileAvailable) {
    // Use the most frequent failure reason recorded for the first profile;
    // fall back to "unknown" when no failure counts are stored.
    const failureCounts = authStore.usageStats?.[profileIds[0]]?.failureCounts ?? {};
    const [actualReason = "unknown"] = Object.keys(failureCounts).sort(
        (a, b) => (failureCounts[b] ?? 0) - (failureCounts[a] ?? 0)
    );

    attempts.push({
        provider: candidate.provider,
        model: candidate.model,
        error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
        reason: actualReason,  // ← Use actual reason
    });
    continue;
}
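
The fix above reads only profileIds[0]. When a provider has several profiles, a variant could sum the counts across all of them before picking a reason. This is a rough sketch under the same assumption about the usageStats shape; dominantFailureReason is a hypothetical name, not an existing OpenClaw function.

```javascript
// Hypothetical helper (not part of OpenClaw).
// Assumes usageStats has the shape
// { [profileId]: { failureCounts: { [reason]: number } } }.
function dominantFailureReason(usageStats, profileIds) {
    // Sum failure counts per reason across every listed profile.
    const totals = new Map();
    for (const id of profileIds) {
        const counts = usageStats?.[id]?.failureCounts ?? {};
        for (const [reason, count] of Object.entries(counts)) {
            totals.set(reason, (totals.get(reason) ?? 0) + (count ?? 0));
        }
    }

    // Pick the reason with the highest total; "unknown" if nothing was recorded.
    let best = "unknown";
    let bestCount = 0;
    for (const [reason, count] of totals) {
        if (count > bestCount) {
            best = reason;
            bestCount = count;
        }
    }
    return best;
}
```

The skip branch could then pass dominantFailureReason(authStore.usageStats, profileIds) as the reason. A count-based pick may not reflect the failure that triggered the current cooldown if the counts accumulate over time; whether they do is not confirmed by the excerpt above.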

Related

The (unknown) categorization for timeout errors (seen in the first log line) may also be worth investigating - timeouts should be consistently categorized as timeout, not unknown.
