Summary
When a provider enters cooldown due to a timeout error, subsequent error messages incorrectly report the reason as `rate_limit` instead of the actual cause (`timeout`). This makes debugging infrastructure issues (like unreachable Ollama servers or dropped SSH tunnels) very difficult.
Observed Behavior
Error logs show:
```
21:04:54.568Z: ollama/qwen2.5:32b-instruct-q4_K_M: LLM request timed out. (unknown)
21:04:54.570Z: ollama/...Provider ollama is in cooldown (all profiles unavailable) (rate_limit)
```

The first message correctly identifies the timeout, but immediately afterwards, once the provider is in cooldown, the skip message reports `rate_limit` instead of `timeout`.
Root Cause
In `dist/agents/model-fallback.js` (line ~165), when a provider is skipped due to cooldown, the reason is hardcoded:

```js
if (profileIds.length > 0 && !isAnyProfileAvailable) {
  attempts.push({
    provider: candidate.provider,
    model: candidate.model,
    error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
    reason: "rate_limit", // ← HARDCODED - should use actual reason
  });
  continue;
}
```

The actual failure reason IS correctly stored in `auth-profiles.json` under `usageStats[profileId].failureCounts`:
```jsonc
{
  "ollama:local": {
    "failureCounts": {
      "timeout": 1 // ← Correct reason is stored here
    }
  }
}
```

But this information is not used when reporting why the provider is in cooldown.
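To illustrate the idea, here is a minimal standalone sketch of turning those stored counts back into a reason string. The helper name `dominantFailureReason` and the inline stats object are illustrative only, not part of the OpenClaw codebase:

```js
// Hypothetical helper: pick the most frequently recorded failure reason
// for a profile, falling back to "unknown" when nothing is recorded.
function dominantFailureReason(failureCounts) {
  const entries = Object.entries(failureCounts ?? {});
  if (entries.length === 0) return "unknown";
  // Sort descending by count so the dominant cause wins.
  entries.sort(([, a], [, b]) => b - a);
  return entries[0][0];
}

// Example using the stats shape shown above:
const usageStats = {
  "ollama:local": { failureCounts: { timeout: 1 } },
};
console.log(dominantFailureReason(usageStats["ollama:local"].failureCounts)); // "timeout"
console.log(dominantFailureReason(undefined)); // "unknown"
```

With counts like `{ timeout: 1, rate_limit: 3 }` this would report `rate_limit`, so the most common cause wins when a profile has failed for multiple reasons.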
Expected Behavior
The error message should report the actual reason that caused the cooldown:
- If cooldown was caused by timeout → report `(timeout)`
- If cooldown was caused by rate_limit → report `(rate_limit)`
- If cooldown was caused by billing → report `(billing)`
Impact
- **Misleading error messages**: users see "rate_limit" when the actual issue is a timeout/connectivity problem
- **Difficult debugging**: can't distinguish between API rate limits and infrastructure issues (server down, SSH tunnel dropped, network issues)
- **Incorrect assumptions**: operators might wait for a "rate limit reset" when the actual fix is restarting a service
Environment
- OpenClaw version: v0.0.929
- Provider affected: ollama (but bug affects all providers)
- Actual cause: Ollama server unreachable (SSH tunnel issues)
Suggested Fix
The cooldown skip logic should read the actual reason from `failureCounts` and report it:

```js
if (profileIds.length > 0 && !isAnyProfileAvailable) {
  // Get the actual reason from the profile's failure counts
  const profileStats = authStore.usageStats?.[profileIds[0]];
  const actualReason = profileStats?.failureCounts
    ? Object.keys(profileStats.failureCounts).sort((a, b) =>
        (profileStats.failureCounts[b] ?? 0) - (profileStats.failureCounts[a] ?? 0)
      )[0] ?? "unknown"
    : "unknown";
  attempts.push({
    provider: candidate.provider,
    model: candidate.model,
    error: `Provider ${candidate.provider} is in cooldown (all profiles unavailable)`,
    reason: actualReason, // ← Use actual reason
  });
  continue;
}
```

Related
The `(unknown)` categorization for timeout errors (seen in the first log line) may also be worth investigating: timeouts should be consistently categorized as `timeout`, not `unknown`.
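As a sketch of that fix, a categorizer could match timeout-style messages before falling through to "unknown". This is illustrative only; the function name, regexes, and categories here are assumptions, not OpenClaw's actual classification code:

```js
// Illustrative sketch only - OpenClaw's real categorizer may differ.
// The idea: recognize timeout-ish messages before defaulting to "unknown".
function categorizeFailure(message) {
  const msg = String(message).toLowerCase();
  if (/timed? ?out|etimedout|deadline exceeded/.test(msg)) return "timeout";
  if (/rate limit|429|too many requests/.test(msg)) return "rate_limit";
  if (/billing|quota|payment/.test(msg)) return "billing";
  return "unknown";
}

// The message from the first log line would then classify correctly:
console.log(categorizeFailure("LLM request timed out.")); // "timeout"
```

With something like this in place, the first log line would read `(timeout)` rather than `(unknown)`, and the cooldown skip message would then have the correct reason to propagate.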