Skip to content

fix: improve stability across gateway, cli, and channels#45590

Closed
yiShanXin wants to merge 2 commits intoopenclaw:mainfrom
yiShanXin:improve-gateway-stability
Closed

fix: improve stability across gateway, cli, and channels#45590
yiShanXin wants to merge 2 commits intoopenclaw:mainfrom
yiShanXin:improve-gateway-stability

Conversation

@yiShanXin
Copy link
Copy Markdown
Contributor

This PR addresses several high-priority stability and reliability issues:

  1. Device Auth Bypass ([Bug]: dangerouslyDisableDeviceAuth: true ignored - device identity still required #45398, [Bug]: Control UI over LAN HTTP still requires device identity and does not send auth token #45481): Fixes dangerouslyDisableDeviceAuth being ignored for Control UI connections, especially over LAN.
  2. Devices CLI Reliability ([Bug]: 2026.3.12: openclaw devices list / devices approve fail against local loopback gateway, while web UI remains functional #45504): Improves local fallback logic when targeting loopback gateways, handling connection timeouts and closes gracefully.
  3. ByteString Header Errors ([Bug]: Cannot convert argument to a ByteString when calling model provider (zhipu-0ki/GLM-5) #45436): Adds header sanitization to prevent TypeError: Cannot convert argument to a ByteString when non-string values are passed to fetch.
  4. WhatsApp Inbound Delivery ([Bug]: WhatsApp channel shows linked/OK but inbound messages are not delivered (single tick), with repeated 440/401 reconnect loop #45474): Ensures messages from unknown JIDs (like @lid) are not silently dropped by falling back to raw identifiers when E.164 mapping is missing.
  5. Cron Job Visibility ([Bug]: Cron agent jobs silently time out during sustained LLM API outages instead of fast-failing on definitive errors #45494): Improves logging for job timeouts and auto-disabling, helping diagnose sustained LLM API outages.
  6. Usage Normalization: Enhances robustness when handling providers that return multiple token fields (previously part of fix: robust token usage normalization for OpenAI-compatible providers #45535 but included here for completeness if not yet merged).

@openclaw-barnacle openclaw-barnacle Bot added channel: whatsapp-web Channel integration: whatsapp-web gateway Gateway runtime cli CLI command changes agents Agent runtime and tooling size: S labels Mar 14, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 14, 2026

Greptile Summary

This PR bundles six targeted stability fixes across the gateway auth flow, CLI devices command, fetch infrastructure, WhatsApp inbound pipeline, cron logging, and usage normalization. Most changes are well-scoped and address real reported bugs. Two areas warrant closer attention before merging:

  • src/gateway/server/ws-connection/connect-policy.ts: The fix correctly makes dangerouslyDisableDeviceAuth respected in evaluateMissingDeviceIdentity, but the new condition (allowBypass alone, without sharedAuthOk) is asymmetric with the companion function shouldSkipControlUiPairing, which requires both allowBypass && sharedAuthOk. Confirm that credential verification is always enforced upstream before this function is reached.

  • src/cli/devices-cli.ts: The new isConnectionError detection uses message.includes("closed") and message.includes("timeout"), which are very common substrings. An auth-failure error that closes the socket would match and silently fall through to the local-pairing file, hiding the real error from the user. The isLoopbackHost guard does limit the scope, but consider tightening the patterns.

Other changes:

  • sanitizeHeaders in fetch-guard.ts cleanly prevents the ByteString TypeError across all three HeadersInit forms.
  • The WhatsApp JID fallback in monitor.ts + access-control changes correctly thread unmappable JIDs through the pipeline without dropping messages or corrupting E.164 comparisons.
  • The pick() helper in usage.ts improves handling of providers with multiple overlapping token fields; the two-pass positive-first logic is a behavior change worth documenting inline.
  • Cron logging addition is a no-risk observability improvement.

Confidence Score: 3/5

  • Safe to merge after verifying the auth ordering assumption in connect-policy.ts and tightening the error-string matching in devices-cli.ts.
  • The majority of changes are well-reasoned bug fixes with bounded scope. Two changes require author confirmation: the allowBypass-without-sharedAuthOk asymmetry in the gateway auth flow (potential security implication if the calling site doesn't enforce credential checks upstream), and the overly-broad "closed"/"timeout" substring matching in the CLI that could suppress real errors. Neither is a showstopper, but both need explicit verification or hardening before the PR is merged into a production path.
  • src/gateway/server/ws-connection/connect-policy.ts (auth ordering assumption) and src/cli/devices-cli.ts (broad error-string matching).
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cli/devices-cli.ts
Line: 103-107

Comment:
**Overly broad `"closed"` match may trigger false-positive fallbacks**

`message.includes("closed")` will match a wide range of error strings beyond network connection failures — e.g. `"WebSocket closed due to authentication failure"`, `"Connection closed after auth rejection"`, or even `"Session closed after PIN mismatch"`. Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.

Similarly, `"timeout"` alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.

Consider tightening the patterns to reduce false positives:
```suggestion
  const isConnectionError =
    message.includes("connection closed") ||
    message.includes("connection timeout") ||
    message.includes("socket closed") ||
    message.includes("fetch failed") ||
    message.includes("econnrefused");
```
The `isLoopbackHost` guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/net/fetch-guard.ts
Line: 231

Comment:
**`sanitizeHeaders` can silently strip `undefined`-valued headers without any warning**

When a caller passes a headers object that has keys with `undefined` or `null` values, `sanitizeHeaders` silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. `Authorization: undefined`), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via `String(value)` on line 153 could mask configuration bugs in upstream code.

Consider logging a `logWarn` the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:
```typescript
if (typeof value !== "string") {
  logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);
```
This keeps the fix in place while aiding future debugging.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/usage.ts
Line: 93-107

Comment:
**`pick()` changes field priority when a primary field is explicitly set to `0`**

The old `?? chain` would short-circuit at the first non-`null`/non-`undefined` value, including `0`. The new two-pass `pick()` skips any value that is `0` or negative in the first pass, meaning if a provider explicitly sets `raw.input = 0` alongside `raw.inputTokens = 100`, the old code would have returned `0` (the authoritative value for that field) while the new code returns `100` (a secondary alias).

This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting `0` for one field gets overridden by another field. A brief inline comment would help:
```typescript
// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/gateway/server/ws-connection/connect-policy.ts
Line: 82-83

Comment:
**Inconsistency between `allowBypass` handling here vs. `shouldSkipControlUiPairing`**

`shouldSkipControlUiPairing` (lines 35–44) requires both `policy.allowBypass && sharedAuthOk` to skip the pairing step — the `sharedAuthOk` flag is a second gate even when `dangerouslyDisableDeviceAuth` is enabled. The new condition here, however, allows the missing-device-identity check to pass on `allowBypass` alone, with no corresponding `sharedAuthOk` verification.

Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces `sharedAuthOk`.

If the intent is that credential verification is always handled upstream before `evaluateMissingDeviceIdentity` is reached, please add a comment here stating that so the ordering assumption is explicit.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 23230b0

Comment thread src/cli/devices-cli.ts
Comment on lines +103 to +107
const isConnectionError =
message.includes("closed") ||
message.includes("timeout") ||
message.includes("fetch failed") ||
message.includes("econnrefused");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly broad "closed" match may trigger false-positive fallbacks

message.includes("closed") will match a wide range of error strings beyond network connection failures — e.g. "WebSocket closed due to authentication failure", "Connection closed after auth rejection", or even "Session closed after PIN mismatch". Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.

Similarly, "timeout" alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.

Consider tightening the patterns to reduce false positives:

Suggested change
const isConnectionError =
message.includes("closed") ||
message.includes("timeout") ||
message.includes("fetch failed") ||
message.includes("econnrefused");
const isConnectionError =
message.includes("connection closed") ||
message.includes("connection timeout") ||
message.includes("socket closed") ||
message.includes("fetch failed") ||
message.includes("econnrefused");

The isLoopbackHost guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/devices-cli.ts
Line: 103-107

Comment:
**Overly broad `"closed"` match may trigger false-positive fallbacks**

`message.includes("closed")` will match a wide range of error strings beyond network connection failures — e.g. `"WebSocket closed due to authentication failure"`, `"Connection closed after auth rejection"`, or even `"Session closed after PIN mismatch"`. Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.

Similarly, `"timeout"` alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.

Consider tightening the patterns to reduce false positives:
```suggestion
  const isConnectionError =
    message.includes("connection closed") ||
    message.includes("connection timeout") ||
    message.includes("socket closed") ||
    message.includes("fetch failed") ||
    message.includes("econnrefused");
```
The `isLoopbackHost` guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.

How can I resolve this? If you propose a fix, please make it concise.


const init: RequestInit & { dispatcher?: Dispatcher } = {
...(currentInit ? { ...currentInit } : {}),
headers: sanitizeHeaders(currentInit?.headers),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanitizeHeaders can silently strip undefined-valued headers without any warning

When a caller passes a headers object that has keys with undefined or null values, sanitizeHeaders silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. Authorization: undefined), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via String(value) on line 153 could mask configuration bugs in upstream code.

Consider logging a logWarn the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:

if (typeof value !== "string") {
  logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);

This keeps the fix in place while aiding future debugging.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/net/fetch-guard.ts
Line: 231

Comment:
**`sanitizeHeaders` can silently strip `undefined`-valued headers without any warning**

When a caller passes a headers object that has keys with `undefined` or `null` values, `sanitizeHeaders` silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. `Authorization: undefined`), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via `String(value)` on line 153 could mask configuration bugs in upstream code.

Consider logging a `logWarn` the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:
```typescript
if (typeof value !== "string") {
  logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);
```
This keeps the fix in place while aiding future debugging.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread src/agents/usage.ts
Comment on lines +93 to +107
const pick = (...args: unknown[]) => {
for (const arg of args) {
const val = asFiniteNumber(arg);
if (val !== undefined && val > 0) {
return val;
}
}
for (const arg of args) {
const val = asFiniteNumber(arg);
if (val !== undefined) {
return val;
}
}
return undefined;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pick() changes field priority when a primary field is explicitly set to 0

The old ?? chain would short-circuit at the first non-null/non-undefined value, including 0. The new two-pass pick() skips any value that is 0 or negative in the first pass, meaning if a provider explicitly sets raw.input = 0 alongside raw.inputTokens = 100, the old code would have returned 0 (the authoritative value for that field) while the new code returns 100 (a secondary alias).

This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting 0 for one field gets overridden by another field. A brief inline comment would help:

// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/usage.ts
Line: 93-107

Comment:
**`pick()` changes field priority when a primary field is explicitly set to `0`**

The old `?? chain` would short-circuit at the first non-`null`/non-`undefined` value, including `0`. The new two-pass `pick()` skips any value that is `0` or negative in the first pass, meaning if a provider explicitly sets `raw.input = 0` alongside `raw.inputTokens = 100`, the old code would have returned `0` (the authoritative value for that field) while the new code returns `100` (a secondary alias).

This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting `0` for one field gets overridden by another field. A brief inline comment would help:
```typescript
// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +82 to 83
if (params.isControlUi && (params.trustedProxyAuthOk || params.controlUiAuthPolicy.allowBypass)) {
return { kind: "allow" };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency between allowBypass handling here vs. shouldSkipControlUiPairing

shouldSkipControlUiPairing (lines 35–44) requires both policy.allowBypass && sharedAuthOk to skip the pairing step — the sharedAuthOk flag is a second gate even when dangerouslyDisableDeviceAuth is enabled. The new condition here, however, allows the missing-device-identity check to pass on allowBypass alone, with no corresponding sharedAuthOk verification.

Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces sharedAuthOk.

If the intent is that credential verification is always handled upstream before evaluateMissingDeviceIdentity is reached, please add a comment here stating that so the ordering assumption is explicit.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server/ws-connection/connect-policy.ts
Line: 82-83

Comment:
**Inconsistency between `allowBypass` handling here vs. `shouldSkipControlUiPairing`**

`shouldSkipControlUiPairing` (lines 35–44) requires both `policy.allowBypass && sharedAuthOk` to skip the pairing step — the `sharedAuthOk` flag is a second gate even when `dangerouslyDisableDeviceAuth` is enabled. The new condition here, however, allows the missing-device-identity check to pass on `allowBypass` alone, with no corresponding `sharedAuthOk` verification.

Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces `sharedAuthOk`.

If the intent is that credential verification is always handled upstream before `evaluateMissingDeviceIdentity` is reached, please add a comment here stating that so the ordering assumption is explicit.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b59d51139

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +82 to 83
if (params.isControlUi && (params.trustedProxyAuthOk || params.controlUiAuthPolicy.allowBypass)) {
return { kind: "allow" };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict control-ui bypass to authenticated operator paths

This condition now returns allow for any Control UI client whenever dangerouslyDisableDeviceAuth is enabled, which bypasses the later roleCanSkipDeviceIdentity check. In practice, a client can present client.id as Control UI with role: "node" and (for example) run under auth.mode="none", and the handshake no longer requires a device identity; before this change, node role could not skip device identity. That broadens a UI-only bypass into a general role bypass and can unintentionally grant node-capable sessions without pairing.

Useful? React with 👍 / 👎.

Comment thread src/agents/usage.ts
Comment on lines +96 to +97
if (val !== undefined && val > 0) {
return val;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve field precedence when selecting usage aliases

The new pick helper prefers any positive alias over earlier fields, which changes normalization from precedence-based to value-based selection. When providers include mixed aliases (for example input as net cached-adjusted value and prompt_tokens as raw prompt count), this can select prompt_tokens even when input is intentionally 0/negative, and then cacheRead is added again downstream, inflating prompt/cost totals. This regression is specific to multi-alias payloads, but those are explicitly what this change targets.

Useful? React with 👍 / 👎.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 26, 2026

Closing this as duplicate or superseded after Codex automated review.

PR #45590 is a stale broad stability bundle. Current main has safer implementations for the gateway bypass, ByteString secret normalization, and usage normalization pieces, while the remaining WhatsApp, cron, and loopback-node/device work is better tracked in narrower related items. The PR also contains review-flagged regressions and obsolete WhatsApp paths.

Best possible solution:

Close the stale bundled PR as superseded. Keep the safer current-main implementations for gateway auth, ByteString credential normalization, and usage normalization, and handle remaining behavior through the narrower related items: #45474, #45494, #45535, and #48285.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 8bbb143ab87e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling channel: whatsapp-web Channel integration: whatsapp-web cli CLI command changes gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant