fix: improve stability across gateway, cli, and channels#45590
fix: improve stability across gateway, cli, and channels#45590yiShanXin wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR bundles six targeted stability fixes across the gateway auth flow, CLI devices command, fetch infrastructure, WhatsApp inbound pipeline, cron logging, and usage normalization. Most changes are well-scoped and address real reported bugs. Two areas warrant closer attention before merging:
Other changes:
Confidence Score: 3/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/cli/devices-cli.ts
Line: 103-107
Comment:
**Overly broad `"closed"` match may trigger false-positive fallbacks**
`message.includes("closed")` will match a wide range of error strings beyond network connection failures — e.g. `"WebSocket closed due to authentication failure"`, `"Connection closed after auth rejection"`, or even `"Session closed after PIN mismatch"`. Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.
Similarly, `"timeout"` alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.
Consider tightening the patterns to reduce false positives:
```suggestion
const isConnectionError =
message.includes("connection closed") ||
message.includes("connection timeout") ||
message.includes("socket closed") ||
message.includes("fetch failed") ||
message.includes("econnrefused");
```
The `isLoopbackHost` guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/net/fetch-guard.ts
Line: 231
Comment:
**`sanitizeHeaders` can silently strip `undefined`-valued headers without any warning**
When a caller passes a headers object that has keys with `undefined` or `null` values, `sanitizeHeaders` silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. `Authorization: undefined`), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via `String(value)` on line 153 could mask configuration bugs in upstream code.
Consider logging a `logWarn` the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:
```typescript
if (typeof value !== "string") {
logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);
```
This keeps the fix in place while aiding future debugging.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/usage.ts
Line: 93-107
Comment:
**`pick()` changes field priority when a primary field is explicitly set to `0`**
The old `?? chain` would short-circuit at the first non-`null`/non-`undefined` value, including `0`. The new two-pass `pick()` skips any value that is `0` or negative in the first pass, meaning if a provider explicitly sets `raw.input = 0` alongside `raw.inputTokens = 100`, the old code would have returned `0` (the authoritative value for that field) while the new code returns `100` (a secondary alias).
This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting `0` for one field gets overridden by another field. A brief inline comment would help:
```typescript
// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/gateway/server/ws-connection/connect-policy.ts
Line: 82-83
Comment:
**Inconsistency between `allowBypass` handling here vs. `shouldSkipControlUiPairing`**
`shouldSkipControlUiPairing` (lines 35–44) requires both `policy.allowBypass && sharedAuthOk` to skip the pairing step — the `sharedAuthOk` flag is a second gate even when `dangerouslyDisableDeviceAuth` is enabled. The new condition here, however, allows the missing-device-identity check to pass on `allowBypass` alone, with no corresponding `sharedAuthOk` verification.
Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces `sharedAuthOk`.
If the intent is that credential verification is always handled upstream before `evaluateMissingDeviceIdentity` is reached, please add a comment here stating that so the ordering assumption is explicit.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: 23230b0 |
| const isConnectionError = | ||
| message.includes("closed") || | ||
| message.includes("timeout") || | ||
| message.includes("fetch failed") || | ||
| message.includes("econnrefused"); |
There was a problem hiding this comment.
Overly broad "closed" match may trigger false-positive fallbacks
message.includes("closed") will match a wide range of error strings beyond network connection failures — e.g. "WebSocket closed due to authentication failure", "Connection closed after auth rejection", or even "Session closed after PIN mismatch". Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.
Similarly, "timeout" alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.
Consider tightening the patterns to reduce false positives:
| const isConnectionError = | |
| message.includes("closed") || | |
| message.includes("timeout") || | |
| message.includes("fetch failed") || | |
| message.includes("econnrefused"); | |
| const isConnectionError = | |
| message.includes("connection closed") || | |
| message.includes("connection timeout") || | |
| message.includes("socket closed") || | |
| message.includes("fetch failed") || | |
| message.includes("econnrefused"); |
The isLoopbackHost guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/devices-cli.ts
Line: 103-107
Comment:
**Overly broad `"closed"` match may trigger false-positive fallbacks**
`message.includes("closed")` will match a wide range of error strings beyond network connection failures — e.g. `"WebSocket closed due to authentication failure"`, `"Connection closed after auth rejection"`, or even `"Session closed after PIN mismatch"`. Any of these would incorrectly trigger the local pairing fallback even when the gateway is reachable and the error is not a connectivity issue.
Similarly, `"timeout"` alone (without word boundaries) can match domain-specific error messages unrelated to a TCP/network timeout.
Consider tightening the patterns to reduce false positives:
```suggestion
const isConnectionError =
message.includes("connection closed") ||
message.includes("connection timeout") ||
message.includes("socket closed") ||
message.includes("fetch failed") ||
message.includes("econnrefused");
```
The `isLoopbackHost` guard downstream does limit the blast radius, but an auth-error response that contains "closed" would still silently fall back to local files rather than surfacing the real auth problem to the user.
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| const init: RequestInit & { dispatcher?: Dispatcher } = { | ||
| ...(currentInit ? { ...currentInit } : {}), | ||
| headers: sanitizeHeaders(currentInit?.headers), |
There was a problem hiding this comment.
sanitizeHeaders can silently strip undefined-valued headers without any warning
When a caller passes a headers object that has keys with undefined or null values, sanitizeHeaders silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. Authorization: undefined), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via String(value) on line 153 could mask configuration bugs in upstream code.
Consider logging a logWarn the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:
if (typeof value !== "string") {
logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);This keeps the fix in place while aiding future debugging.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/net/fetch-guard.ts
Line: 231
Comment:
**`sanitizeHeaders` can silently strip `undefined`-valued headers without any warning**
When a caller passes a headers object that has keys with `undefined` or `null` values, `sanitizeHeaders` silently drops those keys. If the caller intended to unset a header that a prior redirect handler set (e.g. `Authorization: undefined`), the drop is correct. But if a non-string header value was accidentally introduced (the original bug this is fixing), the silent coercion via `String(value)` on line 153 could mask configuration bugs in upstream code.
Consider logging a `logWarn` the first time a non-string header value is encountered, so the real root-cause is surfaced rather than silently patched:
```typescript
if (typeof value !== "string") {
logWarn(`fetch-guard: coercing non-string header "${key}" (${typeof value}) to string`);
}
out[key] = String(value);
```
This keeps the fix in place while aiding future debugging.
How can I resolve this? If you propose a fix, please make it concise.| const pick = (...args: unknown[]) => { | ||
| for (const arg of args) { | ||
| const val = asFiniteNumber(arg); | ||
| if (val !== undefined && val > 0) { | ||
| return val; | ||
| } | ||
| } | ||
| for (const arg of args) { | ||
| const val = asFiniteNumber(arg); | ||
| if (val !== undefined) { | ||
| return val; | ||
| } | ||
| } | ||
| return undefined; | ||
| }; |
There was a problem hiding this comment.
pick() changes field priority when a primary field is explicitly set to 0
The old ?? chain would short-circuit at the first non-null/non-undefined value, including 0. The new two-pass pick() skips any value that is 0 or negative in the first pass, meaning if a provider explicitly sets raw.input = 0 alongside raw.inputTokens = 100, the old code would have returned 0 (the authoritative value for that field) while the new code returns 100 (a secondary alias).
This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting 0 for one field gets overridden by another field. A brief inline comment would help:
// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/usage.ts
Line: 93-107
Comment:
**`pick()` changes field priority when a primary field is explicitly set to `0`**
The old `?? chain` would short-circuit at the first non-`null`/non-`undefined` value, including `0`. The new two-pass `pick()` skips any value that is `0` or negative in the first pass, meaning if a provider explicitly sets `raw.input = 0` alongside `raw.inputTokens = 100`, the old code would have returned `0` (the authoritative value for that field) while the new code returns `100` (a secondary alias).
This is a subtle but real behavior change. While the pi-ai negative-value case is well-handled, it's worth documenting that this is intentional so future readers understand why a provider explicitly reporting `0` for one field gets overridden by another field. A brief inline comment would help:
```typescript
// First pass: prefer any positive value (handles providers that set a primary field to 0
// or negative when a secondary alias has the real count, e.g. pi-ai cached-token subtraction).
for (const arg of args) {
```
How can I resolve this? If you propose a fix, please make it concise.| if (params.isControlUi && (params.trustedProxyAuthOk || params.controlUiAuthPolicy.allowBypass)) { | ||
| return { kind: "allow" }; |
There was a problem hiding this comment.
Inconsistency between allowBypass handling here vs. shouldSkipControlUiPairing
shouldSkipControlUiPairing (lines 35–44) requires both policy.allowBypass && sharedAuthOk to skip the pairing step — the sharedAuthOk flag is a second gate even when dangerouslyDisableDeviceAuth is enabled. The new condition here, however, allows the missing-device-identity check to pass on allowBypass alone, with no corresponding sharedAuthOk verification.
Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces sharedAuthOk.
If the intent is that credential verification is always handled upstream before evaluateMissingDeviceIdentity is reached, please add a comment here stating that so the ordering assumption is explicit.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server/ws-connection/connect-policy.ts
Line: 82-83
Comment:
**Inconsistency between `allowBypass` handling here vs. `shouldSkipControlUiPairing`**
`shouldSkipControlUiPairing` (lines 35–44) requires both `policy.allowBypass && sharedAuthOk` to skip the pairing step — the `sharedAuthOk` flag is a second gate even when `dangerouslyDisableDeviceAuth` is enabled. The new condition here, however, allows the missing-device-identity check to pass on `allowBypass` alone, with no corresponding `sharedAuthOk` verification.
Depending on how these two functions are composed in the connection handshake, this difference in gate requirements could mean that a Control UI client without valid credentials passes the identity check via this path while the pairing-skip path still enforces `sharedAuthOk`.
If the intent is that credential verification is always handled upstream before `evaluateMissingDeviceIdentity` is reached, please add a comment here stating that so the ordering assumption is explicit.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b59d51139
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (params.isControlUi && (params.trustedProxyAuthOk || params.controlUiAuthPolicy.allowBypass)) { | ||
| return { kind: "allow" }; |
There was a problem hiding this comment.
Restrict control-ui bypass to authenticated operator paths
This condition now returns allow for any Control UI client whenever dangerouslyDisableDeviceAuth is enabled, which bypasses the later roleCanSkipDeviceIdentity check. In practice, a client can present client.id as Control UI with role: "node" and (for example) run under auth.mode="none", and the handshake no longer requires a device identity; before this change, node role could not skip device identity. That broadens a UI-only bypass into a general role bypass and can unintentionally grant node-capable sessions without pairing.
Useful? React with 👍 / 👎.
| if (val !== undefined && val > 0) { | ||
| return val; |
There was a problem hiding this comment.
Preserve field precedence when selecting usage aliases
The new pick helper prefers any positive alias over earlier fields, which changes normalization from precedence-based to value-based selection. When providers include mixed aliases (for example input as net cached-adjusted value and prompt_tokens as raw prompt count), this can select prompt_tokens even when input is intentionally 0/negative, and then cacheRead is added again downstream, inflating prompt/cost totals. This regression is specific to multi-alias payloads, but those are explicitly what this change targets.
Useful? React with 👍 / 👎.
|
Closing this as duplicate or superseded after Codex automated review. PR #45590 is a stale broad stability bundle. Current main has safer implementations for the gateway bypass, ByteString secret normalization, and usage normalization pieces, while the remaining WhatsApp, cron, and loopback-node/device work is better tracked in narrower related items. The PR also contains review-flagged regressions and obsolete WhatsApp paths. Best possible solution: Close the stale bundled PR as superseded. Keep the safer current-main implementations for gateway auth, ByteString credential normalization, and usage normalization, and handle remaining behavior through the narrower related items: #45474, #45494, #45535, and #48285. What I checked:
So I’m closing this here and keeping the remaining discussion on the canonical linked item. Codex Review notes: model gpt-5.5, reasoning high; reviewed against 8bbb143ab87e. |
This PR addresses several high-priority stability and reliability issues:
dangerouslyDisableDeviceAuthbeing ignored for Control UI connections, especially over LAN.TypeError: Cannot convert argument to a ByteStringwhen non-string values are passed to fetch.