fix(wacli-keepalive): defer initial backfill until --follow is connected#138
fix(wacli-keepalive): defer initial backfill until --follow is connected#138auroracapital merged 6 commits intomainfrom
Conversation
Before this change the keepalive ran auto_detect_empty_chats → detect_missed_messages
→ run_backfill at startup, before `wacli sync --follow` had a chance to establish
a persistent WebSocket. Each `wacli history backfill` subprocess opens its own
connection to web.whatsapp.com, so with 40–50 queued chats the DNS lookups and
new-session handshakes race each other, hitting WhatsApp's fresh-session rate
limit. In practice this produced either floods of `failed to dial whatsapp web
websocket: ... lookup web.whatsapp.com: no such host` errors, or 30-second
on-demand history sync timeouts — blocking the transition to --follow for 5–20
minutes on every launchd restart and looking like wacli-sync was dying.
Fix: remove the pre-follow backfill block and set LAST_BACKFILL_TIME=0 so the
first iteration of the --follow supervisor loop fires periodic_backfill
immediately. periodic_backfill already pauses --follow via acquire_wacli_batch,
runs the same three functions through the warm session, and releases — no new
code path needed. The effect is that the persistent connection comes up in
seconds instead of minutes, and backfill runs serially against the existing
socket instead of flooding WhatsApp with parallel handshakes.
Net behaviour: --follow is now the first thing that gets established after
bootstrap + refresh_wacli_cache, exactly matching the design documented in the
script header ("Run wacli sync --follow for persistent message streaming").
…before doppler
The daemon that cron-schedules memory-extractor runs under launchd without
interactive Doppler auth, so `doppler secrets get ANTHROPIC_API_KEY --plain`
returned empty and the script died with
FATAL: ANTHROPIC_API_KEY not set and doppler unavailable
every 30 minutes. No memory files were ever written.
Fix: extend resolve_api_key() with a macOS Keychain lookup that runs BEFORE
the doppler path — `security find-generic-password -s ANTHROPIC_API_KEY -w`.
This works in launchd contexts (keychain is user-scoped, not shell-scoped)
and survives plugin upgrades. The doppler path is also made explicit about
project/config via new OPS_DOPPLER_PROJECT / OPS_DOPPLER_CONFIG env vars so
users whose daemon env IS scoped to Doppler still get a deterministic lookup
instead of depending on ambient CLI auth.
New precedence (top-down):
1. ANTHROPIC_API_KEY already in env
2. macOS Keychain service=ANTHROPIC_API_KEY (any account)
3. Doppler with OPS_DOPPLER_PROJECT + OPS_DOPPLER_CONFIG
4. Doppler with ambient scope
Seed the keychain entry with:
security add-generic-password -U -s ANTHROPIC_API_KEY -a ops-daemon \
-w "$(doppler secrets get ANTHROPIC_API_KEY --plain \
--project <your-project> --config <your-config>)"
The setup wizard should do this automatically during ops:setup — tracked
separately.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 49 minutes and 34 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdd macOS Keychain as a fallback when resolving Changes
Sequence Diagram(s)sequenceDiagram
participant Script as Script
participant Env as Environment
participant Keychain as macOS Keychain
participant Doppler as Doppler Service
Script->>Env: check ANTHROPIC_API_KEY
alt found in environment
Env-->>Script: return API key
else not found
Script->>Keychain: security find-generic-password -s ANTHROPIC_API_KEY -w
alt found in keychain
Keychain-->>Script: return API key
else not found
Script->>Doppler: doppler secrets get ANTHROPIC_API_KEY<br/>(+ --project/--config if set)
alt retrieved from Doppler
Doppler-->>Script: return API key
else all sources exhausted
Script-->>Script: error listing env, Keychain, Doppler attempts
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Address three race conditions flagged in PR review: 1. LAST_BACKFILL_TIME=0 fired backfill immediately on the first supervisor tick, tearing down the just-started --follow connection before it could stabilise. Replaced with a 30s stabilisation delay (INITIAL_BACKFILL_DELAY) so --follow has time to establish the persistent WebSocket. 2. detect_missed_messages called release_wacli_batch internally, removing BATCH_MARKER while the background subshell was still running. The main supervisor loop then saw dead sync + no marker and broke out, creating a restart loop. Fixed by holding the batch lock at the subshell level via _WACLI_BATCH_HELD, making inner acquire/release calls no-ops. 3. auto_detect_empty_chats ran inside periodic_backfill before acquire_wacli_batch, failing silently due to store-lock contention with the running --follow process. Now runs after the subshell acquires the batch lock.
Claude Code stores an OAuth access token in the macOS Keychain under service
"Claude Code-credentials" (JSON blob → .claudeAiOauth.accessToken, format
sk-ant-oat01-*). When present and non-expired, that token authorises calls to
api.anthropic.com via `Authorization: Bearer <token>` + the
`anthropic-beta: oauth-2025-04-20` header, and the usage is billed against
the user's Max/Pro subscription instead of per-token API rates.
This matters here because memory-extractor runs every 30 min under launchd
and was previously blocked on a missing ANTHROPIC_API_KEY. Even after my
previous commit added a keychain + doppler fallback for the API key, a user
on a Claude Max plan would still be paying API rates for background work.
There is also a known behavioural gotcha with Claude Code itself: if
ANTHROPIC_API_KEY is exported in the shell (profile, direnv, etc.) Claude
Code prefers that key over the signed-in OAuth session, silently switching
the user from subscription billing to metered billing. Preferring OAuth
here — and pointedly NOT exporting ANTHROPIC_API_KEY — sidesteps that.
Changes in this commit:
- `resolve_api_key` replaced by `resolve_auth` with precedence:
1. Claude Code OAuth token from keychain (requires ≥ 60 s remaining life
to avoid mid-call expiry)
2. $ANTHROPIC_API_KEY env var
3. macOS Keychain service "ANTHROPIC_API_KEY"
4. Doppler (respects OPS_DOPPLER_PROJECT / OPS_DOPPLER_CONFIG)
Emits three globals consumed by call_claude: OPS_AUTH_HEADER,
OPS_AUTH_MODE ("oauth"|"apikey"), OPS_AUTH_EXTRA_HEADERS (array; OAuth
mode adds the anthropic-beta header).
- `resolve_api_key` kept as a back-compat alias.
- `call_claude` now emits `OPS_AUTH_HEADER` and expands
`OPS_AUTH_EXTRA_HEADERS[@]` instead of hardcoding `x-api-key`.
- Neither path exports the credential to child processes — reduces blast
radius and avoids poisoning Claude Code sessions launched from the same
shell.
Drive-by fix: two `<<PYEOF` heredocs were unquoted, so bash expanded
backticks inside the embedded Python regex
re.sub(r'^```(?:json)?\s*', '', raw.strip())
as command substitution, which blew up with
"syntax error near unexpected token `?\s*'".
This made the JSON-parse step silently fail and produced "Total memory
size: 0 bytes" on every successful API call. Quoted both heredoc
delimiters to `<<'PYEOF'`. Args after the delimiter on the same line are
unaffected by quoting.
Validated end-to-end with ANTHROPIC_API_KEY unset and the keychain API-key
entry removed: OAuth path activated, call_claude logged
`Auth: Claude Code OAuth (subscription — no per-token billing)`, API
returned 200, and 12 memory files were written (9 contacts, preferences,
topics_active, donts) totalling 8.8 KB.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
claude-ops/scripts/wacli-keepalive.sh (1)
440-457:⚠️ Potential issue | 🟠 MajorKeep the restart signal separate from the active batch lock.
The background subshell now holds
BATCH_MARKER, but the supervisor still removes that same marker at Line 564 before restarting. That can restartsync --followwhile the subshell is still backfilling; if the subshell releases first, Line 560 can instead treat the self-killed sync as an unexpected exit.Use separate “batch active” and “restart requested” markers, or wait for the active marker to clear before restarting without deleting it from the supervisor path.
Suggested direction
+# Separate "active batch" from "sync was intentionally killed for a batch". +BATCH_RESTART_MARKER="$STORE/.batch_restart_wacli" + acquire_wacli_batch() { # Reentrant: if an outer caller already holds the batch, skip the real work. # _WACLI_BATCH_HELD is set by periodic_backfill's subshell to prevent inner # functions (detect_missed_messages, write_backfill_memory) from releasing the # marker while the subshell is still running. if [[ "${_WACLI_BATCH_HELD:-0}" == "1" ]]; then return 0; fi - touch "$BATCH_MARKER" + touch "$BATCH_MARKER" "$BATCH_RESTART_MARKER" ... } ... ( # Hold the batch lock for the entire subshell so the supervisor loop # always sees BATCH_MARKER while we are working. _WACLI_BATCH_HELD # makes the inner acquire/release calls in detect_missed_messages and # write_backfill_memory no-ops, preventing premature marker removal. acquire_wacli_batch _WACLI_BATCH_HELD=1 + trap '_WACLI_BATCH_HELD=0; release_wacli_batch' EXIT auto_detect_empty_chats detect_missed_messages if [[ -f "$STORE/.backfill_jids" ]] && [[ -s "$STORE/.backfill_jids" ]]; then run_backfill write_backfill_memory fi - _WACLI_BATCH_HELD=0 - release_wacli_batch ) & ... - if ! check_pause_signal && [[ ! -f "$BATCH_MARKER" ]]; then + if ! check_pause_signal && [[ ! -f "$BATCH_RESTART_MARKER" ]]; then break fi - # Clear any stale batch marker before restarting - rm -f "$BATCH_MARKER" + while [[ -f "$BATCH_MARKER" ]]; do + sleep 2 + done + rm -f "$BATCH_RESTART_MARKER" doneAlso applies to: 560-564
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@claude-ops/scripts/wacli-keepalive.sh` around lines 440 - 457, The supervisor is removing the same BATCH_MARKER used by the background subshell, causing races between backfill and restart; change the protocol so batch locking and restart requests use separate markers and checks: keep acquire_wacli_batch/release_wacli_batch and _WACLI_BATCH_HELD to manage the batch marker only, add a new request_wacli_restart (or a $STORE/.restart_requested marker) that the subshell sets when it wants a restart instead of deleting the batch marker, and modify the supervisor restart logic to check for the batch-active marker (via is_wacli_batch_active or checking BATCH_MARKER) and wait for it to clear before proceeding while only removing the restart marker (not the batch marker); ensure any code path that currently deletes the batch marker is updated to delete the restart marker instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@claude-ops/scripts/wacli-keepalive.sh`:
- Around line 440-457: The supervisor is removing the same BATCH_MARKER used by
the background subshell, causing races between backfill and restart; change the
protocol so batch locking and restart requests use separate markers and checks:
keep acquire_wacli_batch/release_wacli_batch and _WACLI_BATCH_HELD to manage the
batch marker only, add a new request_wacli_restart (or a
$STORE/.restart_requested marker) that the subshell sets when it wants a restart
instead of deleting the batch marker, and modify the supervisor restart logic to
check for the batch-active marker (via is_wacli_batch_active or checking
BATCH_MARKER) and wait for it to clear before proceeding while only removing the
restart marker (not the batch marker); ensure any code path that currently
deletes the batch marker is updated to delete the restart marker instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0fc8a440-fdab-45b7-a73f-e94112f8252b
📒 Files selected for processing (1)
claude-ops/scripts/wacli-keepalive.sh
…ions The find-generic-password call omitted -a ops-daemon, while the seed command in the header comments used it. Without the account filter, security(1) returns the first matching service entry across all accounts — wrong key if multiple ANTHROPIC_API_KEY entries exist. Add -a ops-daemon to the lookup and update the precedence comment from '(any account)' to 'account "ops-daemon"'.
…in account filter - Make periodic_backfill synchronous (remove background subshell) to fix a race where a fast-finishing backfill removes BATCH_MARKER before the supervisor loop checks it, causing the script to exit and launchd to restart in a boot loop. - Leave BATCH_MARKER in place after backfill so the supervisor restarts sync instead of exiting; add early break in the monitor loop to skip refresh_wacli_cache when sync is already dead. - Add -a ops-daemon to the keychain lookup in ops-memory-extractor.sh so it matches the seed command and reliably retrieves the daemon-specific API key entry.
CodeRabbit (ops-memory-extractor.sh:117): add `-a ops-daemon` to the `security find-generic-password` lookup so it matches the seed command and reliably returns the daemon-specific keychain entry. CodeRabbit/blocksorg (wacli-keepalive.sh): close the BATCH_MARKER race where the supervisor could `rm -f BATCH_MARKER` and restart --follow while the periodic_backfill subshell was still executing run_backfill. Fix: capture `_BACKFILL_PID` after forking the subshell and wait for it before clearing the marker in the outer restart block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -H "x-api-key: ${ANTHROPIC_API_KEY}" \ | ||
| -H "${OPS_AUTH_HEADER}" \ | ||
| -H "anthropic-version: 2023-06-01" \ | ||
| "${OPS_AUTH_EXTRA_HEADERS[@]}" \ |
There was a problem hiding this comment.
🔴 Empty array expansion of OPS_AUTH_EXTRA_HEADERS crashes script on macOS bash 3.2 under set -u
When using API key auth (not OAuth), OPS_AUTH_EXTRA_HEADERS is set to an empty array () at ops-memory-extractor.sh:82. The curl command at line 306 expands "${OPS_AUTH_EXTRA_HEADERS[@]}", which under set -u (line 5) causes a fatal "unbound variable" error on bash < 4.4. macOS ships with bash 3.2 (due to GPLv3 licensing), and this script explicitly targets macOS (security keychain, BSD date -v, launchd references). This means the entire API key fallback path — the primary auth method for users without Claude Code OAuth — is broken on macOS with the default system bash.
| "${OPS_AUTH_EXTRA_HEADERS[@]}" \ | |
| ${OPS_AUTH_EXTRA_HEADERS[@]+"${OPS_AUTH_EXTRA_HEADERS[@]}"} \ |
Was this helpful? React with 👍 or 👎 to provide feedback.
| -H "x-api-key: ${ANTHROPIC_API_KEY}" \ | ||
| -H "${OPS_AUTH_HEADER}" \ | ||
| -H "anthropic-version: 2023-06-01" \ | ||
| "${OPS_AUTH_EXTRA_HEADERS[@]}" \ |
There was a problem hiding this comment.
Empty-array expansion with set -u will crash on bash < 4.4 (macOS system bash 3.2)
Both scripts declare set -euo pipefail. In the API-key auth path resolve_auth sets OPS_AUTH_EXTRA_HEADERS=() (empty array). On bash < 4.4 expanding an empty array subscript under set -u throws:
-bash: OPS_AUTH_EXTRA_HEADERS[@]: unbound variable
This kills the curl call and causes die to fire — the memory extractor exits without writing any files. It's a silent regression for users whose launchd PATH resolves to /bin/bash (macOS system bash 3.2) rather than a Homebrew bash 5.x, which is the common launchd case when the plist doesn't explicitly extend PATH.
The OAuth path is unaffected (array is non-empty: (-H "anthropic-beta: oauth-2025-04-20")), but any user falling through to the API-key path will hit this.
Fix — use the conditional-expansion form, which is safe on bash 3.2 with set -u:
# Before
"${OPS_AUTH_EXTRA_HEADERS[@]}" \
# After
${OPS_AUTH_EXTRA_HEADERS[@]+"${OPS_AUTH_EXTRA_HEADERS[@]}"} \${var[@]+word} does not trigger nounset — it explicitly tests whether the parameter is set before expanding — and with an empty array the condition is false so nothing is emitted, avoiding the unbound-variable error.
Minor release bundling three feature PRs and multiple security/stability fixes: - PR #141: /gtm — cross-channel go-to-market planning skill - PR #139: /ops:projects portfolio dashboard + GSD registry sync - PR #140: ops-speedup v2 parity (GPU/ANE monitoring, power hogs, OS actions) - PR #138: ops-memory-extractor Claude Code OAuth support + wacli persistent --follow fix Bug fixes (see CHANGELOG.md [1.7.0] for full detail): - SEV-9: ops-speedup eval shell-injection (Seer) - SEV-9: ops-projects hardcoded /Users/ path breaking for all other users (Seer + blocksorg + cursor + devin + codex) - SEV-8: ops-speedup RETURN trap race + systemd mask allowlist - SEV-7: ops-speedup lsof +D probe wedge, daemon-services backing-script gap, ops-projects AskUserQuestion allowed-tools mismatch - wacli --follow torn down by immediate backfill (now INITIAL_BACKFILL_DELAY=30 + reentrant guard) Bumps claude-ops/package.json, claude-ops/.claude-plugin/plugin.json, and .claude-plugin/marketplace.json plugins[0].version 1.6.2 → 1.7.0.
Summary
--followbackfill block (auto_detect_empty_chats→detect_missed_messages→run_backfill) that ran at lines 347–349 ofclaude-ops/scripts/wacli-keepalive.sh.LAST_BACKFILL_TIME=0so the first iteration of the--followsupervisor loop firesperiodic_backfillimmediately — reusing the warm session viaacquire_wacli_batchinstead of opening dozens of fresh WebSockets.Why
Symptom reported after a fresh daemon restart:
wacli-syncmarked disconnected,/ops:commscommands slow, daemon logs showHEALTH: wacli-sync pid=N is deadevery ~90s, keepalive log flooded with:…or on the better-connected path:
Root cause: each
wacli history backfill --chat=<jid>invocation opens its own WhatsApp WebSocket. On startup the script queues 40–50 chats (auto_detect_empty_chats+detect_missed_messages) and immediately runs them serially throughrun_backfill— beforewacli sync --followhas a chance to establish the persistent connection the rest of the system expects.When that many subprocess handshakes fire in quick succession, WhatsApp's fresh-session rate limit trips. The failures arrive as DNS lookup errors (throttled resolver →
no such host) or 30 s timeouts waiting for the on-demand history sync response. Either way, the script never reaches line 494 (SYNC: starting persistent sync --follow) for 5–20 minutes after every launchd restart, and user-facingwacli doctorreportsconnected:falsethe whole time.The fix
periodic_backfillalready encapsulates the correct ordering: pause--followviaacquire_wacli_batch→ run the same three functions (auto_detect_empty_chats+detect_missed_messages+run_backfill+write_backfill_memory) → release. Forcing its first iteration to fire immediately (viaLAST_BACKFILL_TIME=0) gives us the exact same behaviour without the pre---followrace — backfills run one at a time against the warm session, and the persistent connection comes up in seconds.This also matches what the script's own header promises:
"3. Runwacli sync --followfor persistent message streaming".Diff stats
1 file changed, 15 insertions(+), 4 deletions(-)
Test plan
bash -n claude-ops/scripts/wacli-keepalive.sh— syntax OKtests/test-no-secrets.sh— 11/11 passlaunchctl kickstart -k gui/$(id -u)/com.claude-ops.daemon) and confirmwacli doctorreportsconnected:truewithin ~30 s instead of 5+ min.periodic_backfillcycle runs within the first monitor-loop iteration after--followstarts (checkLAST_BACKFILL_TIMElog line andPERIODIC-BACKFILL: checking for chats needing backfillappearing near the top ofwacli-keepalive.log).BACKFILL_INTERVAL(default 30 min).Out of scope
SC2069warning from shellcheck on line 177 (2>&1 > "$probe_log") — not touched here.ENSURE:step auto-reinstalling a stalecom.claude-ops.wacli-keepaliveplist from a prior plugin version — a separate issue.Note
Medium Risk
Moderate risk: changes request authentication headers for Anthropic API calls and adjusts WhatsApp backfill scheduling/locking, which could cause auth failures or missed/duplicated backfills if edge cases aren’t covered.
Overview
Ops memory extraction now prefers Claude Code OAuth over API keys.
ops-memory-extractor.shaddsresolve_authto fetch an unexpired OAuth token from macOS Keychain (with required extra header), falls back to API keys from env/keychain/Doppler, and updatescurlcalls to use the resolved header/mode.wacli keepalive defers initial backfill and hardens batch locking.
wacli-keepalive.shremoves the pre---followstartup backfill, schedules the firstperiodic_backfillafter a short stabilization delay, and makesacquire_wacli_batch/release_wacli_batchre-entrant so the batch marker remains set for the entire backfill subshell.Reviewed by Cursor Bugbot for commit 666c3be. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit