fix(continuous-learning-v2): prevent observer memory explosion#536
Conversation
Three fixes for the positive feedback loop causing runaway memory usage:
1. SIGUSR1 throttling in observe.sh: Signal observer only every 20
observations (configurable via ECC_OBSERVER_SIGNAL_EVERY_N) instead
of on every tool call. Uses a counter file to track invocations.
2. Re-entrancy guard in observer-loop.sh on_usr1(): ANALYZING flag
prevents parallel Claude analysis processes from spawning when
signals arrive while analysis is already running.
3. Cooldown + tail-based sampling in observer-loop.sh:
- 60s cooldown between analyses (ECC_OBSERVER_ANALYSIS_COOLDOWN)
- Only last 500 lines sent to LLM (ECC_OBSERVER_MAX_ANALYSIS_LINES)
instead of the entire observations file
Closes #521
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements throttling and batching mechanisms to prevent the feedback loop causing memory exhaustion in the observer system. Changes include signal throttling via counter-based gating in observe.sh, re-entrancy guards and cooldown constraints in observer-loop.sh, tail-based sampling of observations, and comprehensive test coverage validating all new mechanisms. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Poem
✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR addresses the observer memory explosion from issue #521 by introducing three complementary throttling mechanisms: SIGUSR1 signal throttling in Key issues found:
Confidence Score: 2/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Hook as observe.sh (hook)
participant Counter as .observer-signal-counter
participant Observer as observer-loop.sh (bg)
participant Claude as claude (analysis)
Note over Hook,Counter: Every tool call
Hook->>Counter: read counter
Counter-->>Hook: N
Hook->>Counter: write N+1
alt N+1 >= SIGNAL_EVERY_N (20)
Hook->>Counter: write 0 (reset)
Hook->>Observer: kill -USR1
Note over Observer: on_usr1() fires
alt ANALYZING == 1
Observer-->>Observer: skip (re-entrancy guard)
else elapsed < ANALYSIS_COOLDOWN (60s)
Observer-->>Observer: skip (cooldown)
else
Observer->>Observer: ANALYZING=1
Observer->>Observer: tail last 500 lines → analysis_file
Observer->>Claude: spawn claude --print < prompt_file
Note over Claude: runs up to 120s / 10 turns
Claude-->>Observer: exit
Observer->>Observer: LAST_ANALYSIS_EPOCH = now
Observer->>Observer: ANALYZING=0
Observer->>Observer: archive observations.jsonl
end
end
Note over Observer: Periodic (every 300s, main loop)
Observer->>Observer: analyze_observations()
Note over Observer: ⚠️ ANALYZING flag NOT set here
Observer->>Claude: spawn claude
alt SIGUSR1 fires during wait
Hook->>Observer: kill -USR1
Note over Observer: ANALYZING==0 → guard bypassed!
Observer->>Claude: spawn 2nd concurrent claude
end
|
| should_signal=0 | ||
| if [ -f "$SIGNAL_COUNTER_FILE" ]; then | ||
| counter=$(cat "$SIGNAL_COUNTER_FILE" 2>/dev/null || echo 0) | ||
| counter=$((counter + 1)) | ||
| if [ "$counter" -ge "$SIGNAL_EVERY_N" ]; then | ||
| should_signal=1 | ||
| counter=0 | ||
| fi | ||
| done | ||
| echo "$counter" > "$SIGNAL_COUNTER_FILE" | ||
| else | ||
| echo "1" > "$SIGNAL_COUNTER_FILE" | ||
| fi |
There was a problem hiding this comment.
Counter file has no locking — concurrent invocations can double-signal
The counter read-increment-write is not atomic. Claude Code can invoke hooks concurrently for parallel tool calls, meaning two observe.sh processes can:
- Both read the counter file at the same value (e.g.,
19) - Both increment to
20, both evaluate>= SIGNAL_EVERY_Nas true - Both set
should_signal=1and sendkill -USR1to the observer
This means two signals get sent simultaneously, and then two sequential on_usr1 calls fire. If the first on_usr1 completes quickly (e.g., hits an early-return branch) and resets ANALYZING=0 before the second one executes, two full analyses can be spawned — recreating the root cause of #521.
Consider protecting the counter update with flock (consistent with the existing locking pattern used above for the lazy-start logic):
if command -v flock >/dev/null 2>&1; then
(
flock -x 9
counter=$(cat "$SIGNAL_COUNTER_FILE" 2>/dev/null || echo 0)
counter=$((counter + 1))
if [ "$counter" -ge "$SIGNAL_EVERY_N" ]; then
should_signal=1
counter=0
fi
echo "$counter" > "$SIGNAL_COUNTER_FILE"
) 9>"${SIGNAL_COUNTER_FILE}.lock"
fi|
|
||
| ANALYZING=1 | ||
| analyze_observations | ||
| LAST_ANALYSIS_EPOCH=$(date +%s) | ||
| ANALYZING=0 |
There was a problem hiding this comment.
LAST_ANALYSIS_EPOCH updated even when analyze_observations skips early
analyze_observations has several early-return paths (missing file, obs_count < MIN_OBSERVATIONS, session-guardian denying, claude not found). When these paths are taken, no analysis actually occurs — but LAST_ANALYSIS_EPOCH is still updated on line 166, starting the 60-second cooldown. This can delay legitimate analyses if, for example, the observer file doesn't exist yet during the first few signals.
Consider only updating LAST_ANALYSIS_EPOCH when analysis actually ran (e.g., after successfully spawning claude), or at least moving the update inside analyze_observations after the guards pass.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a26747b80
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| analyze_observations | ||
| LAST_ANALYSIS_EPOCH=$(date +%s) | ||
| ANALYZING=0 |
There was a problem hiding this comment.
Update cooldown timestamp only after real analysis runs
on_usr1() always sets LAST_ANALYSIS_EPOCH after calling analyze_observations, even when that function exits early (for example when the file is missing or obs_count < MIN_OBSERVATIONS). In those no-op cases, subsequent SIGUSR1 events are still suppressed by cooldown, which can delay the first eligible analysis by at least ECC_OBSERVER_ANALYSIS_COOLDOWN seconds despite no Claude run having happened.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
4 issues found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="tests/hooks/observer-memory.test.js">
<violation number="1" location="tests/hooks/observer-memory.test.js:342">
P2: Use the same platform-aware Python detection in the fallback check; `python3`-only probing can incorrectly skip failures on Windows.</violation>
</file>
<file name="skills/continuous-learning-v2/hooks/observe.sh">
<violation number="1" location="skills/continuous-learning-v2/hooks/observe.sh:371">
P2: Validate `ECC_OBSERVER_SIGNAL_EVERY_N` before numeric comparison; non-integer values currently break throttle evaluation and can disable signaling.</violation>
<violation number="2" location="skills/continuous-learning-v2/hooks/observe.sh:376">
P2: The counter-file throttle is racy: concurrent observe.sh runs can emit multiple SIGUSR1 signals for the same threshold crossing, weakening the memory-loop protection.</violation>
</file>
<file name="skills/continuous-learning-v2/agents/observer-loop.sh">
<violation number="1" location="skills/continuous-learning-v2/agents/observer-loop.sh:166">
P2: Only update `LAST_ANALYSIS_EPOCH` when an analysis run actually starts/completes; updating it after no-op early returns incorrectly triggers cooldown and delays legitimate analyses.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| ); | ||
| } else { | ||
| // If python3 is not available the hook exits early - that is acceptable | ||
| const hasPython = spawnSync('python3', ['--version']).status === 0; |
There was a problem hiding this comment.
P2: Use the same platform-aware Python detection in the fallback check; python3-only probing can incorrectly skip failures on Windows.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/hooks/observer-memory.test.js, line 342:
<comment>Use the same platform-aware Python detection in the fallback check; `python3`-only probing can incorrectly skip failures on Windows.</comment>
<file context>
@@ -0,0 +1,360 @@
+ );
+ } else {
+ // If python3 is not available the hook exits early - that is acceptable
+ const hasPython = spawnSync('python3', ['--version']).status === 0;
+ if (hasPython) {
+ assert.fail('Counter file should exist after running observe.sh');
</file context>
| # Throttle SIGUSR1: only signal observer every N observations (#521) | ||
| # This prevents rapid signaling when tool calls fire every second, | ||
| # which caused runaway parallel Claude analysis processes. | ||
| SIGNAL_EVERY_N="${ECC_OBSERVER_SIGNAL_EVERY_N:-20}" |
There was a problem hiding this comment.
P2: Validate ECC_OBSERVER_SIGNAL_EVERY_N before numeric comparison; non-integer values currently break throttle evaluation and can disable signaling.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/continuous-learning-v2/hooks/observe.sh, line 371:
<comment>Validate `ECC_OBSERVER_SIGNAL_EVERY_N` before numeric comparison; non-integer values currently break throttle evaluation and can disable signaling.</comment>
<file context>
@@ -365,24 +365,45 @@ if [ "$OBSERVER_ENABLED" = "true" ]; then
+# Throttle SIGUSR1: only signal observer every N observations (#521)
+# This prevents rapid signaling when tool calls fire every second,
+# which caused runaway parallel Claude analysis processes.
+SIGNAL_EVERY_N="${ECC_OBSERVER_SIGNAL_EVERY_N:-20}"
+SIGNAL_COUNTER_FILE="${PROJECT_DIR}/.observer-signal-counter"
+
</file context>
| SIGNAL_EVERY_N="${ECC_OBSERVER_SIGNAL_EVERY_N:-20}" | |
| SIGNAL_EVERY_N="${ECC_OBSERVER_SIGNAL_EVERY_N:-20}" | |
| case "$SIGNAL_EVERY_N" in | |
| ''|*[!0-9]*|0) SIGNAL_EVERY_N=20 ;; | |
| esac |
|
|
||
| should_signal=0 | ||
| if [ -f "$SIGNAL_COUNTER_FILE" ]; then | ||
| counter=$(cat "$SIGNAL_COUNTER_FILE" 2>/dev/null || echo 0) |
There was a problem hiding this comment.
P2: The counter-file throttle is racy: concurrent observe.sh runs can emit multiple SIGUSR1 signals for the same threshold crossing, weakening the memory-loop protection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/continuous-learning-v2/hooks/observe.sh, line 376:
<comment>The counter-file throttle is racy: concurrent observe.sh runs can emit multiple SIGUSR1 signals for the same threshold crossing, weakening the memory-loop protection.</comment>
<file context>
@@ -365,24 +365,45 @@ if [ "$OBSERVER_ENABLED" = "true" ]; then
+
+should_signal=0
+if [ -f "$SIGNAL_COUNTER_FILE" ]; then
+ counter=$(cat "$SIGNAL_COUNTER_FILE" 2>/dev/null || echo 0)
+ counter=$((counter + 1))
+ if [ "$counter" -ge "$SIGNAL_EVERY_N" ]; then
</file context>
|
|
||
| ANALYZING=1 | ||
| analyze_observations | ||
| LAST_ANALYSIS_EPOCH=$(date +%s) |
There was a problem hiding this comment.
P2: Only update LAST_ANALYSIS_EPOCH when an analysis run actually starts/completes; updating it after no-op early returns incorrectly triggers cooldown and delays legitimate analyses.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/continuous-learning-v2/agents/observer-loop.sh, line 166:
<comment>Only update `LAST_ANALYSIS_EPOCH` when an analysis run actually starts/completes; updating it after no-op early returns incorrectly triggers cooldown and delays legitimate analyses.</comment>
<file context>
@@ -130,7 +146,25 @@ on_usr1() {
+
+ ANALYZING=1
analyze_observations
+ LAST_ANALYSIS_EPOCH=$(date +%s)
+ ANALYZING=0
}
</file context>
Merge upstream/main which includes: - fix: observer memory explosion with throttling and tail sampling (affaan-m#536) - fix: observer hooks hardening, secret scrubbing (affaan-m#348) - fix: lazy-start observer logic (affaan-m#508) - fix: read tool_response field in observe.sh (affaan-m#377) - feat: add C++, Java, Rust, PyTorch language support - feat: SQLite state store, skill evolution, session adapters - feat: orchestration harness and selective install - feat: DevFleet multi-agent orchestration skill Conflict resolution: - observe.sh: kept our hook_event_name + tool_response fixes, merged upstream's $PYTHON_CMD, secret scrubbing, lazy-start, throttled signaling, and session guards - package.json: accepted upstream's new scripts and dependencies
…d tail sampling (affaan-m#536) Three fixes for the positive feedback loop causing runaway memory usage: 1. SIGUSR1 throttling in observe.sh: Signal observer only every 20 observations (configurable via ECC_OBSERVER_SIGNAL_EVERY_N) instead of on every tool call. Uses a counter file to track invocations. 2. Re-entrancy guard in observer-loop.sh on_usr1(): ANALYZING flag prevents parallel Claude analysis processes from spawning when signals arrive while analysis is already running. 3. Cooldown + tail-based sampling in observer-loop.sh: - 60s cooldown between analyses (ECC_OBSERVER_ANALYSIS_COOLDOWN) - Only last 500 lines sent to LLM (ECC_OBSERVER_MAX_ANALYSIS_LINES) instead of the entire observations file Closes affaan-m#521
Summary
Fixes #521 — the background observer agent caused memory explosion and crash due to a positive feedback loop between
observe.shSIGUSR1 signaling andobserver-loop.shanalysis.Three root causes addressed:
observe.sh): Every tool call was sending a signal to the observer. Now uses a counter file to only signal every 20 observations (configurable viaECC_OBSERVER_SIGNAL_EVERY_N).observer-loop.shon_usr1()): AddedANALYZINGflag that prevents new analysis from spawning while one is already running. Previously, signals during analysis spawned parallel Claude processes.observer-loop.sh): 60s cooldown between analyses (ECC_OBSERVER_ANALYSIS_COOLDOWN). Only sends the last 500 lines to the LLM (ECC_OBSERVER_MAX_ANALYSIS_LINES) instead of the entire 4.8MB file.Test plan
tests/hooks/observer-memory.test.js— all passingnode tests/run-all.js)Summary by cubic
Prevents the observer’s memory explosion by breaking the SIGUSR1/analysis feedback loop. Adds throttling, a re-entrancy guard, a cooldown, and tail-based sampling to keep analyses bounded. Fixes #521.
observe.sh: signal every N events (default 20 viaECC_OBSERVER_SIGNAL_EVERY_N) using a.observer-signal-counterfile.observer-loop.sh:ANALYZINGflag blocks overlapping analyses; add 60s cooldown (ECC_OBSERVER_ANALYSIS_COOLDOWN).observer-loop.sh: analyze only the last 500 lines (ECC_OBSERVER_MAX_ANALYSIS_LINES) via a temp file; clean up after run.tests/hooks/observer-memory.test.jscovering throttling, sampling, and re-entrancy.Written for commit 3a26747. Summary will update on new commits.
Summary by CodeRabbit
Bug Fixes
Tests