
v0.28.3 feat(recipes): restart-sweep — detect dropped Telegram messages after gateway restarts#675

Merged
garrytan merged 5 commits into garrytan:master from
garrytan-agents:feat/restart-sweep
May 6, 2026

garrytan-agents commented May 6, 2026

Summary

Reshape PR #675's recipes/restart-sweep/ directory into a single self-contained recipes/restart-sweep.md recipe with the (fixed) script inlined as a fenced code block. Apply 8 code-quality fixes, port + extend the test suite to bun:test (12 ported + 14 new = 26 cases + 1 sentinel guard = 27 total).

Why land it as a recipe, not a default behavior: restart-sweep is host-specific to OpenClaw + Telegram + webhook mode. CLAUDE.md is explicit that host-specific operational tooling lives as plugin handlers in the host's own repo, not in gbrain core. So it ships as an opt-in recipe alongside twilio-voice-brain, email-to-brain, etc. — discoverable via gbrain integrations list, only "configured" when the user sets the OpenClaw envs. The recipe body documents the v2 upgrade path: registered Minion handler in the openclaw repo against gbrain/minions (see docs/guides/plugin-handlers.md).

Commits:

  • feat(recipes): reshape restart-sweep into single .md recipe + harden script — the meat of the change. New recipes/restart-sweep.md with frontmatter (no expect_exit_code, schema doesn't have it) + agent-facing setup body + inlined ~325-line script with 8 fixes. New test/restart-sweep.test.ts with 27 bun:test cases anchored on a <!-- restart-sweep:script --> sentinel comment. Old recipes/restart-sweep/ directory deleted.
  • chore: bump version and changelog (v0.28.3) — VERSION + package.json + CHANGELOG entry written in the GStack release-summary voice.
  • docs: sync README + CLAUDE.md for v0.28.3 restart-sweep recipe — README's recipes table gets the new row, CLAUDE.md's test inventory gets the new test annotation, llms-full.txt regenerated via bun run build:llms.
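
The sentinel-anchored test setup mentioned in the first commit can be pictured roughly as follows. This is a minimal sketch, not the code in test/restart-sweep.test.ts: only the sentinel string comes from this PR, while `extractScript`, `writeSaltedModule`, the temp-directory prefix, and the filename pattern are hypothetical names chosen for illustration.

```typescript
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { randomBytes } from "node:crypto";

// The sentinel string is taken from the PR; everything else is illustrative.
const SENTINEL = "<!-- restart-sweep:script -->";
const FENCE = "`".repeat(3);

/** Return the body of the first fenced code block that follows the sentinel. */
function extractScript(markdown: string): string {
  const at = markdown.indexOf(SENTINEL);
  if (at === -1) throw new Error("sentinel comment not found in recipe");
  const rest = markdown.slice(at + SENTINEL.length);
  const open = rest.indexOf(FENCE);
  if (open === -1) throw new Error("no fenced code block after sentinel");
  // Skip the opening fence line, including any info string such as "js".
  const infoEnd = rest.indexOf("\n", open);
  if (infoEnd === -1) throw new Error("unterminated opening fence");
  const close = rest.indexOf("\n" + FENCE, infoEnd);
  if (close === -1) throw new Error("unterminated code block");
  return rest.slice(infoEnd + 1, close);
}

/** Write the script under a salted filename so each import() resolves a fresh module. */
function writeSaltedModule(script: string): string {
  const dir = mkdtempSync(join(tmpdir(), "restart-sweep-test-"));
  const salt = randomBytes(6).toString("hex");
  const file = join(dir, `restart-sweep-${salt}.mjs`);
  writeFileSync(file, script, "utf8");
  return file;
}
```

A test would then `await import(...)` the returned path. Because the path differs on every call, the ESM loader cannot hand back a previously cached module instance.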

Test Coverage

CODE PATHS                                            STATUS
[+] recipes/restart-sweep.md (inlined script ~325 lines)
  ├── determineAlertMode (3 modes)                    ✓ 3 cases
  ├── filterTelegramSessions (3 paths)                ✓ 3 cases
  ├── detectDroppedMessages
  │   ├── abortedLastRun primary                      ✓ tested
  │   ├── topic extraction                            ✓ tested
  │   ├── malformed key fallback                      ✓ tested
  │   ├── AGGRESSIVE=unset (silent)                   ✓ NEW
  │   └── AGGRESSIVE=1 (fires)                        ✓ NEW
  ├── timing window correctness                        ✓ NEW
  ├── log timestamp regex (Gateway + OpenClaw)        ✓ 2 cases
  ├── loadAlerted (missing/corrupt/prune)             ✓ 3 NEW
  ├── saveAlerted (atomic tmp+rename)                 ✓ NEW
  ├── cooldown layer (not-in-map / suppress / expire) ✓ 3 NEW
  ├── round-trip (2nd invocation skips alerted)       ✓ NEW
  ├── alert formatting (real \n)                      ✓ NEW
  ├── execFile argv shape (no shell)                  ✓ NEW
  ├── GBRAIN_HOME path override                       ✓ NEW
  ├── constructor-time env reads                      ✓ NEW
  └── sentinel-shape guard                            ✓ NEW

COVERAGE: 27/27 paths (100%)  |  bun test test/restart-sweep.test.ts → 27 pass / 0 fail
Tests: 3902 → 3929 (+27 new)

Coverage gate: PASS (100%).

Pre-Landing Review

Already cleared via /plan-eng-review (6 issues, all 6 resolved with recommended option) and /codex consult mode (8 findings, all 8 resolved). The plan file at ~/.claude/plans/figure-out-if-we-eager-coral.md carries the full review trace.

Codex caught 2 silent-correctness bugs the eng review missed:

  • C1 (idempotency key collapse): original (sessionKey, restartTimeIso) key changes every run when the bootstrap log is missing, so the same stale session re-alerts forever. Fixed by adding a (sessionKey, lastAlertedAt) cooldown layer with 6h re-alert threshold.
  • C2 (import-time env snapshot): original script snapshotted env at module load — tests mutating process.env after import were semantically bogus. Fixed by moving env reads into the MessageSweepDetector constructor.
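
To make C1 and C2 concrete, here is a hedged sketch of both fixes in one small class. It is not the recipe's `MessageSweepDetector`: the class name `SweepDetectorSketch`, its methods, and the choice to treat exactly 6h as "expired" are assumptions. Only the env variable name and the 6h threshold come from this PR.

```typescript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Hypothetical sketch; names are illustrative, not the recipe's actual code.
class SweepDetectorSketch {
  readonly aggressive: boolean;
  // sessionKey -> lastAlertedAt (epoch ms). The key deliberately omits the
  // restart time, which may be synthesized fresh on every run (C1).
  private readonly alerted: Map<string, number>;

  constructor(
    alerted: Record<string, number> = {},
    env: Record<string, string | undefined> = process.env,
  ) {
    // Read env at construction time, not at module load, so values set
    // after import are honored (C2).
    this.aggressive = env.OPENCLAW_RESTART_SWEEP_AGGRESSIVE === "1";
    this.alerted = new Map(Object.entries(alerted));
  }

  /** True when the session was never alerted or the cooldown has expired. */
  shouldAlert(sessionKey: string, now: number): boolean {
    const last = this.alerted.get(sessionKey);
    return last === undefined || now - last >= SIX_HOURS_MS;
  }

  markAlerted(sessionKey: string, now: number): void {
    this.alerted.set(sessionKey, now);
  }
}
```

With this shape, a test can set `process.env.OPENCLAW_RESTART_SWEEP_AGGRESSIVE` after importing the module and still see the new value in a freshly constructed instance.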

Eval Results

No prompt-related files changed — evals skipped.

Greptile Review

No Greptile comments on the PR.

Scope Drift

CLEAN. Branch intent: reshape PR #675's recipe shape + apply the 8 code fixes + add proper bun:test coverage. Delivered: same. No files outside recipes/restart-sweep.{md,mjs}, test/restart-sweep.test.ts, or the doc-sync targets.

Plan Completion

  • [DONE] Single self-contained recipes/restart-sweep.md (D2)
  • [DONE] No expect_exit_code in command health_check (D1)
  • [DONE] Atomic tmp+rename write for alerted.json (D3)
  • [DONE] Corrupt-JSON recovery in loadAlerted (D4)
  • [DONE] 12 ported + 14 new test cases (D5, +1 sentinel guard = 27 total)
  • [DONE] AGGRESSIVE-flip recipe-body callout (D6)
  • [DONE] Cooldown layer for synthesized restart-time bug (C1)
  • [DONE] Constructor-time env reads (C2)
  • [DONE] D3-claim wording corrected — atomicity ≠ no-dupes (C3)
  • [DONE] Cron environment troubleshooting subsection (C4)
  • [DONE] Plugin-handler v2 upgrade-path TODO (C5)
  • [DONE] Sentinel-anchored test extractor + ESM-cache-bypass salting (C6)
  • [DONE] Recipe-listing-vs-env-presence wording fixed (C7)
  • [DONE] Test-runner cite includes both parallel + shard scripts (C8)

14 plan items, 14 done. 0 deferred.
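
As an illustration of D3 and D4, the sketch below shows one way to combine an atomic tmp+rename write with corrupt-JSON recovery and a 30-day prune. It assumes the state file is a flat JSON object mapping session keys to epoch-millisecond timestamps; that format, the tmp-file naming, and the function bodies are assumptions, not the recipe's actual `loadAlerted`/`saveAlerted`.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";
import { tmpdir } from "node:os";

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;
type Alerted = Record<string, number>; // assumed shape: sessionKey -> epoch ms

function loadAlerted(file: string, now: number = Date.now()): Alerted {
  let parsed: unknown;
  try {
    parsed = JSON.parse(readFileSync(file, "utf8"));
  } catch {
    return {}; // missing or corrupt file: start from an empty map
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return {};
  const fresh: Alerted = {};
  for (const [key, ts] of Object.entries(parsed)) {
    // Drop malformed values and entries older than 30 days.
    if (typeof ts === "number" && now - ts < THIRTY_DAYS_MS) fresh[key] = ts;
  }
  return fresh;
}

function saveAlerted(file: string, alerted: Alerted): void {
  mkdirSync(dirname(file), { recursive: true });
  const tmp = `${file}.${process.pid}.tmp`;
  writeFileSync(tmp, JSON.stringify(alerted, null, 2), "utf8");
  // A rename within one directory replaces the target atomically on POSIX,
  // so readers never observe a half-written file.
  renameSync(tmp, file);
}

// Usage: round-trip through a throwaway directory.
const demoFile = join(mkdtempSync(join(tmpdir(), "restart-sweep-")), "state", "alerted.json");
saveAlerted(demoFile, { "session-a": Date.now() });
console.log(loadAlerted(demoFile));
```

As C3 notes, atomicity only guards against torn files. Two overlapping runs can still both read the old state and both alert, which is why the recipe points to a queue-backed handler as the v2 path.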

Verification Results

  • bun test test/restart-sweep.test.ts → 27 pass / 0 fail
  • gbrain integrations show restart-sweep → renders cleanly
  • gbrain integrations test recipes/restart-sweep.md → frontmatter validates
  • gbrain integrations doctor (with OPENCLAW_OWNER_IDS=test OPENCLAW_TELEGRAM_GROUP=-100) → all 3 health checks pass
  • bun run typecheck → clean
  • bun run verify → all 7 pre-test gates pass (privacy, jsonb, progress, test-isolation, wasm, admin-build, typecheck)
  • bun run test → 3,929 pass / 0 fail across 8 parallel shards + serial pass

TODOS

No TODO items completed in this PR.

Documentation

Updated three files to sync with v0.28.3:

  • README.md — added Restart Sweep to the "Getting Data In" recipes table
  • CLAUDE.md — added test/restart-sweep.test.ts annotation to the unit-test inventory
  • llms-full.txt — regenerated via bun run build:llms

Test plan

  • bun test test/restart-sweep.test.ts (27 pass / 0 fail)
  • bun run verify (privacy + jsonb + progress + test-isolation + wasm + admin-build + typecheck — all pass)
  • bun run test (3,929 pass / 0 fail, no regressions)
  • gbrain integrations show restart-sweep renders cleanly
  • gbrain integrations test recipes/restart-sweep.md frontmatter validates
  • gbrain integrations doctor restart-sweep (with envs set) — all 3 health checks pass
  • Real cron-driven dry run on a deployed OpenClaw setup (manual, post-merge)

🤖 Generated with Claude Code

…way restarts

Adds a tool to detect Telegram messages dropped during OpenClaw gateway restarts
by analyzing session state patterns.

Features:
- Detects sessions with abortedLastRun flag (primary heuristic)
- Identifies timing gaps (active before restart, silent after)
- Configurable alert modes (Telegram, stdout)
- Environment-based configuration
- Comprehensive test suite
- PII-scrubbed for public use

The tool addresses webhook message loss that occurs when the gateway restarts
while messages are in-flight. Unlike long-polling, webhooks cannot replay
missed messages, making this detection crucial for production reliability.
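
The two heuristics listed above can be sketched as follows. This is an illustrative approximation only: apart from `abortedLastRun`, the session fields, the `detectDropped` name, the `aggressive` option, and the 5-minute "active before restart" window are all assumptions rather than the script's real data model.

```typescript
// Hypothetical session shape; only abortedLastRun is named in the PR.
interface SessionSketch {
  key: string;
  abortedLastRun?: boolean;
  lastActivityAt: number; // epoch ms of the last message seen (assumed field)
}

interface Finding {
  key: string;
  reason: "abortedLastRun" | "timing-gap";
}

function detectDropped(
  sessions: SessionSketch[],
  restartAt: number,
  opts: { aggressive?: boolean; activeWindowMs?: number } = {},
): Finding[] {
  const windowMs = opts.activeWindowMs ?? 5 * 60 * 1000;
  const findings: Finding[] = [];
  for (const s of sessions) {
    // Primary heuristic: the session's last run was aborted.
    if (s.abortedLastRun === true) {
      findings.push({ key: s.key, reason: "abortedLastRun" });
      continue;
    }
    // Secondary heuristic, opt-in: active shortly before the restart and
    // silent since. Prone to false positives during quiet periods.
    if (!opts.aggressive) continue;
    const gap = restartAt - s.lastActivityAt;
    if (gap > 0 && gap <= windowMs) {
      findings.push({ key: s.key, reason: "timing-gap" });
    }
  }
  return findings;
}
```

Keeping the timing-gap check behind an explicit option mirrors how the PR gates it behind `OPENCLAW_RESTART_SWEEP_AGGRESSIVE=1`, since a session that simply went quiet looks identical to one that lost a message.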
garrytan changed the title from "feat(recipes): add restart-sweep — detect dropped messages after gateway restarts" to "v0.28.3 feat(recipes): add restart-sweep — detect dropped messages after gateway restarts" May 6, 2026
garrytan and others added 3 commits May 6, 2026 11:12
…script

Reshape the directory-shaped recipes/restart-sweep/ into a single
self-contained recipes/restart-sweep.md with the (fixed) script inlined
as a fenced code block. The recipe loader at integrations.ts:445-485 only
discovers *.md, so the directory shape was invisible.

Eight script fixes:
1. Newline double-escape ('\\n' → '\n') at 8 sites
2. Hard-coded /tmp/ paths → ~/.gbrain/integrations/restart-sweep/ (honors
   GBRAIN_HOME); bootstrap-log path env-overridable via OPENCLAW_BOOTSTRAP_LOG
3. exec() of interpolated string → execFile with argv array (no shell)
4. Idempotency: loadAlerted/saveAlerted helpers, atomic tmp+rename, corrupt-
   JSON recovery, 30-day prune
5. Aggressive heuristic gated behind OPENCLAW_RESTART_SWEEP_AGGRESSIVE=1
   (default OFF — false-positive prone during quiet periods)
6. Old directory shape removed
7. Env reads moved from module top-level to constructor (fixes the import-
   time-snapshot bug that made tests semantically bogus)
8. Cooldown layer keyed on (sessionKey, lastAlertedAt) with 6h re-alert
   threshold — prevents re-alerting forever when the bootstrap log is
   missing and restartTime is synthesized fresh each run
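
Fixes 1 and 3 can be demonstrated together with a small, hedged example. It uses `execFileSync`, the synchronous sibling of `execFile`, so the result can be inspected inline, and it uses `printf` as a stand-in for whatever command the script actually invokes (this assumes a POSIX host). The alert wording and the session keys are hypothetical.

```typescript
import { execFileSync } from "node:child_process";

/** Join alert lines with a real newline ("\n"), not the two-character "\\n". */
function formatAlert(sessionKeys: string[]): string {
  return [
    "Possible dropped messages after gateway restart:",
    ...sessionKeys.map((key) => `- ${key}`),
  ].join("\n");
}

/** Run a command with an argv array; no shell ever parses the arguments. */
function runWithArgv(command: string, args: string[]): string {
  return execFileSync(command, args, { encoding: "utf8" });
}

// Keys containing shell metacharacters are passed through as literal text.
const message = formatAlert(["telegram:123; echo pwned", "telegram:$(whoami)"]);
const echoed = runWithArgv("printf", ["%s", message]);
console.log(echoed === message);
```

Had the same message been interpolated into a string for `exec()`, the `;` and `$(...)` sequences would have been interpreted by the shell. With an argv array they arrive at the child process as ordinary characters.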

Recipe body adds a Cron environment troubleshooting section with the
wrapper-script pattern (set -a; source .env; set +a; exec node ...) plus
explicit PATH= line for the cron entry. Plus a TODO line pointing at
docs/guides/plugin-handlers.md as the v2 upgrade path (registered Minion
handler in the openclaw repo for queue-backed idempotency).

Tests: 27 bun:test cases (12 ported + 14 new + 1 sentinel-shape guard).
The extractor anchors on <!-- restart-sweep:script --> sentinel and salts
the tmp filename to bypass the ESM import cache. A separate test asserts
the sentinel itself is present so future doc edits dropping it fail loud.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README.md: add restart-sweep row to "Getting Data In" recipes table
- CLAUDE.md: add test/restart-sweep.test.ts to the unit-test inventory
- llms-full.txt: regenerated via bun run build:llms

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan changed the title from "v0.28.3 feat(recipes): add restart-sweep — detect dropped messages after gateway restarts" to "v0.28.3 feat(recipes): restart-sweep — detect dropped Telegram messages after gateway restarts" May 6, 2026
Brings in v0.28.1 (zombie process reaping, /health timeout, engine
disconnect idempotency, PR garrytan#637).

Conflicts resolved:
- VERSION → 0.28.3 (ours; newer than master's 0.28.1)
- package.json → version 0.28.3 (matches VERSION)
- CHANGELOG.md → kept v0.28.3 entry above master's v0.28.1 entry; both
  full entries preserved with their own ### Itemized changes sections

Post-merge actions:
- bun install (no dep changes)
- bun run build:llms (regenerated llms-full.txt to pick up master's
  CLAUDE.md additions for v0.28.1)
- bun run test (3,876 pass / 0 fail) + verify (clean) + typecheck (clean)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan merged commit e744eda into garrytan:master May 6, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 7, 2026
….28.6

Master shipped three v0.28.x patch releases without the takes feature
while v0.28-release was in flight:
- v0.28.1: zombie process accumulation + health endpoint timeout (#637)
- v0.28.3: restart-sweep — detect dropped Telegram messages (#675)
- v0.28.4: skillify cross-modal eval quality gate (#674)

Master's v0.28.0 slot was consumed without the takes layer ever landing,
so this release ships the original takes feature as v0.28.6 (skipping
v0.28.5 to leave space for any in-flight master patches).

The migration orchestrator file (v0_28_0.ts) and migration skill doc
(skills/migrations/v0.28.0.md) keep their original version keys —
those identify the migration version, not the release version.

Conflicts resolved:
- VERSION → 0.28.6 (was 0.28.0; master had 0.28.4)
- package.json → 0.28.6 (auto-merged ai-sdk deps from master's v0.27)
- CHANGELOG.md → renamed top entry "## [0.28.0]" → "## [0.28.6]" with
  date 2026-05-06; rebuilt the "To take advantage of" block (was
  truncated by stale === markers from a prior merge); preserved master's
  v0.28.4/v0.28.3/v0.28.1 entries beneath
- src/cli.ts auto-merged (CLI_ONLY has providers + takes/think both)

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation +
  wasm + admin-build + typecheck)
- 133 tests pass: migrate + apply-migrations + takes-engine + takes-fence
- migrations v37 (takes) + v38 (access_tokens_permissions) apply cleanly
  on top of master's v35 (auto-RLS) + v36 (subagent persistence)