
kmanan/cheap-openclaw


cheap-openclaw

16 production-tested techniques to cut your OpenClaw agent costs by 10x.

I run OpenClaw as my family's autonomous butler — Spratt. He handles iMessage conversations, morning briefings, evening digests, health monitoring, email scanning, grocery tracking, flight alerts, and household automation, 24/7. The skills I built for that are in the spratt-skills repo.

Below is every cost and performance optimization that emerged from running Spratt in production. These aren't theories — they're battle-tested against a month of real traffic, real cron jobs, and many painful cost spikes. Every technique includes the config, the rationale, and what went wrong when we got it wrong.


Table of Contents

  1. Intelligent Model Routing
  2. Cron Job Model Tiering
  3. Exec Payloads for Deterministic Work
  4. Prompt Caching
  5. lightContext for Cron Jobs
  6. Bootstrap File Optimization
  7. Fallback Chain Optimization
  8. Cron Session Cleanup
  9. Reply Noise Suppression
  10. Subagent Model Assignment
  11. Compaction and Context Pruning
  12. Usage Tracking
  13. Prompt Compression
  14. Session Reset on Idle
  15. Move High-Volume Extraction to Subscription-Backed Codex
  16. Ambient Monitor (Heartbeat) Configuration

Quick Start

If you want the biggest wins with the least effort, do these four first:

  1. Deploy a model router for interactive sessions (iblai-openclaw-router). It routes ~80% of messages to Haiku instead of Sonnet, roughly halving interactive costs.
  2. Enable prompt caching on Haiku (cacheRetention: "short") — 90% discount on cached input tokens for your highest-volume model.
  3. Set lightContext: true on all cron jobs that don't need full agent context — eliminates ~15-20K chars of bootstrap re-injection per cron turn.
  4. Move high-volume extraction jobs off metered API models when you already pay for a subscription-backed provider like OpenAI Codex. Keep the LLM for language/extraction, but isolate deterministic writes so the swap is safe.

Combined, these changes can cut your bill by 5-8x, and the fourth one prevents a "cheap" extraction pipeline from quietly becoming an $80/month metered API workload.


1. Intelligent Model Routing

The idea: Not every message needs your most expensive model. A routing proxy scores each incoming message and sends it to the cheapest model that can handle it.

Setup

Deploy iblai-openclaw-router as a local proxy. It's a lightweight Node.js server (449 lines, zero external dependencies) that sits between OpenClaw and the Anthropic API.

Set it as your primary model in openclaw.json:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "iblai-router/auto"
      }
    }
  }
}

Three-Tier System

| Tier | Model | Cost (input / output per M tokens) |
|---|---|---|
| LIGHT | claude-haiku-4-5-20251001 | $1.00 / $5.00 |
| MEDIUM | claude-sonnet-4-6 | $3.00 / $15.00 |
| HEAVY | claude-sonnet-4-6 (or Opus if you need it) | $3.00 / $15.00 |

How Scoring Works

The router evaluates the last user message only (not the full conversation history) across 14 weighted dimensions:

| Dimension | Weight | Effect |
|---|---|---|
| reasoningMarkers | 0.18 | Complex reasoning keywords push toward MEDIUM/HEAVY |
| codePresence | 0.14 | Code-related terms push toward MEDIUM |
| simpleIndicators | 0.06 | Greetings, acknowledgments push toward LIGHT |
| relayIndicators | 0.05 | "Tell X...", "Send to..." push toward LIGHT |
| tokenCount | 0.08 | Short messages (-0.8), long messages (+0.8) |
| ... and 9 more | 0.49 | Technical terms, multi-step, creative, agentic, etc. |

The final score maps to a tier:

  • Score < lightMedium boundary -> LIGHT (Haiku)
  • Score < mediumHeavy boundary -> MEDIUM (Sonnet)
  • Score >= mediumHeavy -> HEAVY

Tuning Tips

The default boundaries are a good start, but you'll want to tune for your traffic:

  • Too much hitting Sonnet? Raise the lightMedium boundary (e.g., from 0.05 to 0.08) to widen the LIGHT band, since scores below it route to Haiku.
  • Quality suffering on simple tasks? Lower it back.
  • BlueBubbles/iMessage metadata inflating scores? The router scores the raw message including channel metadata (~100-150 tokens per message). Adjust tokenCount thresholds upward to compensate.
  • Watch the confidence threshold. When the score is near a boundary, the router computes a sigmoid confidence. Below the threshold, it defaults to MEDIUM as a safety net. Lower the threshold to let more borderline cases stay LIGHT.

Hard Overrides

Two conditions bypass scoring entirely, forcing HEAVY:

  1. 2+ reasoning keywords matched
  2. Estimated tokens > 50,000
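The scoring and tier mapping described above can be sketched in a few lines. This is an illustration, not iblai-openclaw-router's source. It models only five of the 14 dimensions and assumes each per-dimension score is already computed as a signed value in [-1, 1], with negative values pushing toward LIGHT. The mediumHeavy boundary, sigmoid steepness, and confidence threshold are hypothetical values. The 0.08 lightMedium value is the example from the tuning tips, and the two hard overrides come from this section.

```python
import math
from dataclasses import dataclass

# Weights for the five dimensions named in the table above; the other nine are omitted.
WEIGHTS = {
    "reasoningMarkers": 0.18,
    "codePresence": 0.14,
    "simpleIndicators": 0.06,
    "relayIndicators": 0.05,
    "tokenCount": 0.08,
}
LIGHT_MEDIUM = 0.08          # example lightMedium value from the tuning tips
MEDIUM_HEAVY = 0.30          # hypothetical mediumHeavy boundary
CONFIDENCE_THRESHOLD = 0.70  # hypothetical
STEEPNESS = 12.0             # hypothetical sigmoid steepness


@dataclass
class Decision:
    tier: str
    score: float
    confidence: float


def route(dim_scores: dict[str, float], reasoning_hits: int, est_tokens: int) -> Decision:
    """Pick a tier for the last user message from per-dimension scores in [-1, 1]."""
    # Hard overrides bypass scoring entirely.
    if reasoning_hits >= 2 or est_tokens > 50_000:
        return Decision("HEAVY", math.nan, 1.0)

    score = sum(w * dim_scores.get(dim, 0.0) for dim, w in WEIGHTS.items())
    if score < LIGHT_MEDIUM:
        tier = "LIGHT"
    elif score < MEDIUM_HEAVY:
        tier = "MEDIUM"
    else:
        tier = "HEAVY"

    # Sigmoid of the distance to the nearest boundary gives a confidence in [0.5, 1).
    distance = min(abs(score - LIGHT_MEDIUM), abs(score - MEDIUM_HEAVY))
    confidence = 1.0 / (1.0 + math.exp(-STEEPNESS * distance))
    if confidence < CONFIDENCE_THRESHOLD:
        tier = "MEDIUM"  # safety net for borderline scores
    return Decision(tier, score, confidence)
```

A greeting scored as {"simpleIndicators": -1.0, "tokenCount": -0.8} lands in LIGHT with high confidence. A score of 0.07 sits just under the boundary and falls back to MEDIUM. Lowering CONFIDENCE_THRESHOLD lets more of those borderline cases stay LIGHT.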

What We Measured

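After tuning: 80% LIGHT (Haiku), 20% MEDIUM (Sonnet) across 564 routing decisions. Haiku's per-token price is 3-4x lower than Sonnet's. Assuming routed and unrouted messages are similar in size, that split works out to roughly a 2-2.5x cost reduction versus sending everything to Sonnet. Even at 100% LIGHT, the saving cannot exceed the Sonnet/Haiku price ratio.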
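You can check a routing saving against your own traffic by computing the blended cost directly. This minimal sketch assumes Anthropic list prices of $1/$5 per million tokens for Claude Haiku 4.5 and $3/$15 for Sonnet 4.6, and it assumes every message has the same token counts. For a better estimate, substitute real per-tier token totals from routing.csv.

```python
# List prices in USD per million tokens: (input, output). Assumed current; adjust as needed.
PRICES = {
    "haiku": (1.00, 5.00),    # claude-haiku-4-5
    "sonnet": (3.00, 15.00),  # claude-sonnet-4-6
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at list prices."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000


def routing_savings(light_share: float, input_tokens: int, output_tokens: int) -> float:
    """All-Sonnet cost divided by routed cost, assuming every message has the same size."""
    sonnet = request_cost("sonnet", input_tokens, output_tokens)
    haiku = request_cost("haiku", input_tokens, output_tokens)
    routed = light_share * haiku + (1 - light_share) * sonnet
    return sonnet / routed
```

Under these prices, both the input and output ratios are exactly 3x, so the result does not depend on message size. routing_savings(0.8, ...) equals 1 / (0.8/3 + 0.2), about 2.14x, and the saving tops out at 3x with everything on Haiku.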


2. Cron Job Model Tiering

The idea: Match the model to what the job actually does. Content composition for humans needs quality (Haiku). Orchestration and checks just need to work (Flash).

The Rule

| Task type | Model | Why |
|---|---|---|
| Briefings, digests, summaries | Claude Haiku 4.5 | 90% data retrieval, 10% formatting. Haiku is great at structured formatting. |
| Health checks, scrapers, inspections | Gemini Flash | Orchestration work. Cheapest option at $0.30/$2.50 per M. |
| High-volume classification/extraction pipelines | Subscription-backed Codex where available | Use your fixed subscription instead of paying per-token API rates. |
| Interactive sessions | Router (see above) | Per-turn routing. |

Example jobs.json

{
  "jobs": [
    {
      "name": "Morning Briefing",
      "model": "anthropic/claude-haiku-4-5-20251001",
      "sessionTarget": "isolated",
      "payload": { "kind": "agentTurn", "message": "Run the morning briefing pipeline" },
      "lightContext": true
    },
    {
      "name": "Health Check",
      "model": "google/gemini-2.5-flash",
      "sessionTarget": "isolated",
      "payload": { "kind": "agentTurn", "message": "Run health checks" },
      "lightContext": true
    }
  ]
}

Cost Difference

Switching briefings from Sonnet ($3/$15) to Haiku ($1/$5) cuts the cost of those jobs by ~3x. Switching orchestration from Haiku to Flash ($0.30/$2.50) saves another ~2-3x (about 3.3x on input and 2x on output). Orchestration jobs moved all the way from Sonnet to Flash therefore end up ~6-10x cheaper. For high-volume extraction jobs, moving from metered Gemini Flash to a subscription-backed Codex route can take that workload off your variable API bill entirely.

Pitfall: Don't Let Claude "Optimize" Your Models

If you have model assignment rules in your CLAUDE.md or AGENTS.md, be explicit. AI assistants love to "optimize" by upgrading cheap models to expensive ones or downgrading quality-sensitive jobs to the cheapest option. State the rule clearly and explain why.


3. Exec Payloads for Deterministic Work

The idea: If a cron job runs a shell script, don't spin up an LLM session to do it.

Before (wasteful)

Cron job -> agentTurn -> LLM session -> LLM reads system prompt -> LLM calls tool -> script runs

You're paying for: system prompt injection, LLM reasoning about what to do, tool call overhead. For a shell command.

After (free)

Cron job -> exec -> script runs directly

Zero tokens. Zero LLM involvement.

Configuration

{
  "name": "Nightly Backup",
  "payload": {
    "kind": "exec",
    "command": "/path/to/backup.sh"
  }
}

Caveat

sessionTarget: "isolated" requires agentTurn — exec payloads are silently skipped for isolated sessions. If a job must be isolated and runs a script, wrap it in a minimal agentTurn:

{
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Run: /path/to/script.sh"
  }
}

This still costs tokens for the prompt and the tool call. Pairing it with lightContext: true (see #5) keeps that overhead far below a full-context agent turn.
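Because the skip is silent, it can be worth linting jobs.json for this combination. This minimal sketch assumes the jobs.json shape shown in #2: a top-level "jobs" list whose entries carry sessionTarget and payload.kind.

```python
import json
import sys


def skipped_exec_jobs(jobs_doc: dict) -> list[str]:
    """Names of jobs that pair sessionTarget "isolated" with an exec payload (silently skipped)."""
    names = []
    for job in jobs_doc.get("jobs", []):
        kind = job.get("payload", {}).get("kind")
        if job.get("sessionTarget") == "isolated" and kind == "exec":
            names.append(job.get("name", "<unnamed>"))
    return names


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:  # path to your jobs.json
        for name in skipped_exec_jobs(json.load(f)):
            print(f"silently skipped: {name}")
```

Run it as `python3 lint_jobs.py /path/to/jobs.json`. The script name is hypothetical. Any job it prints needs either a different sessionTarget or the agentTurn wrapper above.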


4. Prompt Caching

The idea: Anthropic's prompt caching gives a 90% discount on repeated input content. Your system prompt (bootstrap files) is nearly identical every turn — cache it.

Configuration

{
  "agents": {
    "defaults": {
      "model": {
        "models": {
          "anthropic/claude-haiku-4-5-20251001": {
            "params": { "cacheRetention": "short" }
          },
          "anthropic/claude-sonnet-4-6": {
            "params": { "cacheRetention": "short" }
          }
        }
      }
    }
  }
}

Why Haiku Matters Most

If you're using a router that sends 80% of traffic to Haiku, Haiku is your highest-volume model. Without caching, every Haiku turn re-processes the full system prompt (~15-20K chars). With caching, that content gets a 90% discount after the first turn.

See examples/openclaw.json for a complete annotated configuration.
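Caching matters most on a high-volume model. To see why, compare what a repeated system-prompt prefix costs across a burst of turns. The sketch relies on these assumptions:

  • Anthropic's published multipliers for the 5-minute cache apply: writes bill at 1.25x the base input price and reads at 0.1x.
  • cacheRetention: "short" maps to that 5-minute tier.
  • The 5,000-token prefix (roughly 15-20K chars at an assumed ~4 chars per token) and the $1/M Haiku 4.5 input price are illustrative.

```python
def prefix_cost(turns: int, prefix_tokens: int, input_price_per_m: float, cached: bool) -> float:
    """USD spent on an identical system-prompt prefix over `turns` turns that all land within the cache TTL."""
    if turns < 1:
        raise ValueError("turns must be >= 1")
    base = prefix_tokens * input_price_per_m / 1_000_000  # uncached cost of one prefix
    if not cached:
        return turns * base
    # The first turn writes the cache at a premium; later turns read it at a 90% discount.
    return base * 1.25 + (turns - 1) * base * 0.10
```

Ten turns on a 5,000-token prefix cost about $0.050 uncached versus about $0.011 cached, roughly 78% less. A single turn that never gets a cache hit costs 25% more, so caching pays off for chatty interactive sessions, not for cron jobs spaced hours apart.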


5. lightContext for Cron Jobs

The idea: Cron jobs running pipelines or scripts don't need your agent's full personality, tool reference data, memory files, and behavioral rules injected into their session.

Configuration

Add "lightContext": true to any cron job that doesn't need full agent context:

{
  "name": "Morning Briefing",
  "lightContext": true,
  "payload": { "kind": "agentTurn", "message": "..." }
}

What It Does

With lightContext: true, the cron session starts with minimal bootstrap content instead of injecting all your workspace files (AGENTS.md, SOUL.md, TOOLS.md, MEMORY.md, etc.). For a briefing job that just runs a pipeline, the model doesn't need personality rules or Home Assistant entity IDs.

What to Enable It On

  • Briefing/digest pipelines (they get their instructions from the pipeline, not bootstrap)
  • Health checks and inspections
  • Any job that executes a well-defined script or workflow

What to Leave It Off

  • Jobs that need to interpret unstructured data using agent knowledge
  • Jobs that use tools requiring full context (e.g., email scanning with classification rules in AGENTS.md)

6. Bootstrap File Optimization

The idea: OpenClaw injects workspace bootstrap files into every turn's system prompt, capped at 20K chars per file and 150K total. Everything in those files costs tokens on every single turn.

What to Do

  1. Audit your bootstrap files. Check the sizes of AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, MEMORY.md. Anything close to the 20K limit is probably carrying dead weight.

  2. Move reference data to skills. Device IDs, entity tables, API endpoint lists, routing tables — these belong in a skill's SKILL.md, loaded on-demand. OpenClaw injects only a compact manifest line per skill at bootstrap.

  3. Move domain-scoped rules to skills. Instructions that only matter for specific tasks (briefing formatting, trip management, flight tracking) should be lazy-loaded skills, not always-injected rules.

  4. Set bootstrapMaxChars as a safety net:

{
  "agents": {
    "defaults": {
      "bootstrapMaxChars": 20000
    }
  }
}
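Step 1 is easy to script. This minimal sketch assumes the six files live at the workspace root, so adjust it for your layout. The 80% warning ratio is an arbitrary choice. Character counts are Python string lengths, which may differ slightly from how OpenClaw counts.

```python
from pathlib import Path

# Bootstrap files from step 1 and the caps quoted at the top of this section.
BOOTSTRAP_FILES = ("AGENTS.md", "SOUL.md", "TOOLS.md", "IDENTITY.md", "USER.md", "MEMORY.md")
PER_FILE_CAP = 20_000
TOTAL_CAP = 150_000


def audit_texts(texts: dict[str, str], warn_ratio: float = 0.8) -> list[str]:
    """Flag files at or above warn_ratio of the per-file cap, then report the total."""
    findings = []
    for name, text in texts.items():
        if len(text) >= PER_FILE_CAP * warn_ratio:
            findings.append(f"{name}: {len(text)} chars (cap {PER_FILE_CAP})")
    total = sum(len(t) for t in texts.values())
    findings.append(f"total: {total} chars (cap {TOTAL_CAP})")
    return findings


def audit_workspace(workspace: str) -> list[str]:
    """Read whichever bootstrap files exist in the workspace and audit them."""
    root = Path(workspace)
    texts = {
        name: (root / name).read_text(encoding="utf-8")
        for name in BOOTSTRAP_FILES
        if (root / name).is_file()
    }
    return audit_texts(texts)
```

Call `print("\n".join(audit_workspace("/path/to/workspace")))` and move whatever it flags into skills first.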

Skill-Based Lazy Loading

OpenClaw's skill architecture is the prescribed way to avoid bootstrap bloat:

  • At bootstrap: only skill name + description + path are injected (one line per skill)
  • On demand: full SKILL.md is loaded when the model decides it needs that skill
  • Result: domain knowledge costs zero tokens when irrelevant

7. Fallback Chain Optimization

The idea: When your primary model hits rate limits (429s) or spending caps, the fallback chain determines what happens next. A bad fallback chain can cost you 10x.

The Problem

Default fallback: Flash -> Sonnet -> ...

When Flash 429s (common on free tiers or spending caps), your health check suddenly runs on Sonnet at $3/$15 per M tokens. A heartbeat that runs every 30 minutes hits Sonnet 48 times/day.

The Fix

{
  "agents": {
    "defaults": {
      "model": {
        "fallbacks": [
          "anthropic/claude-haiku-4-5-20251001",
          "google/gemini-2.5-flash"
        ]
      }
    }
  }
}

Flash always falls back to Haiku, never Sonnet. Apply this both system-wide and on any per-agent overrides.

Per-Agent Override

If an agent has its own model config, it needs its own fallback:

{
  "agents": {
    "list": [
      {
        "id": "my-heartbeat",
        "model": {
          "primary": "google/gemini-2.5-flash",
          "fallbacks": ["anthropic/claude-haiku-4-5-20251001"]
        }
      }
    ]
  }
}
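A small lint that compares each fallback's price against the primary's can catch a Flash -> Sonnet cascade before it shows up on the bill. The sketch relies on these assumptions:

  • The price table uses list input prices.
  • The 4x tolerance is an arbitrary threshold, chosen because it allows Flash -> Haiku and rejects Flash -> Sonnet.
  • Unknown fallback models are flagged on purpose.

```python
# Input price in USD per million tokens (list prices; adjust to your contracts).
INPUT_PRICE = {
    "google/gemini-2.5-flash": 0.30,
    "anthropic/claude-haiku-4-5-20251001": 1.00,
    "anthropic/claude-sonnet-4-6": 3.00,
}


def costly_fallbacks(model_cfg: dict, max_ratio: float = 4.0) -> list[str]:
    """Fallbacks whose input price exceeds the primary's by more than max_ratio."""
    primary = INPUT_PRICE.get(model_cfg.get("primary", ""))
    if primary is None:
        return []  # unknown primary (e.g. a router): nothing to compare against
    return [
        fb for fb in model_cfg.get("fallbacks", [])
        # Unknown fallbacks count as infinitely expensive so they get flagged.
        if INPUT_PRICE.get(fb, float("inf")) > primary * max_ratio
    ]
```

Run it on agents.defaults.model and on every per-agent model block, since each one carries its own fallback chain.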

8. Cron Session Cleanup

The idea: OpenClaw's lossless-claw context engine accumulates messages across isolated cron sessions, even though each run gets a new session ID. Left unchecked, this "session rot" wastes tokens and eventually breaks jobs.

The Problem

sessionTarget: "isolated" creates a new sessionId each run, but lossless-claw indexes by sessionKey. Every run's messages accumulate under the same key. After ~25 days of daily runs, 100+ messages of dead context get replayed into each new session.

Symptoms:

  • Haiku returns empty responses (overwhelmed by stale context)
  • Token usage climbs steadily over days/weeks
  • Flash survives longer but eventually rots too
  • GCP bill explodes. In our case, Gemini API costs went from ~$4/month to $64.74/month — a 1,571% increase — entirely from input token growth caused by session rot. Every cron run replays the full accumulated history as input tokens. With multiple jobs running daily (some 3x/day), the compounding cost is severe. Later, even after session cleanup, high-volume email extraction still showed roughly $80/month of Gemini spend; see technique #15 for the separate fix.

The Native Fix (lossless-claw 0.9.4+)

As of lossless-claw 0.9.4, the plugin supports ignoreSessionPatterns — sessions matching these glob patterns are entirely skipped: no conversation rows, no messages, no participation in compaction. This is the proper upstream fix.

{
  "plugins": {
    "entries": {
      "lossless-claw": {
        "enabled": true,
        "config": {
          "summaryModel": "anthropic/claude-haiku-4-5-20251001",
          "ignoreSessionPatterns": [
            "agent:*:cron:**",
            "agent:<your-heartbeat-agent>:**"
          ]
        }
      }
    }
  }
}

Patterns we use:

  • agent:*:cron:** — every isolated cron run, across every agent
  • agent:spratt-heartbeat:** — the persistent heartbeat session
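You may want to see which session keys a pattern will catch before deploying it. This sketch assumes conventional glob semantics for colon-separated keys: `*` matches within one segment, and `**` matches any remainder, including further colons. How lossless-claw actually interprets the patterns is an assumption here, so confirm it against the plugin's source or docs. The example keys are modeled on the heartbeat key quoted in this section.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a session-key glob, assuming '*' stays within one ':' segment and '**' spans any remainder."""
    regex = ""
    for part in re.split(r"(\*\*|\*)", pattern):  # keep the wildcards as separate parts
        if part == "**":
            regex += ".*"
        elif part == "*":
            regex += "[^:]*"
        else:
            regex += re.escape(part)
    return re.compile(regex)


def is_ignored(session_key: str, patterns: list[str]) -> bool:
    """True if any pattern matches the whole session key."""
    return any(glob_to_regex(p).fullmatch(session_key) for p in patterns)
```

Under these semantics, "agent:*:cron:**" matches "agent:spratt:cron:abc123" but not a key whose third segment is something other than "cron".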

The gateway hot-reloads this config — no restart strictly needed, but verify with one. After applying, every fire of a matching session leaves lcm.db untouched: no new conversation row, no message inserts, updated_at doesn't advance. We confirmed it empirically by snapshotting lcm.db before a heartbeat fire and after — message counts stayed flat while the gateway log still showed agent:spratt-heartbeat:main:heartbeat activity.
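The before/after check described above can be scripted. This sketch assumes the conversations.session_key and messages.conversation_id columns used in this section's backfill SQL. It only counts rows; it does not compare updated_at.

```python
import sqlite3


def snapshot(conn: sqlite3.Connection, key_like: str) -> tuple[int, int]:
    """(conversation rows, message rows) for session keys matching a SQL LIKE pattern."""
    convs = conn.execute(
        "SELECT COUNT(*) FROM conversations WHERE session_key LIKE ?", (key_like,)
    ).fetchone()[0]
    msgs = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE conversation_id IN "
        "(SELECT conversation_id FROM conversations WHERE session_key LIKE ?)",
        (key_like,),
    ).fetchone()[0]
    return convs, msgs
```

Open the real database read-only with `sqlite3.connect("file:/path/to/lcm.db?mode=ro", uri=True)` so the check cannot modify it. Snapshot with a pattern such as 'agent:spratt-heartbeat:%', wait for a fire, and snapshot again. Equal tuples mean nothing was ingested.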

Verify Your Plugin Version

# Check the loaded extension version
grep -E '"version"' ~/.openclaw/extensions/lossless-claw/package.json

# Confirm the option is implemented in the bundled dist
grep -c 'ignoreSessionPattern' ~/.openclaw/extensions/lossless-claw/dist/index.js

If the count is 0, your version doesn't implement the option yet — fall back to the cleanup script below.

One-Time Backfill

Adding ignoreSessionPatterns only stops future ingestion. The accumulated junk in lcm.db is still there. Bulk-clean it once:

-- Back up lcm.db first. Run the deletes in one transaction.
BEGIN;
-- Capture the target conversation ids before deleting anything.
CREATE TEMP TABLE doomed AS
    SELECT conversation_id FROM conversations
    WHERE session_key LIKE 'agent:%:cron:%'
       OR session_key LIKE 'agent:<your-heartbeat-agent>:%';
-- Delete child-table rows first, keyed on the doomed ids (check each table's
-- schema for its join column): summaries, message_parts, summary_messages,
-- context_items, large_files, conversation_bootstrap_state,
-- conversation_compaction_telemetry, conversation_compaction_maintenance.
DELETE FROM messages
WHERE conversation_id IN (SELECT conversation_id FROM doomed);
DELETE FROM conversations
WHERE conversation_id IN (SELECT conversation_id FROM doomed);
COMMIT;
-- Then rebuild FTS indexes and VACUUM (VACUUM cannot run inside a transaction).

In our case this dropped lcm.db from 179 MB to 155 MB and removed 12,712 stale messages across 88 conversations.

The Cleanup Script (still useful as defense in depth, or required pre-0.9.4)

scripts/cron-session-cleanup.py runs daily and:

  1. Reads jobs.json to find all enabled isolated cron jobs
  2. Removes their session entries from sessions.json
  3. Deactivates their conversations in lcm.db
  4. Archives transcript files with 4-week retention

Schedule it via cron, launchd, or systemd. On macOS via launchd see examples/com.openclaw.cron-session-cleanup.plist.

Even with ignoreSessionPatterns, the script is useful as belt-and-suspenders: it catches sessions that match a different shape than your patterns expected, and it archives transcripts that the runtime keeps even when ingestion is skipped.

Don't Forget the Heartbeat

The gateway's built-in heartbeat agent uses a persistent session key (agent:spratt-heartbeat:main:heartbeat) that rots the same way cron jobs do. Add it to your ignoreSessionPatterns (preferred), or to the cleanup script:

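import sqlite3

with sqlite3.connect("/path/to/lcm.db") as conn:  # the context manager commits on success
    cur = conn.execute(
        "UPDATE conversations SET active = 0, archived_at = datetime('now') "
        "WHERE session_key LIKE '%heartbeat%' AND active = 1"
    )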

In our case, the heartbeat accumulated 7,535 messages over 19 days and was replaying all of them every 30 minutes.

Gmail newer_than: is Broken

If your heartbeat or cron jobs use gog gmail search with newer_than:, be aware that newer_than: silently fails when combined with other operators like is:unread. Gmail returns all matching results, ignoring the time filter entirely. This means your agent processes the entire inbox on every run instead of just recent emails.

Use after: with epoch seconds instead:

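# Broken: newer_than silently ignored when combined with is:unread
gog gmail search "is:unread newer_than:30m" -a account@gmail.com

# Fixed: after: with epoch works correctly (BSD/macOS date syntax)
gog gmail search "is:unread after:$(date -v-30M +%s)" -a account@gmail.com

# Same fix with GNU date (Linux)
gog gmail search "is:unread after:$(date -d '30 minutes ago' +%s)" -a account@gmail.com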

This applies to any Gmail query that combines newer_than: with other search operators. We caught it because the heartbeat was re-alerting about a 3-week-old email every 30 minutes.
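If a script builds the query instead of a shell, computing the epoch cutoff in code sidesteps the BSD (`date -v`) versus GNU (`date -d`) difference. This is a minimal sketch, and the function name is illustrative.

```python
import time
from typing import Optional


def recent_unread_query(minutes: int, now: Optional[float] = None) -> str:
    """Gmail search string using after:<epoch seconds> instead of newer_than:."""
    now = time.time() if now is None else now
    cutoff = int(now) - minutes * 60
    return f"is:unread after:{cutoff}"
```

Pass the resulting string to `gog gmail search` exactly as in the shell examples, for instance as one element of a `subprocess.run` argument list.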


9. Reply Noise Suppression

The idea: Without configuration, OpenClaw streams intermediate text blocks as separate messages. A single task can produce 10+ messages of "Let me try X...", "That didn't work...", and tool-call narration. Each message costs output tokens to generate and is then re-injected as conversation history (input tokens) on later turns.

Configuration

{
  "agents": {
    "defaults": {
      "verboseDefault": "off",
      "blockStreamingDefault": "off",
      "blockStreamingBreak": "message_end"
    }
  },
  "channels": {
    "bluebubbles": {
      "blockStreaming": false
    }
  }
}

Prompt Engineering

Add a directive to your SOUL.md:

## Working Silently
Never narrate your tool use or debugging process. Do not send messages like
"Let me check...", "I'll try...", "That didn't work, let me...". Complete
the task and send only the final result.

What This Saves

  • Fewer output tokens generated (no intermediate narration)
  • Smaller conversation history (less re-injection on subsequent turns)
  • Fewer messages delivered to users (better UX)

10. Subagent Model Assignment

The idea: Subagents handle focused subtasks (data lookups, formatting, parallel queries). Their outputs are consumed by the parent agent, not humans. They don't need your most expensive model.

Configuration

{
  "agents": {
    "defaults": {
      "subagents": {
        "model": "google/gemini-2.5-flash",
        "maxConcurrent": 8
      }
    }
  }
}

Flash at $0.30/$2.50 vs. inheriting the parent's Sonnet at $3.00/$15.00 makes each subagent call 10x cheaper on input and 6x cheaper on output.


11. Compaction and Context Pruning

The idea: Conversations grow, and each turn re-sends the entire history. Without management, the 10th message re-transmits all 9 previous turns, so cumulative input-token cost grows quadratically with conversation length.
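Here is the arithmetic behind the quadratic growth. If each turn adds roughly the same number of tokens and every turn re-sends the whole history, turn n bills about n turns' worth of input. The total over N turns is then about N(N+1)/2 turns' worth. This sketch assumes constant tokens per turn and ignores caching.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed over a conversation that re-sends its full history each turn."""
    total = 0
    for n in range(1, turns + 1):
        # Turn n sends the system prompt, the n - 1 earlier turns, and the new message.
        total += system_tokens + n * tokens_per_turn
    return total
```

At 200 tokens per turn, 10 turns bill 11,000 history tokens and 20 turns bill 42,000. Doubling the conversation length nearly quadruples the bill, which is what pruning, compaction, and idle resets cut short.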

Context Pruning

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "15m"
      }
    }
  }
}

After 15 minutes idle, stale context is pruned. Adjust the TTL based on how long your typical conversations last.

Compaction

{
  "agents": {
    "defaults": {
      "compaction": {
        "mode": "safeguard",
        "reserveTokensFloor": 40000
      }
    }
  }
}

When approaching context limits, compaction summarizes older turns. Use Haiku for the summary model:

{
  "plugins": {
    "entries": {
      "lossless-claw": {
        "config": {
          "summaryModel": "anthropic/claude-haiku-4-5-20251001"
        }
      }
    }
  }
}

Memory Flush

{
  "agents": {
    "defaults": {
      "compaction": {
        "memoryFlush": {
          "enabled": true,
          "softThresholdTokens": 4000
        }
      }
    }
  }
}

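The flush runs as the session approaches the compaction threshold (within softThresholdTokens of it, here 4K tokens). It writes durable notes to disk before older turns are summarized, rather than carrying them in the conversation window.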


12. Usage Tracking

The idea: You can't optimize what you can't measure. Track routing decisions and estimate costs daily to catch regressions early.

Router Logging

The iblai-router logs every routing decision to routing.csv:

timestamp,tier,model,score,confidence,reasoning,tokens,query
2026-04-06T14:30:00Z,LIGHT,claude-haiku-4-5-20251001,0.02,0.89,scored,450,
2026-04-06T14:31:00Z,MEDIUM,claude-sonnet-4-6,0.12,0.72,scored,1200,"explain the trip..."

Daily Usage Script

scripts/daily-usage.py parses routing.csv and appends daily summaries to usage-history.csv:

python3 scripts/daily-usage.py           # Today
python3 scripts/daily-usage.py 2026-04-05  # Specific date

Output:

Usage for 2026-04-06:
  Total: 245 requests, ~$0.5894
  LIGHT: 93 requests, ~4200 tokens, ~$0.0032
  MEDIUM: 152 requests, ~82000 tokens, ~$0.5862

What to Watch For

  • LIGHT percentage dropping below 70% — your router boundaries may need retuning
  • A spike in MEDIUM requests — check if a new tool or prompt is inflating scores
  • Sudden cost jump — check the fallback chain, a model may be 429ing
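The first watch item is easy to automate. This minimal sketch assumes the routing.csv columns shown above and ISO-8601 UTC timestamps. It is not scripts/daily-usage.py, and the 70% floor mirrors the rule of thumb above.

```python
import csv
import io
from collections import Counter


def tier_report(csv_text: str, day: str, light_floor: float = 0.70) -> dict:
    """Tier counts and token totals for one UTC day of routing.csv, plus a retune flag."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["timestamp"].startswith(day)]
    counts = Counter(r["tier"] for r in rows)
    tokens = Counter()
    for r in rows:
        tokens[r["tier"]] += int(r["tokens"] or 0)
    total = sum(counts.values())
    light_share = counts["LIGHT"] / total if total else 0.0
    return {
        "total": total,
        "counts": dict(counts),
        "tokens": dict(tokens),
        "light_share": light_share,
        "retune": total > 0 and light_share < light_floor,  # LIGHT below the floor
    }
```

Read the file with `Path("routing.csv").read_text()` and call `tier_report(text, "2026-04-06")`. The two sample rows above give a 50% LIGHT share, so "retune" comes back True.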

13. Prompt Compression

The idea: Fewer tokens in, fewer tokens charged. Write terse prompts and pre-structure data.

Extraction Prompts

Before (45 lines):

Please extract the following information from the text below.
The output should be a valid JSON object with these fields:
- "title": The title of the event (string or null if not found)
- "date": The date in ISO format (string or null)
...
[20 more lines of explanation]
[Pretty-printed JSON example]

After (3 lines):

Extract from text as JSON: {"title":str|null,"date":str|null,"location":str|null}
Only include fields present. Return valid JSON, nothing else.

Same accuracy. ~15x fewer input tokens.
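Generating the compressed prompt from a field list keeps it terse as the schema grows. A strict parser on the way back out drops anything outside the schema. This is a sketch, and the field tuple and helper names are illustrative.

```python
import json

FIELDS = ("title", "date", "location")


def compact_prompt(fields: tuple[str, ...] = FIELDS) -> str:
    """Build the terse extraction prompt shown above from a list of string fields."""
    schema = ",".join(f'"{f}":str|null' for f in fields)
    return (
        f"Extract from text as JSON: {{{schema}}}\n"
        "Only include fields present. Return valid JSON, nothing else."
    )


def parse_reply(reply: str, fields: tuple[str, ...] = FIELDS) -> dict:
    """Parse the model's JSON (raises on invalid JSON) and keep only schema fields; missing ones become None."""
    data = json.loads(reply)
    return {f: data.get(f) for f in fields}
```

Append the source text after `compact_prompt()` and feed the model's reply to `parse_reply`. A `json.JSONDecodeError` there is a cheap signal to retry or log the reply.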

Pipeline Architecture

Separate data gathering (deterministic scripts, zero tokens) from composition (single LLM call with pre-structured data). Don't make the LLM do what grep, sqlite3, or jq can do for free.


14. Session Reset on Idle

The idea: Long-running sessions accumulate context that inflates every subsequent turn. Reset them.

{
  "session": {
    "reset": {
      "idleMinutes": 60
    }
  }
}

After 60 minutes of inactivity, the session resets. The next interaction starts fresh instead of carrying stale history. Adjust based on your usage patterns — shorter for chatbot-style agents, longer for agents doing extended work.


15. Move High-Volume Extraction to Subscription-Backed Codex

The idea: Some workloads genuinely need an LLM, but they do not need to stay on a metered API model if you already pay for a subscription-backed model route. Email triage and extraction is the best example: the model should decide what an email means, but deterministic code should execute the writes.

What We Found

Gemini Flash looked cheap enough for scheduled email scanning, but in production it was still consuming roughly $80/month once all the repeated header triage, body extraction, PDF extraction, retries, and context overhead were included.

That spend survived earlier optimizations because the architecture still had two direct Gemini call sites:

gather email headers -> Gemini header classification -> Gemini body/PDF extraction -> deterministic writes

The important distinction: the problem was not "LLM vs no LLM." Email classification and extraction are language tasks. The problem was leaving a high-volume language task on a metered API when an already-paid subscription-backed Codex route was available.

The Architecture Change

Split the pipeline into explicit stages:

gather -> decide -> extract actions -> run actions

| Stage | Uses LLM? | Responsibility |
|---|---|---|
| gather | No | Fetch recent headers from email providers. |
| decide | Yes | Cheap header-only triage: skip vs actionable category. |
| extract actions | Yes | Read only actionable bodies/PDFs and produce structured JSON actions. |
| run actions | No | Write orders/trips/reminders/outbox and mark email read after success. |

This preserves the two-stage cost optimization:

  1. First LLM pass sees only headers.
  2. Second LLM pass runs only for actionable emails.
  3. The final write path has no model calls.

The Model Change

Replace direct Gemini HTTP calls:

https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent

with OpenClaw gateway inference:

openclaw infer model run \
  --gateway \
  --model openai-codex/gpt-5.5 \
  --json \
  --prompt "$PROMPT"

This keeps the call on OpenClaw's supported model surface instead of inventing a direct provider integration.

PDF Attachments

Do not bolt on a local PDF parser as a workaround. Use OpenClaw's PDF stack.

For interactive/manual flows, use the native pdf tool. For scheduled scripts, use OpenClaw's bundled document extraction plugin for PDF text extraction, then send that extracted text to Codex for structured JSON.

The important rule:

PDF bytes -> OpenClaw document-extract/pdf -> extracted text -> Codex structured extraction -> deterministic writes

That avoids paying Gemini for inline PDF extraction and avoids fragile ad hoc parsers.

Validation Checklist

Before turning this on, verify:

  • The Codex route works:

    openclaw infer model run --gateway --model openai-codex/gpt-5.5 --json --prompt '{"ok":true}'
  • The live scripts no longer contain direct Gemini/model-key calls:

    rg -n "gemini|GEMINI|generateContent|API_KEY|call_flash" path/to/email-scan
  • Synthetic daycare/school email produces an action instead of being skipped.

  • Synthetic travel PDF produces structured flight data.

  • The deterministic runner passes --dry-run.

  • Real writes and mark-read happen only after successful deterministic execution.

What This Saves

In our production setup, this moved email scanning's LLM spend off the Gemini API path. The exact savings depend on your subscription and volume, but the target was an observed ~$80/month Gemini workload that should not have been variable API spend.

Caveat

Codex is not automatically cheaper for every workload. This optimization makes sense when:

  • You already pay for a subscription-backed Codex/OpenAI route.
  • The workload is high-volume and language-heavy.
  • You can keep deterministic writes outside the LLM.
  • You validate quality on the specific cases you care about.

Do not move deterministic shell scripts to Codex. Move only the language/extraction stages.


16. Ambient Monitor (Heartbeat) Configuration

The idea: A "heartbeat" agent that fires every 30 minutes around the clock is the single biggest variable-cost surface in an ambient assistant: 48 fires/day, every day, forever. Suppose that agent inherits your default context: bootstrap files, memory search, full tool prompts, and broad query scope. You then pay on every fire for bootstrap injection (up to the 150K-char total cap from #6) plus unbounded tool output. Most heartbeats can and should be empty (HEARTBEAT_OK).

This section consolidates the heartbeat-specific knobs (some appear in earlier sections) into one config recipe. Every setting starves a different cost surface. Skip any of them and the savings degrade fast.

The full agent config

{
  "agents": {
    "list": [
      {
        "id": "spratt-heartbeat",
        "model": {
          "primary": "google/gemini-2.5-flash",
          "fallbacks": ["anthropic/claude-haiku-4-5-20251001"]
        },
        "memorySearch": { "enabled": false },
        "heartbeat": {
          "every": "30m",
          "model": "google/gemini-2.5-flash",
          "prompt": "Follow HEARTBEAT.md EXACTLY. Run ONLY the commands written there. ...",
          "isolatedSession": true,
          "lightContext": true,
          "suppressToolErrorWarnings": true
        },
        "tools": { "deny": ["write", "edit"] },
        "workspace": "/path/to/workspace"
      }
    ]
  }
}

What each setting does

| Setting | What it saves |
|---|---|
| model.primary: gemini-2.5-flash | Cheapest API model for ambient orchestration. |
| model.fallbacks: [haiku] | Prevents a Flash 429 from cascading to Sonnet (see #7). |
| memorySearch.enabled: false | Skips the per-turn memory search call entirely. The heartbeat doesn't compose user-facing prose, so it doesn't need recall. |
| heartbeat.isolatedSession: true | Each fire is a fresh session with no carried-over conversation history. Combined with ignoreSessionPatterns (see #8), the runtime also stops accumulating these in lcm.db. |
| heartbeat.lightContext: true | Skips bootstrap injection (no SOUL.md, AGENTS.md, TOOLS.md, IDENTITY.md, USER.md, MEMORY.md), which can run up to the 150K-char total bootstrap cap per fire (see #6). |
| heartbeat.suppressToolErrorWarnings: true | Silences harmless tool errors (missing optional files, timed-out probes) so they don't pull the model into a follow-up turn to "investigate." |
| tools.deny: ["write", "edit"] | Read-only watchdog. Removes the temptation for the model to "fix" something and burn extra turns. |
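Skipping any one of these knobs degrades the savings, so a quick check against the table can run in CI or before a gateway restart. This sketch assumes the `agents.list[]` entry shape shown in the full agent config above. It only checks that the settings are present with the right values, not whether the gateway honors them.

```python
def heartbeat_gaps(agent: dict) -> list[str]:
    """Settings from the table above that this agent entry is missing or has set differently."""
    hb = agent.get("heartbeat", {})
    checks = {
        "memorySearch.enabled: false": agent.get("memorySearch", {}).get("enabled") is False,
        "heartbeat.isolatedSession: true": hb.get("isolatedSession") is True,
        "heartbeat.lightContext: true": hb.get("lightContext") is True,
        "heartbeat.suppressToolErrorWarnings: true": hb.get("suppressToolErrorWarnings") is True,
        "tools.deny includes write and edit": {"write", "edit"} <= set(agent.get("tools", {}).get("deny", [])),
        "model.fallbacks set": bool(agent.get("model", {}).get("fallbacks")),
    }
    return [name for name, ok in checks.items() if not ok]
```

Load openclaw.json, find the entry whose "id" is your heartbeat agent, and fail the check if the returned list is non-empty.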

The prompt + scoped tool runbook

The agent prompt should be one paragraph that points at a separate, fully-prescriptive runbook. Let the prompt say only "follow HEARTBEAT.md literally — never construct commands not in this file." Put every tool invocation in the runbook with exact arguments, time windows, and accounts. Example shape:

## Check 1 — Forwarded emails

gog gmail search "is:unread after:$(date -v-30M +%s)" -a forwarder@example.com

Act on matches. Mark as read.

## Check 2 — ...

Why this shape:

  • One account, not all accounts. The heartbeat does not need to scan every mailbox. Pick the one address used as a forwarding/inbox-of-record and search only that.
  • is:unread filters out everything already actioned. Nothing in the inbox of yesterday's mail should be re-evaluated 48 times tomorrow.
  • after:$(date -v-30M +%s), not newer_than:30m. newer_than: silently fails when combined with other Gmail operators (see #8) and dumps the full inbox into context.
  • Headers only. gog gmail search returns ~250 bytes per email by default — enough to triage. Reading bodies is what the email-scan pipeline does on its own schedule (see #15), not the heartbeat.
  • No SQLite queries. The runbook should explicitly forbid the agent from inventing sqlite3 commands. Other systems own those databases.

Fold the heartbeat into ignoreSessionPatterns

Even with isolatedSession: true and lightContext: true, the gateway will still hand turn data to the context engine for ingestion unless told otherwise. Add the heartbeat's session key prefix to ignoreSessionPatterns (see #8):

"ignoreSessionPatterns": [
  "agent:*:cron:**",
  "agent:spratt-heartbeat:**"
]

Without this, the heartbeat agent's session_key reuses across fires and lcm.db grows linearly with every beat.

What we measured

Before this configuration: heartbeat conversations had accumulated 7,535 messages over 19 days, replaying all of them as input every 30 minutes. Cost was material on Flash and would have been catastrophic on Haiku.

After this configuration: each fire injects only the tightly-scoped prompt + runbook + the heartbeat's own tool output. We empirically verified post-config: gateway log shows the heartbeat firing on schedule, lcm.db row counts and updated_at timestamps for the heartbeat's conversation row do not advance, message counts stay flat. Most fires return a single HEARTBEAT_OK token.

When this pattern applies more broadly

Any always-on, fixed-schedule agent that mostly observes (no-op, OK, "nothing to report") fits this shape:

  • Health monitors and watchdogs
  • Inventory/queue probes
  • Alert correlators

Conversely, do not apply this pattern to agents that compose user-facing prose every turn (briefings, digests). They legitimately need the bootstrap and memory.


Summary

| Technique | What it does | Savings |
|---|---|---|
| Model routing | 80% of interactive traffic to Haiku | ~2x |
| Cron model tiering | Haiku for composition, Flash for orchestration | ~3x (Haiku) to ~6-10x (Flash) vs. Sonnet on cron |
| Exec payloads | Skip LLM for deterministic work | 100% on orchestration |
| Prompt caching | 90% discount on repeated system prompt | ~90% on cached input |
| lightContext | Minimal bootstrap for cron jobs | ~80% per cron turn |
| Bootstrap optimization | Move reference data to lazy-loaded skills | ~15-20% per turn |
| Fallback chain | Flash -> Haiku (not Sonnet) on 429s | Prevents 10x spike events |
| Session cleanup | ignoreSessionPatterns (0.9.4+) plus daily cleanup | Prevented a 16x cost spike ($4 → $65/mo) |
| Reply suppression | No intermediate messages | Reduces output + history |
| Subagent assignment | Flash for all subagents | ~6-10x vs. Sonnet |
| Compaction | Haiku-based summarization | Cheap context management |
| Usage tracking | CSV logging + daily reports | Catches regressions |
| Prompt compression | Terse prompts, pre-structured data | Per-prompt savings |
| Session reset | Idle session reset | Prevents stale carry |
| Codex for extraction | Move high-volume language extraction off metered Gemini | Removed an observed ~$80/mo variable API workload |
| Ambient monitor config | Starve every context surface on the heartbeat agent | Eliminated bootstrap, memory search, tool noise, and lcm.db growth on a 48-fire/day Gemini agent |

Scripts

Examples


License

MIT


Built by @kmanan running Spratt, a 24/7 autonomous OpenClaw family butler. These optimizations emerged from a month of production operation and many painful cost spikes.
