perf: cache DB reads during compaction to reduce SQLite lock contention #291

@100yenadmin

Description

Problem

During compaction, lossless-claw issues 120-160 redundant DB operations per full sweep against a single SQLite database shared by 5+ concurrent agent sessions. On a 1.9GB database with 219K messages, this causes significant event loop blocking (the real bottleneck, not SQLite locks) that can stall the gateway during bursts.

Root Cause Analysis (Red Team / Blue Team)

A 4-agent adversarial investigation revealed:

The repeated reads are NOT accidental: after each compaction pass, `replaceContextRangeWithSummary` resequences ALL ordinals, so any pre-write cache is instantly stale. The code deliberately re-reads to ensure correctness.

However, many reads are redundant WITHIN the same phase (before any write occurs):

  • `getContextItems()` called 3x before the first write in `compactLeaf`
  • `getMessageById()` called individually 50-200 times instead of batch
  • `getSummary()` called individually 40-60 times per sweep instead of batch

The real bottleneck is `DatabaseSync` (Node.js synchronous SQLite). Every query blocks the event loop. At burst rates (30 afterTurn/min with 5 agents), event loop utilization hits 7-22%. At DB sizes >4-5GB, cold queries take 50-200ms each, causing HTTP timeouts and LLM streaming stalls.
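
Because `DatabaseSync` runs every query on the main thread, the blocking is easy to quantify with a thin timing wrapper. The helper below is an illustrative sketch, not part of lossless-claw:

```typescript
// Sketch: measure how long a synchronous call blocks the event loop.
// Nothing else (HTTP handlers, LLM streaming) can run while fn() executes.
function timeSync<T>(fn: () => T): { result: T; blockedMs: number } {
  const start = performance.now(); // global in Node 16+
  const result = fn();
  return { result, blockedMs: performance.now() - start };
}
```

Wrapping the hot queries this way (e.g. `timeSync(() => stmt.all())`) would confirm the 50-200ms cold-query figure directly from production logs.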

Proposed Optimizations (ranked by impact × safety)

1. Per-phase context cache (EASY, 8-12 reads saved/sweep)

Cache `getContextItems()` once at the start of each phase. Invalidate after every write. The red team confirmed: within a phase (before any write), the data is immutable.

2. Token count delta tracking (EASY, 3-11 reads saved/sweep)

After each `leafPass`/`condensedPass`, compute `tokensAfter = tokensBefore - chunkSourceTokens + summaryTokenCount` instead of re-querying. The arithmetic is exact: it uses the same values the DB would sum. This eliminates the most expensive query (`getContextTokenCount`, a double-JOIN aggregate).
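
The delta is pure arithmetic; a sketch (field names mirror the description above, not necessarily the real code):

```typescript
// Sketch: replace the post-pass getContextTokenCount() re-query with
// exact delta arithmetic over values the pass already has in hand.
interface PassDelta {
  tokensBefore: number;      // context token count entering the pass
  chunkSourceTokens: number; // tokens of the messages compacted away
  summaryTokenCount: number; // tokens of the summary that replaced them
}

function tokensAfter(d: PassDelta): number {
  // Same values the double-JOIN aggregate would sum, so the result is exact.
  return d.tokensBefore - d.chunkSourceTokens + d.summaryTokenCount;
}
```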

3. Summary lookup cache (EASY, 40-60 reads saved/sweep)

At the start of the condensed phase, batch-fetch all summaries referenced in context items into a Map. Summaries are immutable after creation, so there is zero staleness risk.
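
A sketch of the batch fetch (table and column names are assumptions for illustration):

```typescript
// Sketch: one IN query instead of 40-60 getSummary() calls, indexed
// into a Map for O(1) lookup during the condensed phase.
interface SummaryRow { id: number; text: string; }

function summaryBatchSql(ids: number[]): string {
  const placeholders = ids.map(() => "?").join(", ");
  return `SELECT id, text FROM summaries WHERE id IN (${placeholders})`;
}

function indexSummaries(rows: SummaryRow[]): Map<number, SummaryRow> {
  // Summaries are immutable after creation, so this map never goes stale.
  return new Map(rows.map((r) => [r.id, r]));
}
```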

4. Batch message lookups (MODERATE, 60-70 → 2-3 per leaf iteration)

Replace N individual `getMessageById()` calls with `SELECT ... WHERE message_id IN (...)`. Messages are immutable, so there is zero staleness risk. This yields the biggest event loop savings (~200-400ms/sweep).
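
A sketch of the batched lookup, chunked because older SQLite builds cap bound parameters at 999. The `Db` interface mirrors the documented `node:sqlite` `DatabaseSync`/`StatementSync` surface; table and column names are assumptions:

```typescript
// Sketch: chunked IN-query replacement for per-message reads.
interface Db {
  prepare(sql: string): { all(...params: unknown[]): unknown[] };
}

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

function getMessagesByIds(db: Db, ids: number[]): unknown[] {
  const rows: unknown[] = [];
  // 900 stays under SQLite's historical 999-parameter default.
  for (const group of chunk(ids, 900)) {
    const ph = group.map(() => "?").join(", ");
    rows.push(
      ...db.prepare(`SELECT * FROM messages WHERE message_id IN (${ph})`).all(...group),
    );
  }
  return rows;
}
```

Since messages are immutable, the fetched rows can also be kept for the whole sweep rather than per iteration.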

5. DB size management — archival/VACUUM (MODERATE)

At 10K messages/day, the DB grows ~25-30MB/day (messages are never deleted, per the lossless design). Projected: 7.5GB at 6 months, 13GB at 1 year. Performance cliffs appear at 4-5GB. Needed: periodic VACUUM, old-conversation archival, or tiered storage.
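
The projections above are straight-line arithmetic; a sketch using the issue's figures (growth rate and cliff threshold are the numbers quoted, not measured constants):

```typescript
// Sketch: when does the DB cross the 4-5GB performance cliff?
const GROWTH_MB_PER_DAY = 30; // upper end of the observed 25-30MB/day range
const CLIFF_MB = 4 * 1024;    // ~4GB, where cold queries start degrading

function projectedSizeMB(currentMB: number, days: number): number {
  return currentMB + days * GROWTH_MB_PER_DAY;
}

function daysUntilCliff(currentMB: number): number {
  return Math.max(0, Math.ceil((CLIFF_MB - currentMB) / GROWTH_MB_PER_DAY));
}
```

From today's 1.9GB (~1946MB), that puts the cliff roughly 2-3 months out, which is why VACUUM/archival needs to land before the read-path optimizations stop being enough.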

What NOT to do

  • Don't change ordinal resequencing. The 2×N UPDATE with negative temp ordinals is the only correct approach; SQLite doesn't support deferred UNIQUE constraints.
  • Don't add read-only connections. With single-threaded `DatabaseSync`, this adds complexity with no benefit.
  • Don't cache across writes. Ordinals change after every write; per-phase caching with invalidation is the correct pattern.

Concurrent Load Model

For the deployment (1 main Opus agent + 4 Sonnet sub-agents, 10K messages/day):

| Metric | Average | Burst (30/min) |
| --- | --- | --- |
| DB ops/minute | 70-126 | 300-540 |
| Event loop blocked | 1.7-5.2% | 7-22% |
| Write lock duration | 25-60ms | Same (per-transaction) |
| Reader wait (WAL) | 0ms | 0ms |
| DB growth | 25-30MB/day | |
| 6-month projected size | 7.5GB | |
| First failure mode | Event loop blocking at 4-5GB+ | |

Impact

Combined optimizations reduce per-sweep DB operations from ~120-160 to ~30-40 (75% reduction). Event loop blocking during compaction drops proportionally.
