
Lite: Add scheduled database compaction to prevent DuckDB bloat #161

@erikdarlingdata

Description


Problem

DuckDB's storage engine is append-oriented. When ArchiveService DELETEs old rows after exporting to parquet, the space is marked free but never reclaimed, and VACUUM does not reclaim append-fragmented space. Over time, the database file grows monotonically — we observed 3.8GB for just 35MB of real data (110x bloat), causing 3-7 second collection times and 50-60% CPU usage.

The only fix is export/reimport (compaction), which we did manually to go from 3.8GB → 323MB.

Proposed solution

Add automated daily compaction to prevent bloat from recurring:

  1. Daily compaction — after the hourly archive cycle, once per day:

    • Pause collection
    • CHECKPOINT (flush WAL)
    • Export all tables to a temp DuckDB file
    • Close connections, swap files (old → .bak, temp → primary)
    • Reopen connections, resume collection
    • Delete .bak
  2. Size watchdog — log a warning if database file exceeds a threshold (e.g., 1GB) between compaction cycles, so runaway bloat is caught early.
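The file swap in step 1 and the watchdog in step 2 reduce to plain file operations. A stdlib-only Python sketch (names and the 1GB threshold are illustrative; in the real service, connections are closed before the swap and reopened before the backup is deleted):

```python
import os

SIZE_THRESHOLD = 1 << 30  # 1GB watchdog threshold from step 2

def swap_files(primary: str, temp: str) -> None:
    """Replace the bloated primary file with the freshly exported temp file."""
    bak = primary + ".bak"
    os.replace(primary, bak)    # old -> .bak (atomic on the same filesystem)
    os.replace(temp, primary)   # temp -> primary
    # The service would reopen connections here before cleaning up.
    os.remove(bak)              # delete .bak once the swap succeeded

def size_warning(db_path: str, threshold: int = SIZE_THRESHOLD) -> bool:
    """Return True if the database file exceeds the watchdog threshold."""
    return os.path.getsize(db_path) > threshold
```

Keeping the `.bak` copy until the new file is reopened means a failed swap can be rolled back by renaming `.bak` back to the primary path.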

Context

  • Compaction takes ~2-5 seconds for a 300MB database
  • Collection gap is negligible (at most one 1-minute cycle missed)
  • VACUUM is not an alternative — it does not reclaim append-fragmented space in DuckDB
  • The checkpoint_threshold=1GB setting (PR #159, "DuckDB checkpoint optimization and timing fix") prevents checkpoint stalls but does not address long-term file growth

Files involved

  • Lite/Database/DuckDbInitializer.cs — compaction logic (owns file path and connection string)
  • Lite/Services/CollectionBackgroundService.cs — daily compaction timer alongside existing archive/retention timers
