feat: implement automated backup and restore system #449

@Aureliolo

Summary

Implement a comprehensive backup and restore system that protects all persistent data — memory, agent state, persistence database, configuration, and audit logs. Backups should run automatically on a schedule, at shutdown, and at startup (snapshot before changes), with user-configurable paths, retention policies, and restore capabilities.

Motivation

SynthOrg manages valuable state: agent memories, task history, security audit trails, budget records, and organizational knowledge. A crash, bad deployment, or corrupted database with no backup means data loss. Users need confidence that their synthetic organization's institutional knowledge is protected.

Design

What Gets Backed Up

| Data | Source | Format |
| --- | --- | --- |
| Persistence DB | SQLite database (synthorg.db): tasks, audit log, agent state, budget records | Consistent copy (VACUUM INTO or SQLite online backup API) |
| Agent memory | Mem0 backend (Qdrant embedded + SQLite metadata) | Directory snapshot of memory dir |
| Organization memory | Shared org facts (SQLite-backed OrgFactStore) | Included in persistence DB |
| Company config | YAML company configuration | File copy |
| Checkpoints | Per-turn execution checkpoints | Directory snapshot |

Backup Triggers

| Trigger | When | Behavior |
| --- | --- | --- |
| Scheduled | Configurable interval (default: every 6 hours) | Background task, non-blocking |
| Pre-shutdown | Company.shutdown() / SIGTERM handler | Synchronous, must complete before exit |
| Post-startup | After config load, before accepting tasks | Snapshot current state as recovery point |
| Manual | POST /api/v1/admin/backup or CLI synthorg backup | On-demand, returns backup ID |
| Pre-migration | Before schema migrations run | Automatic, tagged as pre-migration |
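The scheduled trigger could be sketched as a single asyncio task, as below. This is a minimal sketch and not the project's implementation: `run_scheduled_backups`, the `do_backup` callback, and the `stop` event are hypothetical names introduced for illustration. It shows two properties taken from this issue: the loop runs in the background at a fixed interval, and a failed backup is logged at ERROR without stopping the loop.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("backup.scheduler")


async def run_scheduled_backups(
    do_backup: Callable[[str], Awaitable[str]],
    interval_seconds: float,
    stop: asyncio.Event,
) -> int:
    """Call do_backup("scheduled") once per interval until stop is set.

    Returns the number of successful backups. Failures are logged at
    ERROR and never propagate, so the caller keeps running.
    """
    completed = 0
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake early if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
            break  # stop was set while waiting
        except asyncio.TimeoutError:
            pass  # interval elapsed, time to back up
        try:
            backup_id = await do_backup("scheduled")
            logger.info("scheduled backup finished: %s", backup_id)
            completed += 1
        except Exception:
            # logger.exception logs at ERROR level with the traceback.
            logger.exception("scheduled backup failed")
    return completed
```

With the default configuration the interval would be `6 * 3600` seconds. Setting the `stop` event from the shutdown path ends the loop promptly, which leaves room for the synchronous pre-shutdown backup to run afterwards.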

Configuration

backup:
  enabled: true
  path: "/data/backups"              # Where to store backups
  schedule_hours: 6                   # Interval for scheduled backups
  retention:
    max_count: 10                     # Maximum number of backups to keep
    max_age_days: 30                  # Delete backups older than this
  on_shutdown: true                   # Backup before shutdown
  on_startup: true                    # Snapshot on startup
  compression: true                   # gzip compress backup archives
  include:
    - persistence                     # SQLite database
    - memory                          # Agent + org memory
    - config                          # Company YAML config
    - checkpoints                     # Execution checkpoints
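One way the backup config schema might be modeled is shown below. It is a sketch under assumptions: the class names `BackupConfig` and `RetentionConfig`, the function `parse_backup_config`, and the specific validation rules are hypothetical and not part of this issue. The defaults mirror the YAML above, and the input is assumed to be the already-parsed `backup:` mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping

KNOWN_COMPONENTS = ("persistence", "memory", "config", "checkpoints")


@dataclass(frozen=True)
class RetentionConfig:
    max_count: int = 10
    max_age_days: int = 30


@dataclass(frozen=True)
class BackupConfig:
    enabled: bool = True
    path: str = "/data/backups"
    schedule_hours: float = 6
    retention: RetentionConfig = field(default_factory=RetentionConfig)
    on_shutdown: bool = True
    on_startup: bool = True
    compression: bool = True
    include: tuple[str, ...] = KNOWN_COMPONENTS


def parse_backup_config(raw: Mapping[str, Any]) -> BackupConfig:
    """Build a BackupConfig from the parsed `backup:` mapping.

    Missing keys fall back to the defaults; unknown keys raise TypeError
    from the dataclass constructor.
    """
    data = dict(raw)
    retention = RetentionConfig(**data.pop("retention", {}))
    include = tuple(data.pop("include", KNOWN_COMPONENTS))
    cfg = BackupConfig(retention=retention, include=include, **data)

    # Validation rules below are assumptions, not taken from the issue.
    unknown = sorted(set(cfg.include) - set(KNOWN_COMPONENTS))
    if unknown:
        raise ValueError(f"unknown backup components: {unknown}")
    if cfg.schedule_hours <= 0:
        raise ValueError("schedule_hours must be positive")
    if cfg.retention.max_count < 1 or cfg.retention.max_age_days < 1:
        raise ValueError("retention limits must be at least 1")
    return cfg
```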

Backup Format

Each backup is a timestamped directory, or a single .tar.gz archive of that directory when compression is enabled:

backups/
  2026-03-15T14-30-00_scheduled/
    manifest.json           # Metadata: timestamp, trigger, version, included components
    synthorg.db             # SQLite backup (via VACUUM INTO or backup API)
    memory/                 # Memory directory snapshot
    config/                 # Company config snapshot
    checkpoints/            # Checkpoint data
  2026-03-15T14-30-00_scheduled.tar.gz  # If compression enabled
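The naming scheme in the listing above could be produced and parsed roughly as follows. This is an illustrative sketch: the helper names `backup_dir_name` and `parse_backup_dir_name` are hypothetical, and it assumes timestamps are timezone-aware and rendered in UTC, with colons replaced by hyphens as in the example names.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y-%m-%dT%H-%M-%S"  # colons replaced so the name is filesystem-safe


def backup_dir_name(when: datetime, trigger: str) -> str:
    """Return e.g. '2026-03-15T14-30-00_scheduled' for an aware datetime."""
    stamp = when.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"{stamp}_{trigger}"


def parse_backup_dir_name(name: str) -> tuple[datetime, str]:
    """Split a backup directory or archive name into (timestamp, trigger)."""
    base = name.removesuffix(".tar.gz")
    # The timestamp contains no underscore, so the first one is the separator.
    stamp, _, trigger = base.partition("_")
    when = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return when, trigger
```

Because the timestamp is fixed-width and comes first, sorting the names lexicographically also sorts the backups chronologically.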

Manifest

{
  "version": "1",
  "synthorg_version": "0.2.4",
  "timestamp": "2026-03-15T14:30:00Z",
  "trigger": "scheduled",
  "components": ["persistence", "memory", "config", "checkpoints"],
  "db_schema_version": 3,
  "size_bytes": 1048576,
  "checksum": "sha256:abc123..."
}
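A manifest like the one above might be generated along these lines. The issue does not say what the checksum covers, so this sketch assumes a SHA-256 over the relative path and contents of every file in the backup directory except manifest.json itself. The function names `directory_digest` and `write_manifest` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

MANIFEST_NAME = "manifest.json"


def directory_digest(root: Path) -> tuple[str, int]:
    """Hash every file under root except the top-level manifest.

    Returns ("sha256:<hex>", total_bytes). Files are visited in sorted
    relative-path order so the digest is reproducible.
    """
    digest = hashlib.sha256()
    total = 0
    manifest_path = root / MANIFEST_NAME
    files = sorted(
        (p for p in root.rglob("*") if p.is_file() and p != manifest_path),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        # Mix the relative path in so that renames change the digest too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8") + b"\0")
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
                total += len(chunk)
    return f"sha256:{digest.hexdigest()}", total


def write_manifest(
    root: Path,
    trigger: str,
    components: list[str],
    synthorg_version: str,
    db_schema_version: int,
    when: datetime,
) -> dict:
    """Write manifest.json into root and return its contents."""
    checksum, size = directory_digest(root)
    manifest = {
        "version": "1",
        "synthorg_version": synthorg_version,
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "trigger": trigger,
        "components": list(components),
        "db_schema_version": db_schema_version,
        "size_bytes": size,
        "checksum": checksum,
    }
    (root / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Excluding the manifest from the digest means `directory_digest` can be re-run at restore time and compared against the stored `checksum` field.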

Restore

  • API: POST /api/v1/admin/restore with a backup ID; the restore requires a shutdown and restart
  • CLI: synthorg backup restore <backup-id> stops running containers, restores, then restarts them
  • Restore validates manifest version compatibility before overwriting
  • Pre-restore backup taken automatically (safety net)
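The compatibility check in the restore list above might look like the following sketch. The exact rules are not specified in this issue, so the policy here is an assumption: only manifest version "1" is accepted, and a backup whose `db_schema_version` is newer than the running schema is rejected, while older ones are allowed on the assumption that migrations can bring them forward. `check_restorable` is a hypothetical name.

```python
SUPPORTED_MANIFEST_VERSIONS = {"1"}  # assumption: only the format shown in this issue


def check_restorable(manifest: dict, current_schema_version: int) -> list[str]:
    """Return a list of problems; an empty list means the restore may proceed."""
    problems: list[str] = []

    version = manifest.get("version")
    if version not in SUPPORTED_MANIFEST_VERSIONS:
        problems.append(f"unsupported manifest version: {version!r}")

    schema = manifest.get("db_schema_version")
    if not isinstance(schema, int):
        problems.append("manifest has no integer db_schema_version")
    elif schema > current_schema_version:
        # Assumed policy: a newer schema cannot be downgraded on restore.
        problems.append(
            f"backup schema {schema} is newer than running schema {current_schema_version}"
        )

    if not manifest.get("components"):
        problems.append("manifest lists no components")

    return problems
```

Returning every problem at once, rather than raising on the first, lets the API or CLI report the full reason for a rejected restore before anything is overwritten.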

Retention

  • After each backup, prune old backups exceeding max_count or max_age_days
  • Never prune the most recent backup regardless of age
  • Never prune backups tagged as pre-migration (kept until explicitly deleted)
  • Log pruned backups at INFO level
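The retention rules above can be sketched as a pure selection function that decides which backups to delete. The names `BackupInfo` and `select_prunable` are hypothetical. One point the rules leave open is whether pre-migration backups count toward `max_count`; this sketch assumes they do not.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupInfo:
    backup_id: str
    timestamp: datetime
    trigger: str


def select_prunable(
    backups: list[BackupInfo],
    max_count: int,
    max_age_days: int,
    now: datetime,
) -> list[BackupInfo]:
    """Return the backups that retention should delete, newest first."""
    if not backups:
        return []
    newest_first = sorted(backups, key=lambda b: b.timestamp, reverse=True)
    latest = newest_first[0]
    cutoff = now - timedelta(days=max_age_days)

    # Pre-migration backups are never pruned and (by assumption) are
    # not counted against max_count.
    candidates = [b for b in newest_first if b.trigger != "pre-migration"]

    prunable = []
    for position, backup in enumerate(candidates):
        if backup is latest:
            continue  # never prune the most recent backup, whatever its age
        if position >= max_count or backup.timestamp < cutoff:
            prunable.append(backup)
    return prunable
```

Keeping the decision separate from the deletion makes the rules easy to unit test, and the caller can log each returned backup at INFO level as it removes it.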

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/admin/backup | Trigger manual backup |
| GET | /api/v1/admin/backups | List available backups |
| GET | /api/v1/admin/backups/{id} | Get backup details |
| DELETE | /api/v1/admin/backups/{id} | Delete a specific backup |
| POST | /api/v1/admin/restore | Restore from backup (requires confirmation) |

CLI Commands

| Command | Description |
| --- | --- |
| synthorg backup | Trigger manual backup |
| synthorg backup list | List available backups |
| synthorg backup restore <id> | Restore from backup |

Implementation Notes

  • SQLite backup: Use VACUUM INTO for a consistent point-in-time copy (avoids WAL complications). For Mem0's embedded Qdrant, snapshot the data directory while the backend is paused.
  • Concurrency: Scheduled backups must not block task execution (the pre-shutdown backup is the one synchronous exception). Use a dedicated asyncio task with appropriate locking: pause writes briefly for DB consistency, then resume.
  • Docker volume awareness: In Docker deployments, backups write to the synthorg-data volume. The CLI can mount a host path for backup extraction.
  • Error handling: Backup failures log at ERROR but don't crash the runtime. Restore failures abort cleanly without corrupting current state.
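The SQLite note above could be realized with Python's standard sqlite3 module, as in this minimal sketch. `snapshot_sqlite` is a hypothetical helper, not an existing SynthOrg function. It assumes the linked SQLite library is version 3.27 or newer, which is when VACUUM INTO was introduced, and it adds an integrity check on the copy that the issue does not ask for.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(source: Path, dest: Path) -> None:
    """Write a consistent point-in-time copy of source to dest."""
    if not source.is_file():
        raise FileNotFoundError(source)
    if dest.exists():
        # VACUUM INTO refuses to overwrite a non-empty file; fail early.
        raise FileExistsError(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)

    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(source, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (str(dest),))
    finally:
        conn.close()

    # Extra safety step (an assumption): verify the copy before trusting it.
    check = sqlite3.connect(dest)
    try:
        result = check.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        check.close()
    if result != "ok":
        dest.unlink()
        raise RuntimeError(f"backup failed integrity check: {result}")
```

The online backup API is available from the same module as `sqlite3.Connection.backup` if VACUUM INTO is not an option. Since the call blocks, an asyncio caller would typically run it through `asyncio.to_thread` so the event loop keeps serving tasks.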

Affected Modules

  • src/synthorg/persistence/ — backup/restore methods on PersistenceBackend protocol
  • src/synthorg/memory/ — backup/restore on MemoryBackend protocol
  • src/synthorg/config/ — backup config schema
  • src/synthorg/api/controllers/ — admin backup endpoints
  • src/synthorg/engine/ — shutdown/startup hooks
  • cli/cmd/ — Go CLI backup subcommands

Dependencies

Acceptance Criteria

  • Scheduled backups run at configured interval
  • Pre-shutdown backup completes before process exits
  • Post-startup snapshot taken before accepting tasks
  • Manual backup via API and CLI
  • Configurable backup path, retention count, retention age
  • Compression option (gzip)
  • Manifest with version, timestamp, trigger, checksum
  • Restore validates compatibility before overwriting
  • Pre-restore safety backup taken automatically
  • Retention pruning respects max_count and max_age_days, and never prunes the latest backup or backups tagged pre-migration
  • Backup failures don't crash the runtime
  • Restore failures don't corrupt current state

Metadata

Assignees

No one assigned

    Labels

    prio:high (Important, should be prioritized)
    scope:large (3+ days of work)
    spec:architecture (DESIGN_SPEC Section 15 - Technical Architecture)
    type:feature (New feature implementation)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
