feat: implement automated backup and restore system #449
Closed
Labels
prio:high (Important, should be prioritized), scope:large (3+ days of work), spec:architecture (DESIGN_SPEC Section 15 - Technical Architecture), type:feature (New feature implementation)
Description
Summary
Implement a comprehensive backup and restore system that protects all persistent data — memory, agent state, persistence database, configuration, and audit logs. Backups should run automatically on a schedule, at shutdown, and at startup (snapshot before changes), with user-configurable paths, retention policies, and restore capabilities.
Motivation
SynthOrg manages valuable state: agent memories, task history, security audit trails, budget records, and organizational knowledge. A crash, bad deployment, or corrupted database with no backup means data loss. Users need confidence that their synthetic organization's institutional knowledge is protected.
Design
What Gets Backed Up
| Data | Source | Format |
|---|---|---|
| Persistence DB | SQLite database (synthorg.db): tasks, audit log, agent state, budget records | File copy (SQLite online backup API for consistency) |
| Agent memory | Mem0 backend (Qdrant embedded + SQLite metadata) | Directory snapshot of memory dir |
| Organization memory | Shared org facts (SQLite-backed OrgFactStore) | Included in persistence DB |
| Company config | YAML company configuration | File copy |
| Checkpoints | Per-turn execution checkpoints | Directory snapshot |
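The "SQLite online backup API" mentioned for the persistence DB is exposed in Python's standard library as sqlite3.Connection.backup. A minimal sketch, assuming the backup code can open its own connection to synthorg.db; the function name backup_sqlite is hypothetical and not part of the existing codebase:

```python
import sqlite3
from pathlib import Path


def backup_sqlite(source_path: Path, dest_path: Path) -> None:
    """Copy a live SQLite database to dest_path using the online backup API."""
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies the database page by page and yields a
        # consistent copy even if other connections write to the source.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Unlike a plain file copy, this does not require the database to be idle while the copy runs.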
Backup Triggers
| Trigger | When | Behavior |
|---|---|---|
| Scheduled | Configurable interval (default: every 6 hours) | Background task, non-blocking |
| Pre-shutdown | Company.shutdown() / SIGTERM handler | Synchronous, must complete before exit |
| Post-startup | After config load, before accepting tasks | Snapshot current state as recovery point |
| Manual | POST /api/v1/admin/backup or CLI synthorg backup | On-demand, returns backup ID |
| Pre-migration | Before schema migrations run | Automatic, tagged as pre-migration |
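The scheduled trigger could be driven by a small asyncio loop. This is a sketch only: run_scheduled_backups and the do_backup callable are hypothetical names, and the interval would be schedule_hours * 3600 when derived from the configuration.

```python
import asyncio
from typing import Awaitable, Callable


async def run_scheduled_backups(
    do_backup: Callable[[str], Awaitable[None]],
    interval_seconds: float,
    stop: asyncio.Event,
) -> None:
    """Call do_backup("scheduled") once per interval until stop is set."""
    while not stop.is_set():
        try:
            # Wait one interval, but return early if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            # The interval elapsed without a stop request: run a backup.
            await do_backup("scheduled")
```

Waiting on the stop event rather than calling asyncio.sleep lets a shutdown handler end the loop promptly and then run the synchronous pre-shutdown backup.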
Configuration
backup:
  enabled: true
  path: "/data/backups"    # Where to store backups
  schedule_hours: 6        # Interval for scheduled backups
  retention:
    max_count: 10          # Maximum number of backups to keep
    max_age_days: 30       # Delete backups older than this
  on_shutdown: true        # Backup before shutdown
  on_startup: true         # Snapshot on startup
  compression: true        # gzip compress backup archives
  include:
    - persistence          # SQLite database
    - memory               # Agent + org memory
    - config               # Company YAML config
    - checkpoints          # Execution checkpoints
Backup Format
Each backup is a timestamped directory or compressed archive:
backups/
  2026-03-15T14-30-00_scheduled/
    manifest.json    # Metadata: timestamp, trigger, version, included components
    synthorg.db      # SQLite backup (via VACUUM INTO or backup API)
    memory/          # Memory directory snapshot
    config/          # Company config snapshot
    checkpoints/     # Checkpoint data
  2026-03-15T14-30-00_scheduled.tar.gz    # If compression enabled
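The naming and compression scheme above can be sketched with the standard library. The helper names backup_dir_name and compress_backup are hypothetical, and the use of UTC for the directory timestamp is an assumption based on the "Z" suffix in the manifest example.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_dir_name(trigger: str, now: datetime) -> str:
    """Build a name such as 2026-03-15T14-30-00_scheduled (UTC assumed)."""
    # Colons are replaced by dashes so the name is valid on all filesystems.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return f"{stamp}_{trigger}"


def compress_backup(backup_dir: Path) -> Path:
    """Pack a backup directory into a sibling <name>.tar.gz archive."""
    archive = backup_dir.parent / (backup_dir.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the backup name.
        tar.add(backup_dir, arcname=backup_dir.name)
    return archive
```

Pass a timezone-aware datetime such as datetime.now(timezone.utc); a naive value would be interpreted as local time by astimezone.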
Manifest
{
  "version": "1",
  "synthorg_version": "0.2.4",
  "timestamp": "2026-03-15T14:30:00Z",
  "trigger": "scheduled",
  "components": ["persistence", "memory", "config", "checkpoints"],
  "db_schema_version": 3,
  "size_bytes": 1048576,
  "checksum": "sha256:abc123..."
}
Restore
- POST /api/v1/admin/restore with backup ID: requires shutdown + restart
- CLI: synthorg restore <backup-id>: stops running containers, restores, restarts
- Restore validates manifest version compatibility before overwriting
- Pre-restore backup taken automatically (safety net)
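The compatibility check before overwriting might look like the sketch below. Several points are assumptions rather than part of this spec: the set of supported manifest versions, the rule that a backup with a newer db_schema_version than the running installation is rejected, and the idea that the checksum field covers a single file such as the archive. All names are hypothetical.

```python
import hashlib
from pathlib import Path

SUPPORTED_MANIFEST_VERSIONS = {"1"}  # assumption: only version "1" exists so far


class IncompatibleBackupError(Exception):
    """Raised when a backup must not be restored onto this installation."""


def validate_manifest(manifest: dict, current_schema_version: int) -> None:
    """Raise IncompatibleBackupError unless the manifest looks restorable."""
    version = manifest.get("version")
    if version not in SUPPORTED_MANIFEST_VERSIONS:
        raise IncompatibleBackupError(f"unsupported manifest version: {version!r}")
    schema = manifest.get("db_schema_version")
    # Assumption: older schemas can be migrated forward, newer ones cannot.
    if not isinstance(schema, int) or schema > current_schema_version:
        raise IncompatibleBackupError(f"incompatible db_schema_version: {schema!r}")


def verify_checksum(path: Path, expected: str) -> bool:
    """Check a file against a value of the form '<algorithm>:<hex digest>'."""
    algorithm, _, digest = expected.partition(":")
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest
```

Running both checks before the pre-restore safety backup and before any file is replaced keeps a rejected restore from touching current state.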
Retention
- After each backup, prune old backups exceeding max_count or max_age_days
- Never prune the most recent backup regardless of age
- Never prune backups tagged as pre-migration (kept until explicitly deleted)
- Log pruned backups at INFO level
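The retention rules above can be expressed as a pure selection function. This sketch uses hypothetical names (BackupInfo, select_prunable) and makes one assumption the rules leave open: pre-migration backups do not count toward max_count.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupInfo:
    backup_id: str
    timestamp: datetime
    trigger: str


def select_prunable(
    backups: list[BackupInfo],
    max_count: int,
    max_age_days: int,
    now: datetime,
) -> list[BackupInfo]:
    """Return the backups that retention would delete, newest first."""
    ordered = sorted(backups, key=lambda b: b.timestamp, reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    prunable: list[BackupInfo] = []
    kept = 0
    for index, backup in enumerate(ordered):
        if backup.trigger == "pre-migration":
            continue  # kept until explicitly deleted; not counted (assumption)
        if index == 0:
            kept += 1  # the most recent backup is never pruned
            continue
        if backup.timestamp < cutoff or kept >= max_count:
            prunable.append(backup)
        else:
            kept += 1
    return prunable
```

Keeping selection separate from deletion makes the policy easy to unit test; the caller would delete each returned backup and log it at INFO level.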
API Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/admin/backup | Trigger manual backup |
| GET | /api/v1/admin/backups | List available backups |
| GET | /api/v1/admin/backups/{id} | Get backup details |
| DELETE | /api/v1/admin/backups/{id} | Delete a specific backup |
| POST | /api/v1/admin/restore | Restore from backup (requires confirmation) |
CLI Commands
| Command | Description |
|---|---|
| synthorg backup | Trigger manual backup |
| synthorg backup list | List available backups |
| synthorg backup restore <id> | Restore from backup |
Implementation Notes
- SQLite backup: Use VACUUM INTO for a consistent point-in-time copy (avoids WAL complications). For Mem0's embedded Qdrant, snapshot the data directory while the backend is paused.
- Concurrency: Backups must not block task execution. Use a dedicated asyncio task with appropriate locking (pause writes briefly for DB consistency, then resume).
- Docker volume awareness: In Docker deployments, backups write to the synthorg-data volume. The CLI can mount a host path for backup extraction.
- Error handling: Backup failures log at ERROR but don't crash the runtime. Restore failures abort cleanly without corrupting current state.
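The SQLite, concurrency, and error-handling notes above could combine roughly as follows. This is a sketch under stated assumptions: writers are assumed to acquire the same asyncio.Lock before writing, the logger name and function names are hypothetical, and VACUUM INTO requires SQLite 3.27 or newer.

```python
import asyncio
import logging
import sqlite3
from pathlib import Path

logger = logging.getLogger("synthorg.backup")  # hypothetical logger name


def vacuum_into(source_path: Path, dest_path: Path) -> None:
    """Write a consistent, compacted copy of the database to dest_path."""
    con = sqlite3.connect(source_path)
    try:
        # VACUUM INTO fails if dest_path already exists and is not empty.
        con.execute("VACUUM INTO ?", (str(dest_path),))
    finally:
        con.close()


async def safe_backup(
    source_path: Path, dest_path: Path, write_lock: asyncio.Lock
) -> bool:
    """Run the copy off the event loop; log failures instead of raising."""
    try:
        # Holding the lock pauses cooperating writers for the copy only.
        async with write_lock:
            await asyncio.to_thread(vacuum_into, source_path, dest_path)
    except (sqlite3.Error, OSError):
        logger.exception("backup failed")  # logged at ERROR with traceback
        return False
    return True
```

Because the blocking copy runs in a worker thread, the event loop keeps serving other tasks while the backup is in progress.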
Affected Modules
- src/synthorg/persistence/: backup/restore methods on PersistenceBackend protocol
- src/synthorg/memory/: backup/restore on MemoryBackend protocol
- src/synthorg/config/: backup config schema
- src/synthorg/api/controllers/: admin backup endpoints
- src/synthorg/engine/: shutdown/startup hooks
- cli/cmd/: Go CLI backup subcommands
Dependencies
- feat: implement AgentStateRepository for runtime state persistence #261 — AgentStateRepository (needs to exist to be backed up)
- Existing: PersistenceBackend, MemoryBackend, Company lifecycle
Acceptance Criteria
- Scheduled backups run at configured interval
- Pre-shutdown backup completes before process exits
- Post-startup snapshot taken before accepting tasks
- Manual backup via API and CLI
- Configurable backup path, retention count, retention age
- Compression option (gzip)
- Manifest with version, timestamp, trigger, checksum
- Restore validates compatibility before overwriting
- Pre-restore safety backup taken automatically
- Retention pruning respects max_count, max_age_days, never prunes latest
- Backup failures don't crash the runtime
- Restore failures don't corrupt current state