fix(kvstore): Prevent data loss from crash during DB format conversion#838
fix(kvstore): Prevent data loss from crash during DB format conversion#838mattisonchao merged 9 commits intomainfrom
Conversation
The database format conversion process was not atomic. If the server crashed in the middle of the conversion, the database could be left in a corrupted state, leading to data loss. This commit makes the conversion process crash-proof by: 1. Backing up the original database before starting the conversion. 2. If a crash occurs, restoring the backup on the next startup. 3. Deleting the backup only after the conversion is successfully completed. A trap mechanism was also added to allow for simulating crashes during tests to verify the recovery process.
There was a problem hiding this comment.
Pull request overview
This PR implements crash-proof database format conversion by adding backup/restore logic to prevent data loss when the server crashes mid-conversion. The conversion process now backs up the original database, performs the conversion, and only deletes the backup after successful completion. A trap mechanism with build tags enables testing crash scenarios without runtime overhead in production builds.
Key Changes
- Added atomic backup/restore mechanism for database conversion with recovery on startup
- Implemented build-tag controlled trap mechanism for simulating crashes in tests
- Updated build configuration to disable trap mechanism in release builds
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| oxiad/dataserver/database/kvstore/kv_trap.go | Implements trap mechanism for testing crash scenarios (enabled by default) |
| oxiad/dataserver/database/kvstore/kv_trap_ignore.go | No-op trap implementation for production builds (with disable_trap tag) |
| oxiad/dataserver/database/kvstore/kv_pebble_formats.go | Adds backup/restore logic and recovery checks during database conversion |
| oxiad/dataserver/database/kvstore/kv_pebble.go | Integrates trap mechanism into KV creation flow |
| oxiad/dataserver/database/kvstore/kv.go | Adds NewKVWithTrap method to Factory interface for test injection |
| oxiad/dataserver/database/kvstore/kv_pebble_formats_test.go | Adds tests verifying crash recovery and backup cleanup |
| Makefile | Updates build targets to include disable_trap tag for release builds |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@mattisonchao I've opened a new pull request, #839, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
Hi @copilot could you help resolve outdated comments? |
|
@mattisonchao I've opened a new pull request, #840, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
Comments suppressed due to low confidence (1)
oxiad/dataserver/database/kvstore/kv_pebble_formats.go:167
- The checkConvertDB method lacks documentation explaining its critical role in crash recovery. Consider adding a comment describing the recovery logic and the various scenarios it handles, for example: "checkConvertDB checks if database conversion is needed and handles recovery from crashes during previous conversions. It detects incomplete conversions by checking for backup databases and performs appropriate recovery or cleanup."
func (p *pebbleDbConversion) checkConvertDB(desiredEncoding compare.Encoder) error {
dbBackPath := makeDbBackupPath(p.dbPath)
if !pathExists(p.dbPath) {
// No db, check if we need to recover from backup
if !pathExists(dbBackPath) {
// no backup, no need to convert DB
return nil
}
p.log.Info("Database backup found without primary database, indicating crash during conversion")
// recover backup and keep going
if err := os.Rename(dbBackPath, p.dbPath); err != nil {
return err
}
}
// DB already exists
// Check if we need to clean up the backup, as previous backup cleanup may have failed.
if pathExists(dbBackPath) {
p.log.Info("Database backup found alongside primary database, indicating incomplete cleanup after conversion")
if err := os.RemoveAll(dbBackPath); err != nil {
return err
}
}
var keyEncodingMarker string
if markerData, err := os.ReadFile(filepath.Join(p.dbPath, markerFileName)); err != nil {
if !os.IsNotExist(err) {
return err
}
// Older versions were not setting the marker
keyEncodingMarker = keyEncodingFormatOldCompareHierarchical
} else {
keyEncodingMarker = string(markerData)
}
if keyEncodingMarker == desiredEncoding.Name() {
// Format is already correct, nothing to do
return nil
}
switch keyEncodingMarker {
case keyEncodingFormatOldCompareHierarchical:
confOld := p.configForOldCompareHierarchical()
confNew := p.configForNewerFormat()
return p.convertDb(
confOld, compare.EncoderNatural,
confNew, desiredEncoding)
case compare.EncoderNatural.Name(),
compare.EncoderHierarchical.Name():
confOld := p.configForNewerFormat()
confNew := p.configForNewerFormat()
oldEncoder, err := compare.GetEncoder(keyEncodingMarker)
if err != nil {
return err
}
return p.convertDb(
confOld, oldEncoder,
confNew, desiredEncoding)
default:
p.log.Warn("Found unknown encoding type. No conversion performed",
slog.String("keyEncodingMarker", keyEncodingMarker))
return nil
}
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fixes: #846
The database format conversion process was not atomic. If the server crashed in the middle of the conversion, the database could be left in a corrupted state, leading to data loss.
This commit makes the conversion process crash-proof by:
A trap mechanism was also added to allow for simulating crashes during tests to verify the recovery process. Additionally, the trap mechanism incurs zero cost in the release mode. That code will automatically be removed by compiler optimisation.
The assembly code comparison is as follows: