fix(provider): purge keystore datastore after reset#11198
guillaumemichel merged 22 commits into `master`
Conversation
LGTM on the factory approach and tests.
One concern: the old keystore data at /provider/keystore/{0,1,active} inside the shared repo datastore is now orphaned.
Three ideas:

1. On first start with the new code, check if the shared repo datastore has keys under `/provider/keystore/` and, if so, purge them. This is a one-time migration that cleans up the old data automatically. Without this, people who reported increased disk usage will be stuck with the last queue, no?
2. Wire up `ipfs provide clear` (or similar) to do a hard reset: stop the provide system, purge both the old shared-datastore keys and the new `provider-keystore/` directory, and restart the provide schedule fresh. Useful for operators who want to reclaim disk space immediately without waiting for the next reprovide cycle.
3. IIUC `ipfs diag datastore` is blind to these new datastores; it would be good to have a means to inspect them, maybe by adding `--instance=default|provider0|1`.

(1) feels like a blocker for this PR; (2) and (3) could be follow-up issues if they are too involved to implement.
Thanks for the review @lidel! I implemented (1), so the orphaned keystore datastore prefix is now purged when migrating. IMO (2) doesn't really apply: because we don't have a better way to track when CIDs should stop being advertised (see #11150), we have to rely on the periodic keystore reset. The goal of (3) makes sense.
mount keystore datastores to `/provider/keystore/0` and `/1` so that they are included in the `ipfs diag datastore` command
`destroyDs` calls `os.RemoveAll` with a suffix from the upstream library. If the suffix were ever `".."` or empty, this could delete the wrong directories. Validate that the suffix is `"0"` or `"1"` in both `createDs` and `destroyDs`.
If opening datastore `"0"` succeeded but `"1"` failed, `MountKeystoreDatastores` returned an error without closing `"0"`, leaking the open handle.
Avoids allocating a datastore batch when no orphaned keys exist.
`findRootDatastoreSpec` silently returns wrapper specs it doesn't know about. If a plugin adds a wrapper with a `child` field, `openDatastoreAt` gets the wrapper instead of the leaf backend and fails confusingly. Log a warning so operators can spot the issue.
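The idea can be sketched like this. The `findLeafSpec` name and the `knownWrappers` set are hypothetical stand-ins; the real wrapper list lives in the fsrepo config code.

```go
package main

import (
	"fmt"
	"log"
)

// knownWrappers lists wrapper datastore types whose "child" field this
// code understands (hypothetical set for illustration).
var knownWrappers = map[string]bool{"measure": true, "log": true}

// findLeafSpec descends through known wrapper specs to the leaf backend
// spec. When it meets an unknown wrapper type that still has a "child"
// field (e.g. one added by a plugin), it returns the wrapper as before,
// but now logs a warning so operators can tell why opening the datastore
// later fails.
func findLeafSpec(spec map[string]interface{}) map[string]interface{} {
	for {
		t, _ := spec["type"].(string)
		child, ok := spec["child"].(map[string]interface{})
		if !ok {
			return spec // no child: this is the leaf backend spec
		}
		if !knownWrappers[t] {
			log.Printf("warning: unknown wrapper datastore type %q; using it as-is", t)
			return spec
		}
		spec = child
	}
}

func main() {
	spec := map[string]interface{}{
		"type": "measure",
		"child": map[string]interface{}{
			"type": "levelds",
			"path": "datastore",
		},
	}
	fmt.Println(findLeafSpec(spec)["type"]) // levelds
}
```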
- explain why `context.Background()` is used in the migration code
- add changelog note about the provide cycle restarting on upgrade
- add downgrade caveat about the orphaned `provider-keystore` directory
(Note to myself)
Reviewed up to this point: c22c16d 👍🏻
- chunk orphan purge into 4096-key batches to bound memory and match existing batching patterns in the same file
- cancel the purge context via `fx.Lifecycle` `OnStop` so SIGINT during startup does not block indefinitely
- deep-copy slices in `copySpec` (not just maps) so the function matches its documented "deep-copy" contract
- return `nil` from `findRootDatastoreSpec` when no `/` mount exists, so callers fall back to in-memory instead of passing a mount-type spec to `openDatastoreAt`
- rename local variable to avoid shadowing the `mount` package import
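The batching and deferred allocation can be sketched as follows. This is a self-contained sketch with a stand-in `batcher` interface; the real code uses the go-datastore `Batch` API and iterates query results rather than a key slice.

```go
package main

import "fmt"

// purgeBatchSize bounds how many deletes are buffered before a commit;
// 4096 is 4x a bufferedBatchSize of 1024 (per the PR notes).
const purgeBatchSize = 4096

// batcher is a stand-in for the subset of the go-datastore Batch API used here.
type batcher interface {
	Delete(key string) error
	Commit() error
}

// purgeKeys deletes keys in bounded batches. The batch is allocated
// lazily, so nothing is allocated when there are no orphaned keys.
func purgeKeys(keys []string, newBatch func() batcher) error {
	var b batcher
	pending := 0
	for _, k := range keys {
		if b == nil {
			b = newBatch() // deferred allocation: only when there is work
		}
		if err := b.Delete(k); err != nil {
			return err
		}
		pending++
		if pending == purgeBatchSize {
			if err := b.Commit(); err != nil {
				return err
			}
			b, pending = nil, 0
		}
	}
	if b != nil {
		return b.Commit() // flush the final partial batch
	}
	return nil
}

// countingBatch records calls, for demonstration.
type countingBatch struct{ deletes, commits *int }

func (c countingBatch) Delete(string) error { *c.deletes++; return nil }
func (c countingBatch) Commit() error       { *c.commits++; return nil }

func main() {
	deletes, commits := 0, 0
	_ = purgeKeys(make([]string, 10000), func() batcher { return countingBatch{&deletes, &commits} })
	fmt.Println(deletes, commits) // 10000 3: two full batches of 4096, one of 1808
}
```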
- add `ipfs diag datastore put` subcommand for writing arbitrary key-value pairs to the datastore (offline, experimental)
- add `DatastorePut` harness helper for CLI tests
- add `TestProviderKeystoreMigrationPurge`: seeds orphaned keystore keys via `put`, starts the daemon to trigger migration, verifies the orphaned keys are purged and the `provider-keystore/` dir is created
- add put/get roundtrip test for `diag datastore`
@guillaumemichel LGTM, manually tested and works as expected as well.
Pushed some final changes (high-level details below); lmk if any concerns, otherwise we can merge libp2p/go-libp2p-kad-dht#1233 and then switch this PR to go-libp2p-kad-dht@master.
Pushed changes
Hardening (against bit flips, or big repositories with huge old keystores):
- orphan purge: batched deletes (4096 keys), cancellable via SIGINT, deferred batch allocation
- I was unsure what a sane batch size is; 4096 is 4x `bufferedBatchSize`
- `destroyDs`: validate suffix is `"0"` or `"1"` before `os.RemoveAll`
- `findRootDatastoreSpec`: graceful fallback when no `/` mount exists, warn on unknown wrapper types
- `MountKeystoreDatastores`: close already-opened datastores on partial failure
- `copySpec`: deep-copy slices too, not just maps
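The `copySpec` fix boils down to recursing into `[]interface{}` values as well as maps. A sketch of the idea (`copyValue` is a hypothetical helper name, not the PR's code):

```go
package main

import "fmt"

// copyValue deep-copies the JSON-like map/slice structure of a datastore
// spec. Without the []interface{} case, a "deep copy" would still share
// slice backing arrays with the original, so mutating a mount list in the
// copy would corrupt the source spec.
func copyValue(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, e := range t {
			out[k] = copyValue(e)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, e := range t {
			out[i] = copyValue(e)
		}
		return out
	default:
		return t // scalars are immutable and safe to share
	}
}

func main() {
	orig := map[string]interface{}{
		"type":   "mount",
		"mounts": []interface{}{map[string]interface{}{"mountpoint": "/"}},
	}
	cp := copyValue(orig).(map[string]interface{})
	// Mutating the copy must not leak into the original.
	cp["mounts"].([]interface{})[0].(map[string]interface{})["mountpoint"] = "/blocks"
	fmt.Println(orig["mounts"].([]interface{})[0].(map[string]interface{})["mountpoint"]) // /
}
```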
Observability:
- log datastore open/remove with suffix and path in `createDs`/`destroyDs` callbacks
Testing:
- added `ipfs diag datastore put` (experimental, offline-only) to complement `get`/`count`
- added `TestProviderKeystoreMigrationPurge`: seeds orphaned keys, starts daemon, verifies purge
Docs:
- changelog entry tweaked to focus on what the problem was and how it was fixed
Manual test looks ok
Fresh node, SweepEnabled=true, Interval=30s, provider logging enabled. Pre-seeded 5 orphaned keys to simulate old Kubo data.
Observed across 3 reprovide cycles:
- orphaned keys purged on first start
- namespaces alternate correctly (0→1→0→1)
- only the active namespace directory exists on disk after each swap
- all orphaned keys gone after shutdown
Fixes #11096
Context
The `SweepingProvider`'s keystore periodically resets its contents (replacing all stored keys with the current set of CIDs to reprovide) because of the lack of a CID tracking database (#11150).

Previously, the keystore used `namespace.Wrap` to carve out a namespace inside the shared repo datastore. When a reset occurred, old keys were deleted one-by-one from that shared datastore. This meant the underlying storage engine (LevelDB, Pebble, etc.) could retain the deleted data as tombstones until a background compaction eventually reclaimed the space (if it ever did before shutdown). On large nodes, this caused persistent disk bloat from stale provider data that was logically deleted but physically still on disk.

Solution
Use the new `WithDatastoreFactory()` option introduced in libp2p/go-libp2p-kad-dht#1233. Instead of namespacing within the shared repo datastore, the keystore now creates physically separate datastores under `/provider-keystore/{0,1,meta}`. On each reset cycle, the retired datastore's directory is deleted with `os.RemoveAll()`. This guarantees immediate and complete disk reclamation.
The datastore type (LevelDB, Pebble, etc.) is determined from the repo's existing `Datastore.Spec` config via `fsrepo.AnyDatastoreConfig`, so it automatically matches whatever backend the user has configured.

Changes
- use `WithDatastoreFactory()` from feat(provider/keystore): alt datastore wipe libp2p/go-libp2p-kad-dht#1233 for the provider `ResettableKeystore`, enabling full disk reclamation on keystore reset