
fix(provider): purge keystore datastore after reset #11198

Merged

guillaumemichel merged 22 commits into master from fix/keystore-datastore-compaction
Mar 19, 2026

Conversation

@guillaumemichel
Contributor

Fixes #11096

Context

The SweepingProvider's keystore periodically resets its contents (replacing all stored keys with the current set of CIDs to reprovide) because there is no CID-tracking database (#11150).

Previously, the keystore used namespace.Wrap to carve out a namespace inside the shared repo datastore. When a reset occurred, old keys were deleted one-by-one from that shared datastore. This meant the underlying storage engine (LevelDB, Pebble, etc.) could retain the deleted data as tombstones until a background compaction eventually reclaimed the space, which might never happen before shutdown. On large nodes, this caused persistent disk bloat from stale provider data that was logically deleted but physically still on disk.

Solution

Use the new WithDatastoreFactory() option introduced in libp2p/go-libp2p-kad-dht#1233. Instead of namespacing within the shared repo datastore, the keystore now creates physically separate datastores under /provider-keystore/{0,1,meta}. On each reset cycle:

  1. A new datastore is created for the alternate namespace
  2. New keys are written to it
  3. Namespaces are atomically swapped
  4. The old datastore is destroyed via os.RemoveAll()

This guarantees immediate and complete disk reclamation.

The datastore type (LevelDB, Pebble, etc.) is determined from the repo's existing Datastore.Spec config via fsrepo.AnyDatastoreConfig, so it automatically matches whatever backend the user has configured.
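A minimal sketch of what resolving the leaf backend from a mount-style Datastore.Spec could look like. The function name echoes the PR's findRootDatastoreSpec, but the map-walking logic here is a simplified assumption, not Kubo's implementation:

```go
package main

import "fmt"

// findRootSpec walks a decoded Datastore.Spec (JSON-style map) to find the
// leaf backend mounted at "/", unwrapping wrapper specs that hold a "child".
func findRootSpec(spec map[string]any) map[string]any {
	if spec["type"] == "mount" {
		mounts, _ := spec["mounts"].([]any)
		for _, m := range mounts {
			mm, _ := m.(map[string]any)
			if mm["mountpoint"] == "/" {
				return findRootSpec(mm)
			}
		}
		return nil // no "/" mount: caller falls back to an in-memory datastore
	}
	if child, ok := spec["child"].(map[string]any); ok {
		return findRootSpec(child) // unwrap measure/log-style wrappers
	}
	return spec // leaf backend spec, e.g. levelds or flatfs
}

func main() {
	// A typical default-profile shape: a mount at "/" wrapping a leveldb leaf.
	spec := map[string]any{
		"type": "mount",
		"mounts": []any{
			map[string]any{
				"mountpoint": "/",
				"type":       "measure",
				"child":      map[string]any{"type": "levelds", "path": "datastore"},
			},
		},
	}
	fmt.Println(findRootSpec(spec)["type"]) // prints "levelds"
}
```

The resolved leaf spec is what would then be handed to fsrepo.AnyDatastoreConfig so the keystore's datastores use the same backend as the repo.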

Changes

Member

@lidel lidel left a comment


LGTM on the factory approach and tests.

One concern: the old keystore data at /provider/keystore/{0,1,active} inside the shared repo datastore is now orphaned.

Three ideas:

  1. On first start with the new code, check whether the shared repo datastore has keys under /provider/keystore/ and, if so, purge them. This is a one-time migration that cleans up the old data automatically. Without this, people who reported increased disk usage will be stuck with the last queue, no?

  2. Wire up ipfs provide clear (or similar) to do a hard reset: stop the provide system, purge both the old shared-datastore keys and the new provider-keystore/ directory, and restart the provide schedule fresh. Useful for operators who want to reclaim disk space immediately without waiting for the next reprovide cycle.

  3. iiuc `ipfs diag datastore` is blind to these new datastores; it would be good to have a means to inspect them, maybe by adding --instance=default|provider0|1

(1) feels like a blocker for this PR; (2) and (3) could be follow-up issues if they are too involved to implement
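The one-time purge in idea (1) can be sketched roughly as follows, using a plain map as a stand-in for the repo datastore. The 4096 batch size matches the number later settled on in this PR; everything else (names, signature) is illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// purgeOrphanedKeys deletes every key under prefix in bounded batches, so a
// huge orphaned keystore does not require holding one giant batch in memory.
// It returns the number of keys removed.
func purgeOrphanedKeys(store map[string][]byte, prefix string, batchSize int) int {
	var doomed []string
	for k := range store {
		if strings.HasPrefix(k, prefix) {
			doomed = append(doomed, k)
		}
	}
	if len(doomed) == 0 {
		return 0 // nothing orphaned: skip allocating a batch entirely
	}
	sort.Strings(doomed)
	purged := 0
	for start := 0; start < len(doomed); start += batchSize {
		end := start + batchSize
		if end > len(doomed) {
			end = len(doomed)
		}
		for _, k := range doomed[start:end] { // one "batch commit" per chunk
			delete(store, k)
		}
		purged += end - start
	}
	return purged
}

func main() {
	store := map[string][]byte{
		"/provider/keystore/0/cid-a": nil,
		"/provider/keystore/1/cid-b": nil,
		"/blocks/some-block":         nil,
	}
	n := purgeOrphanedKeys(store, "/provider/keystore/", 4096)
	fmt.Println(n, len(store)) // prints "2 1": two orphans purged, one unrelated key kept
}
```

In the real migration the lookup would be a prefix query against the shared repo datastore and the deletes would go through its batching API, but the shape of the cleanup is the same.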

@guillaumemichel
Contributor Author

guillaumemichel commented Feb 17, 2026

Thanks for the review @lidel !


I implemented (1), so the orphaned keystore datastore prefix is now purged when migrating.


IMO (2) doesn't really apply. Because we don't have a better way to track when CIDs should stop being advertised (see #11150), we have to rely on the KeyChanFunc to reset all CIDs that should be advertised by the provider. This means keys are only removed from the Keystore during a Reset() operation, which consists of creating a new datastore using the factory (e.g. /0) and removing the old active datastore after the swap (e.g. /1). So the datastore is purged whenever keys are removed.

Also the goal of the ipfs provide clear command was to clear the provide queue (CIDs that have never been provided yet), and not CIDs that are scheduled for reprovide (keystore).


(3) makes sense. It was originally slated for a follow-up PR, and is now implemented in d8e7d78 👍🏻

mount keystore datastores to /provider/keystore/0 and /1 so that they
are included in the ipfs diag datastore command
@gammazero gammazero added the need/maintainers-input Needs input from the current maintainer(s) label Mar 3, 2026
@guillaumemichel guillaumemichel mentioned this pull request Mar 5, 2026
22 tasks
lidel added 7 commits March 18, 2026 02:27
…etions

destroyDs calls os.RemoveAll with a suffix from the upstream library.
If suffix were ever ".." or empty, this could delete wrong directories.
Validate that suffix is "0" or "1" in both createDs and destroyDs.
If opening datastore "0" succeeds but "1" fails,
MountKeystoreDatastores returned an error without closing "0".
Avoids allocating a datastore batch when no orphaned keys exist.
findRootDatastoreSpec silently returns wrapper specs it doesn't know
about. If a plugin adds a wrapper with a "child" field, openDatastoreAt
gets the wrapper instead of the leaf backend and fails confusingly.
Log a warning so operators can spot the issue.
- explain why context.Background() is used in the migration code
- add changelog note about the provide cycle restarting on upgrade
- add downgrade caveat about orphaned provider-keystore directory
Contributor Author

@guillaumemichel guillaumemichel left a comment


(Note to myself)

Reviewed up to this point: c22c16d 👍🏻

lidel added 4 commits March 18, 2026 19:30
- chunk orphan purge into 4096-key batches to bound memory and
  match existing batching patterns in the same file
- cancel the purge context via fx.Lifecycle OnStop so SIGINT
  during startup does not block indefinitely
- deep-copy slices in copySpec (not just maps) so the function
  matches its documented "deep-copy" contract
- return nil from findRootDatastoreSpec when no "/" mount exists,
  so callers fall back to in-memory instead of passing a mount-type
  spec to openDatastoreAt
- rename local variable to avoid shadowing the mount package import
- add `ipfs diag datastore put` subcommand for writing arbitrary
  key-value pairs to the datastore (offline, experimental)
- add DatastorePut harness helper for CLI tests
- add TestProviderKeystoreMigrationPurge: seeds orphaned keystore
  keys via `put`, starts the daemon to trigger migration, verifies
  the orphaned keys are purged and provider-keystore/ dir is created
- add put/get roundtrip test for diag datastore
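The copySpec fix in the list above (deep-copying slices, not just maps) can be sketched like this; a simplified illustration of the idea, not Kubo's exact code:

```go
package main

import "fmt"

// copySpec deep-copies a decoded JSON spec, recursing into both maps and
// slices so that mutating the copy can never reach back into the original.
// Copying only maps would leave slice elements shared between both specs.
func copySpec(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, vv := range t {
			out[k] = copySpec(vv)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, vv := range t {
			out[i] = copySpec(vv)
		}
		return out
	default:
		return v // scalars (string, float64, bool, nil) are immutable
	}
}

func main() {
	orig := map[string]any{
		"type":   "mount",
		"mounts": []any{map[string]any{"mountpoint": "/"}},
	}
	cp := copySpec(orig).(map[string]any)
	// Mutate a map nested inside a slice of the copy...
	cp["mounts"].([]any)[0].(map[string]any)["mountpoint"] = "/blocks"
	// ...and the original is untouched.
	fmt.Println(orig["mounts"].([]any)[0].(map[string]any)["mountpoint"]) // prints "/"
}
```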
@lidel lidel force-pushed the fix/keystore-datastore-compaction branch from 6c42463 to 010e0c8 Compare March 18, 2026 20:18
@lidel lidel force-pushed the fix/keystore-datastore-compaction branch from 11f8074 to a16ddc0 Compare March 18, 2026 20:40
Member

@lidel lidel left a comment


@guillaumemichel lgtm, manually tested and works as expected as well.

pushed some final changes (high-level details below); lmk if any concerns, otherwise we can merge libp2p/go-libp2p-kad-dht#1233 then switch this PR to go-libp2p-kad-dht@master

Pushed changes

Hardening: guard against bitflips and big repositories with huge old keystores

  • orphan purge: batched deletes (4096 keys), cancellable via SIGINT, deferred batch allocation
    • i was unsure what a sane batch size is; 4096 is 4x bufferedBatchSize
  • destroyDs: validate suffix is "0" or "1" before os.RemoveAll
  • findRootDatastoreSpec: graceful fallback when no / mount exists, warn on unknown wrapper types
  • MountKeystoreDatastores: close already-opened datastores on partial failure
  • copySpec: deep-copy slices too, not just maps
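The destroyDs hardening above can be sketched as a guard in front of os.RemoveAll. The function names follow the PR description, but the bodies here are a simplified assumption:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// validateSuffix ensures only the two known keystore namespaces can ever be
// destroyed. Without this check, a corrupted suffix such as ".." or "" could
// make os.RemoveAll delete the wrong directory.
func validateSuffix(suffix string) error {
	if suffix != "0" && suffix != "1" {
		return fmt.Errorf("refusing to destroy datastore: invalid suffix %q", suffix)
	}
	return nil
}

// destroyDs removes one keystore namespace directory, validating the suffix
// before touching the filesystem.
func destroyDs(root, suffix string) error {
	if err := validateSuffix(suffix); err != nil {
		return err
	}
	return os.RemoveAll(filepath.Join(root, suffix))
}

func main() {
	// A traversal-style suffix is rejected before anything is deleted.
	fmt.Println(destroyDs("/tmp/provider-keystore", ".."))
	// The two legitimate namespaces pass validation.
	fmt.Println(validateSuffix("0"), validateSuffix("1"))
}
```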

Observability:

  • log datastore open/remove with suffix and path in createDs/destroyDs callbacks

Testing:

  • added ipfs diag datastore put (experimental, offline-only) to complement get/count
  • added TestProviderKeystoreMigrationPurge: seeds orphaned keys, starts daemon, verifies purge

Docs:

  • changelog entry tweaked to focus on what the problem was and how it was fixed

Manual test looks ok

Fresh node, SweepEnabled=true, Interval=30s, provider logging enabled. Pre-seeded 5 orphaned keys to simulate old Kubo data.

Observed across 3 reprovide cycles:

  • orphaned keys purged on first start
  • namespaces alternate correctly (0→1→0→1)
  • only the active namespace directory exists on disk after each swap
  • all orphaned keys gone after shutdown

@guillaumemichel guillaumemichel force-pushed the fix/keystore-datastore-compaction branch from 3a39a96 to 72f4a5f Compare March 19, 2026 10:15
@guillaumemichel guillaumemichel merged commit 99c092b into master Mar 19, 2026
21 checks passed
@guillaumemichel guillaumemichel deleted the fix/keystore-datastore-compaction branch March 19, 2026 10:27

Labels

need/maintainers-input Needs input from the current maintainer(s)


Development

Successfully merging this pull request may close these issues.

datastore grows in size; StorageMax not respected since 0.39.0

3 participants