
[Colossus] Add pruning service #4845

Merged
mnaamani merged 6 commits into Joystream:colossus-beta from zeeshanakram3:colossus-add-pruning-service on Nov 17, 2023

Conversation

@zeeshanakram3 (Contributor) commented Aug 28, 2023

addresses #4813

This PR adds:

  • a cleanup service to prune outdated assets, plus a configuration option to enable/disable it on storage-node startup
  • a CLI command 'dev:cleanup' for performing pruning actions manually (ideally it should not be used in production)
  • an extended '/status' endpoint with additional info, e.g. the sync, cleanup/pruning, and serving bucket configurations
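To make the '/status' extension concrete, here is a hedged sketch of what the extended response could look like; all field names and values are illustrative assumptions based on the description above, not the actual Colossus API:

```typescript
// Illustrative shape of an extended '/status' response; field names are
// assumptions based on the PR description, not the real Colossus schema.
interface StatusResponse {
  version: string
  sync: { enabled: boolean; interval: number } // sync configuration (minutes)
  cleanup: { enabled: boolean; interval: number } // cleanup/pruning configuration (minutes)
  servingBuckets: string[] // bucket ids this node serves
}

const exampleStatus: StatusResponse = {
  version: '3.0.0',
  sync: { enabled: true, interval: 1 },
  cleanup: { enabled: true, interval: 360 },
  servingBuckets: ['0', '7'],
}
```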

How the pruning of assets works

The cleanupService runs the data objects cleanup/pruning workflow. It removes all locally stored data objects that the operator is no longer obliged to keep, either because the data object has been deleted from the runtime or because it has been moved to one or more other buckets.

PRECONDITIONS:

  • Since the cleanup uses the QueryNode to query data obligations, the QueryNode processor must not lag by more than MAXIMUM_QN_LAGGING_THRESHOLD blocks; otherwise the cleanup workflow is not performed, to avoid pruning assets based on outdated state.
  • If an asset being pruned from this storage node still exists in the runtime (i.e. its storage obligation has been moved), then at least "X" other storage nodes must hold the asset, where "X" is defined by MINIMUM_REPLICATION_THRESHOLD; otherwise the cleanup workflow is not performed.
  • If an asset being pruned from this storage node is currently being downloaded by some external actor, the cleanup action for that asset is postponed.
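The preconditions above can be sketched as a single decision function. This is a hedged illustration only: MAXIMUM_QN_LAGGING_THRESHOLD and MINIMUM_REPLICATION_THRESHOLD come from the PR description, but their values and everything else here are assumptions, not the actual implementation.

```typescript
// Illustrative decision flow for the pruning preconditions; the threshold
// values and the AssetState shape are assumptions, not the real PR code.
const MAXIMUM_QN_LAGGING_THRESHOLD = 100 // blocks (assumed value)
const MINIMUM_REPLICATION_THRESHOLD = 2 // other nodes holding the asset (assumed value)

interface AssetState {
  existsInRuntime: boolean // false if the data object was deleted from the runtime
  replicaCount: number // other storage nodes still holding the asset
  activeDownloads: number // external actors currently downloading it
}

function shouldPrune(qnLagBlocks: number, asset: AssetState): boolean {
  // Precondition 1: skip the whole run if the QueryNode state is stale
  if (qnLagBlocks > MAXIMUM_QN_LAGGING_THRESHOLD) return false
  // Precondition 2: a moved (still existing) asset needs enough replicas elsewhere
  if (asset.existsInRuntime && asset.replicaCount < MINIMUM_REPLICATION_THRESHOLD) return false
  // Precondition 3: postpone while someone is downloading the asset
  if (asset.activeDownloads > 0) return false
  return true
}
```

Note that the first check is really a per-run condition (the whole workflow is skipped), while the other two are per-asset; the sketch collapses them into one function for brevity.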

@zeeshanakram3 zeeshanakram3 requested a review from mnaamani August 28, 2023 06:38
@mnaamani (Member) left a comment:

Overall this looks good and will be quite valuable. Left some feedback. Also needs an update from master to fix some merge conflicts.

default: 1,
}),
cleanup: flags.boolean({
char: 's',
@mnaamani (Member):

char 's' is already assigned to sync. Better to remove it, as I don't know what the behavior is if a short flag is used and multiple flags are assigned the same char.

@zeeshanakram3 (Contributor, Author):

Done, changed the short flag to c. WDYT?
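As an aside, a small startup guard could catch short-flag collisions like this one automatically. The sketch below is illustrative only: FlagDef is a simplified stand-in, not the actual oclif flag types used by the storage-node CLI.

```typescript
// Sketch of a guard that detects duplicate short flags, like the 's'
// collision between --sync and --cleanup noted in the review.
// FlagDef is a simplified illustrative shape, not the real oclif types.
interface FlagDef {
  name: string
  char?: string
}

function findShortFlagCollisions(defs: FlagDef[]): string[] {
  const seen = new Map<string, string>()
  const collisions: string[] = []
  for (const { name, char } of defs) {
    if (!char) continue
    const prev = seen.get(char)
    if (prev !== undefined) collisions.push(`-${char} is used by both --${prev} and --${name}`)
    else seen.set(char, name)
  }
  return collisions
}
```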

cleanupInterval: flags.integer({
char: 'i',
description: 'Interval between periodic cleanup actions (in minutes)',
default: 1,
@mnaamani (Member):

I would say this should be much higher, like 1h, maybe even longer? Depends on how much work/time each cleanup run takes.

@zeeshanakram3 (Contributor, Author):

Yeah, ideally it should be higher than the sync interval; I had set it to 1 min just to test the pruning behavior locally. Now I have set it to 6 hrs.
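A minimal sketch of how the interval flag (in minutes) could drive a periodic run loop; the 360 default matches the 6-hour value mentioned above, but runCleanup and the scheduling shape are assumptions, not the actual code:

```typescript
// Illustrative periodic cleanup loop; 360 minutes matches the 6-hour default
// discussed above, everything else is an assumption for illustration.
const DEFAULT_CLEANUP_INTERVAL_MINUTES = 360 // 6 hours

function minutesToMs(minutes: number): number {
  return minutes * 60 * 1000
}

async function startCleanupLoop(
  runCleanup: () => Promise<void>,
  intervalMinutes: number = DEFAULT_CLEANUP_INTERVAL_MINUTES
): Promise<never> {
  for (;;) {
    await runCleanup()
    // Wait the full interval after each run completes, so slow runs never overlap
    await new Promise((resolve) => setTimeout(resolve, minutesToMs(intervalMinutes)))
  }
}
```

Waiting after each run (rather than on a fixed timer) is one simple way to guarantee that a long cleanup pass cannot overlap the next one.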

): Promise<{ dataObjectId: DataObjectId; pinnedCount: DataObjectPinCount } | undefined> {
if (idCache.has(dataObjectId)) {
await lock.acquireAsync()
const id = { dataObjectId, pinnedCount: idCache.get(dataObjectId) as DataObjectPinCount }
@mnaamani (Member):

The return type of the function doesn't seem to match id: dataObjectId is a string, whereas in the return type it is DataObjectId (u64).

@zeeshanakram3 (Contributor, Author):

The DataObjectId is also a string, as you can see

@mnaamani (Member):

Ah yes, I didn't spot that type definition.


const timeoutMs = 60 * 1000 // 1 minute since it's only a HEAD request
const deletionTasks: DeleteLocalFileTask[] = []
await Promise.all(
@mnaamani (Member):

Suggested change:
- await Promise.all(
+ await Promise.allSettled(

To avoid the whole promise being rejected when a HEAD request to one data object fails, Promise.allSettled() is more appropriate.

Also, I can imagine that if the number of data objects is very large, doing a massive number of HTTP HEAD requests in parallel might be a bad idea. Perhaps we should do it in batches.

@zeeshanakram3 (Contributor, Author):

Done, but I had to change the target option in tsconfig.json to es2020, as Promise.allSettled does not seem to be available in es2017.
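The batching idea from the review can be sketched as a small generic helper. This is a hedged illustration, not the PR's actual code; the helper name and batch size are assumptions:

```typescript
// Sketch of batched Promise.allSettled processing, as suggested in the review:
// one failed task (e.g. a HEAD request) does not abort the run, and at most
// `batchSize` tasks are in flight at once. Names here are illustrative.
async function inBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = []
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize)
    // allSettled never rejects, so each batch always completes fully
    results.push(...(await Promise.allSettled(batch.map(task))))
  }
  return results
}
```

Note that PromiseSettledResult and Promise.allSettled are exactly the es2020 additions that motivated the tsconfig target change mentioned above.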

@mnaamani (Member) left a comment:

One final small change and this should be ready to test in production :)
And an update from master to resolve the merge conflict.

dataObjectId: string
): Promise<{ dataObjectId: DataObjectId; pinnedCount: DataObjectPinCount } | undefined> {
if (idCache.has(dataObjectId)) {
await lock.acquireAsync()
@mnaamani (Member):

Shouldn't the lock be acquired before checking idCache.has()?

@zeeshanakram3 (Contributor, Author):

Addressed. However, placing the lock before the if condition can lead to unnecessary blocking when the dataObjectId is not present in the cache. So I reimplemented it to first check the cache without acquiring the lock; this is a quick operation, and if the dataObjectId is not present, we just return without the overhead of acquiring the lock.
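The check-then-lock pattern described above can be sketched as follows. SimpleLock and the surrounding names are illustrative stand-ins for the actual lock and cache in the PR, which exposes a lock.acquireAsync() call:

```typescript
// Sketch of the check-then-lock pattern: a cheap unlocked membership test
// first, then the lock only when the id is actually cached. SimpleLock is an
// illustrative FIFO mutex, not the lock implementation used in the PR.
class SimpleLock {
  private tail: Promise<void> = Promise.resolve()
  async acquireAsync(): Promise<() => void> {
    let release!: () => void
    const prev = this.tail
    this.tail = new Promise<void>((resolve) => (release = resolve))
    await prev
    return release
  }
}

const idCache = new Map<string, number>() // dataObjectId -> pinned count
const lock = new SimpleLock()

async function getPinnedEntry(
  dataObjectId: string
): Promise<{ dataObjectId: string; pinnedCount: number } | undefined> {
  // Fast path: skip the lock entirely when the id is not cached
  if (!idCache.has(dataObjectId)) return undefined
  const release = await lock.acquireAsync()
  try {
    // Re-read under the lock: the entry may have been evicted meanwhile
    const pinnedCount = idCache.get(dataObjectId)
    if (pinnedCount === undefined) return undefined
    return { dataObjectId, pinnedCount }
  } finally {
    release()
  }
}
```

The re-read under the lock matters: the unlocked has() check is only a hint, so the locked section must tolerate the entry having disappeared in between.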

@zeeshanakram3 zeeshanakram3 force-pushed the colossus-add-pruning-service branch from 5e4be6e to f692d15 on November 14, 2023 17:33
@mnaamani mnaamani changed the base branch from master to colossus-beta November 17, 2023 05:52
@mnaamani mnaamani self-requested a review November 17, 2023 05:53
@mnaamani (Member) left a comment:

I will merge this work into a new branch colossus-beta where we can do final tweaks for a "beta" release.

@mnaamani mnaamani merged commit c9ca6f9 into Joystream:colossus-beta Nov 17, 2023