Skip to content

[nexus] Instance Deletion is now a saga#2060

Merged
smklein merged 10 commits into
mainfrom
more-sagas
Dec 22, 2022
Merged

[nexus] Instance Deletion is now a saga#2060
smklein merged 10 commits into
mainfrom
more-sagas

Conversation

@smklein

@smklein smklein commented Dec 16, 2022

Copy link
Copy Markdown
Collaborator

Fixes #2034

@smklein smklein mentioned this pull request Dec 16, 2022
16 tasks
Base automatically changed from saga-macro to main December 20, 2022 15:37
Comment thread nexus/src/app/sagas/instance_delete.rs
Comment on lines +63 to +75
async fn sid_delete_instance_record(
sagactx: NexusActionContext,
) -> Result<(), ActionError> {
let osagactx = sagactx.user_data();
let params = sagactx.saga_params::<Params>()?;
let opctx = OpContext::for_saga_action(&sagactx, &params.serialized_authn);
osagactx
.datastore()
.project_delete_instance(&opctx, &params.authz_instance)
.await
.map_err(ActionError::action_failed)?;
Ok(())
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like part of what project_delete_instance is doing is going through an detaching disks. I guess that's really just updating the db record for the disks to unassociate them with the instance (via Instance::detach_resource). Looking in the detach disk call it's using the same Instance::detach_resource. I was wondering if that action, detaching disks, should be hoisted to the saga. Given that it's basically using the same thing under the hood I suppose not?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it should be.

When I wrote DatastoreDetachManyTarget::detach_resources - which Instance objects implement, so they could detach disks - the whole point of it was that the detach operation (on disks) could be issued as part of the same statement that deletes instances, within a common table expression being issued to CRDB.

By existing within a single CTE, the statement is transactional, even though it modifies multiple tables and only sends a single request to CRDB. This follows the advice of https://www.cockroachlabs.com/docs/v22.2/performance-best-practices-overview#reduce-transaction-contention

Using CTEs, where possible, is generally our "best performance" way to combine multiple SQL statements. My usage of sagas elsewhere is a worse-performance attempt to get similar properties, in a non-atomic way.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: Ideally, each function exposed from the datastore() would only issue a single DB statement, so that it either "fully completes" or "does nothing at all".

project_delete_instance, by virtue of being a CTE, already abides by this principle.

Comment on lines +25 to +36
declare_saga_actions! {
instance_delete;
INSTANCE_DELETE_RECORD -> "no_result1" {
+ sid_delete_instance_record
}
DELETE_NETWORK_INTERFACES -> "no_result2" {
+ sid_delete_network_interfaces
}
DEALLOCATE_EXTERNAL_IP -> "no_result3" {
+ sid_deallocate_external_ip
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be any rollback cases represented here? What happens if a step fails?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be; I'm not 100% sure there should be? If we're able to delete the instance record, I think we should be "best-effort" rolling forward deletion, rather than attempting to undo the deletion.

For example, suppose we cannot re-allocate the deleted resource: either the resource itself was exhausted, user-data was deleted, or a sled cooperating in the allocation is no longer around. What should we do?

By having deletion be a saga, it's at least durable - even if Nexus crashes, it'll finish the work when it comes back online - but I'm not sure about the value of "making deletion undoable" vs "making sure it rolls forward, whatever that means".

TL;DR: Maybe? But there is a lot of complexity down the road of undoing deletions that I'm not really sure how to handle yet.

@just-be-dev just-be-dev left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good. The new saga macro really helps make this easier to follow. Had a question about rollbacks and somewhat of a question about disk detachment but otherwise I'm 👍

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this.

Comment thread nexus/authz-macros/src/lib.rs
Comment thread nexus/db-model/src/update_artifact.rs
Comment thread nexus/src/authz/api_resources.rs
Comment thread nexus/src/app/sagas/instance_delete.rs
@smklein smklein mentioned this pull request Dec 22, 2022
14 tasks
@smklein smklein merged commit b5c0d3f into main Dec 22, 2022
@smklein smklein deleted the more-sagas branch December 22, 2022 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[nexus] Instance deletion is not a saga, but it should be

3 participants