[nexus] Instance Deletion is now a saga#2060
Conversation
| async fn sid_delete_instance_record( | ||
| sagactx: NexusActionContext, | ||
| ) -> Result<(), ActionError> { | ||
| let osagactx = sagactx.user_data(); | ||
| let params = sagactx.saga_params::<Params>()?; | ||
| let opctx = OpContext::for_saga_action(&sagactx, ¶ms.serialized_authn); | ||
| osagactx | ||
| .datastore() | ||
| .project_delete_instance(&opctx, ¶ms.authz_instance) | ||
| .await | ||
| .map_err(ActionError::action_failed)?; | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
It looks like part of what project_delete_instance is doing is going through an detaching disks. I guess that's really just updating the db record for the disks to unassociate them with the instance (via Instance::detach_resource). Looking in the detach disk call it's using the same Instance::detach_resource. I was wondering if that action, detaching disks, should be hoisted to the saga. Given that it's basically using the same thing under the hood I suppose not?
There was a problem hiding this comment.
I do not think it should be.
When I wrote DatastoreDetachManyTarget::detach_resources - which Instance objects implement, so they could detach disks - the whole point of it was that the detach operation (on disks) could be issued as part of the same statement that deletes instances, within a common table expression being issued to CRDB.
By existing within a single CTE, the statement is transactional, even though it modifies multiple tables and only sends a single request to CRDB. This follows the advice of https://www.cockroachlabs.com/docs/v22.2/performance-best-practices-overview#reduce-transaction-contention
Using CTEs, where possible, is generally our "best performance" way to combine multiple SQL statements. My usage of sagas elsewhere is a worse-performance attempt to get similar properties, in a non-atomic way.
There was a problem hiding this comment.
Related: Ideally, each function exposed from the datastore() would only issue a single DB statement, so that it either "fully completes" or "does nothing at all".
project_delete_instance, by virtue of being a CTE, already abides by this principle.
| declare_saga_actions! { | ||
| instance_delete; | ||
| INSTANCE_DELETE_RECORD -> "no_result1" { | ||
| + sid_delete_instance_record | ||
| } | ||
| DELETE_NETWORK_INTERFACES -> "no_result2" { | ||
| + sid_delete_network_interfaces | ||
| } | ||
| DEALLOCATE_EXTERNAL_IP -> "no_result3" { | ||
| + sid_deallocate_external_ip | ||
| } | ||
| } |
There was a problem hiding this comment.
Should there be any rollback cases represented here? What happens if a step fails?
There was a problem hiding this comment.
There could be; I'm not 100% sure there should be? If we're able to delete the instance record, I think we should be "best-effort" rolling forward deletion, rather than attempting to undo the deletion.
For example, suppose we cannot re-allocate the deleted resource: either the resource itself was exhausted, user-data was deleted, or a sled cooperating in the allocation is no longer around. What should we do?
By having deletion be a saga, it's at least durable - even if Nexus crashes, it'll finish the work when it comes back online - but I'm not sure about the value of "making deletion undoable" vs "making sure it rolls forward, whatever that means".
TL;DR: Maybe? But there is a lot of complexity down the road of undoing deletions that I'm not really sure how to handle yet.
just-be-dev
left a comment
There was a problem hiding this comment.
Overall this looks good. The new saga macro really helps make this easier to follow. Had a question about rollbacks and somewhat of a question about disk detachment but otherwise I'm 👍
davepacheco
left a comment
There was a problem hiding this comment.
Thanks for fixing this.
Fixes #2034