Summary
There appears to be an extremely subtle race between the instance start saga, its unwind path, and the instance cleanup logic in sled agent that can cause an instance's virtual provisioning counters to underflow. It goes something like this:
- Instance gets created; the instance record has generation 2
- Task T1 begins start saga A1
- A1 sets the VMM ID V1 in the instance, advancing its record to generation 3
- A1 registers the V1 with sled S1, telling the sled the instance is at generation 3
- A1 allocates virtual provisioning counters for the now-starting instance
- A1 calls S1 to ask it to set the instance to running
- In S1's sled agent, T2 takes the instance's lock and begins setting up a Propolis process (
setup_propolis_locked) and telling it to create a VM (ensure_propolis_and_tasks)
- A1 times out waiting for T2 to finish its work; start saga unwind begins
- A1 releases virtual provisioning counters as part of unwind
- A1 issues an
ensure_unregistered call to S1
- S1 spawns T3 to handle this call; this blocks waiting for the instance lock (which T2 owns)
- T2 spawns instance monitoring task T4; this calls into Propolis, gets an initial state from it, returns, and then blocks on the instance lock (which T2 still owns)
- T2 drops the lock; T3 acquires it, then unregisters the instance; this terminates the Propolis zone and marks the relevant VMM as
Destroyed, then returns to Nexus
- T1 finishes unwinding saga S1: this clears the VMM ID and moves the instance generation to 4, then returns
- A new start saga begins for the same instance; call it A2. A2 registers a new VMM for the instance, advances the instance generation to 5, re-registers the instance (note that its previous monitoring task T4 is not gone yet!),
- A2 registers a VMM, advancing the instance to generation 5; it then tells sled S2 to register the instance and charges the virtual provisioning collection
- Back over on S1, T4 gets the lock and calls sled agent's
InstanceInner::observe_state to apply its state observation from step 12. This is a no-op because the instance's zone was destroyed in step 13. However, the monitoring task still notifies Nexus of the state change!
- T4 publishes to Nexus an update that says that the instance's VMM is destroyed; this has an instance generation of 3, because that's what was registered with S1 all the way back in step 4
- Nexus's
Instance::notify_instance_updated processes this update; it calls virtual_provisioning_collection_delete_instance again
- This call finds that there's a
virtual_provisioning_resource entry for the instance (it was created in step 16), but does not delete it, because the notify call supplied instance generation 3 and the instance was already advanced to generation 5. However, this doesn't suffice to keep the virtual_provisioning_collection counters from being modified, for reasons explained below.
And there you have it: the same VMM's provisioning counters were removed twice (step 9 and step 20), causing underflow when the project's last instance is stopped.
Analysis, or, "how the heck did you come up with this?"
A dogfood user reported that creating an instance had failed with a mysterious "value is too small for a byte count" error message emanating from a saga node whose output was named "no_result". A quick look through the code reveals that this is the output name for the "allocate disk space" node in the disk create saga, which touches the virtual provisioning tables. As a quick recap:
- The
virtual_provisioning_resource table has one row for every "thing" in the system (running instance, disk, control plane service, etc.) that is consuming some resources (CPU, swap charges, reservoir, disk space) on some sled.
- The
virtual_provisioning_collection table has one row for the fleet, each silo, and each project. Each of these rows tracks the total usage of the provisioning resources that belong to that fleet/silo/project.
Most of the queries that touch the virtual_provisioning_resource tables return a Vec<VirtualProvisioningCollection>. This type includes a ByteCount (for memory and disk) that impls TryFrom<i64>. The conversion fails if the byte count is negative. Looking at this table in dogfood reveals that the user's project has a problem:
id | time_modified | collection_type | virtual_disk_bytes_provisioned | cpus_provisioned | ram_provisioned
---------------------------------------+-------------------------------+-----------------+--------------------------------+------------------+------------------
ea038803-a7cf-4ba8-96cb-4b5bb60f7259 | 2024-02-09 21:05:28.359901+00 | Project | 3435973836800 | -8 | -68719476736
(Note that this failure occurs after the query has executed--it is a failure to convert the values returned from the database into Rust types, not a failure to execute the query itself!)
How did this underflow occur? virtual_provisioning_collection_delete_instance is a complex CTE that has three relevant parts:
- One subquery asks, "should I update the relevant collections?" This subquery is the same for all provisioning collection operations: it returns True if the relevant instance has an entry in the virtual provisioning resource table and false otherwise.
- A second subquery is caller-provided and is used to delete the virtual provisioning resource for the instance. It has a special instance generation number check to try to make sure that, if an instance starts, stops, and starts again, a "leftover" or delayed call to notify Nexus that the instance has stopped won't delete the virtual provisioning resource from the running instance.
- The third subquery subtracts from the
virtual_provisioning_collection table entries. This is not caller-provided; it is executed whenever the first subquery returns "true," even if the second subquery decides that the request to delete the provisioning resource is too old and shouldn't be executed.
This is a bug in virtual_provisioning_collection_delete_instance--both the removal of the resource and the update of the collections should be conditioned on the generation number check, not just the resource removal. Unfortunately the provisioning queries are not quite written to allow this (the caller can supply extra filters to subquery #2, but not to subquery #3).
To show that this is in fact the problem, we can look at the history of the instances in this project, most of which are short-lived VMs that are used for Hubris builds. Here's a table showing VMMs that belonged to these instances, along with the instance creation times and the VMMs' creation and destruction times:
hubris-build-oxide-rot-1-b51c865b-f4b7-4714-b177-07ea11aee552 | f1d52cc1-b77f-48c4-bfcc-d34df457a9d6 | 16 | 2024-02-09 17:02:21.313603+00 | 26444018-057e-4107-994f-2eb56678f044 | 2024-02-09 17:02:27.560346+00 | 2024-02-09 17:03:47.787641+00
hubris-build-oxide-rot-1-b51c865b-f4b7-4714-b177-07ea11aee552 | f1d52cc1-b77f-48c4-bfcc-d34df457a9d6 | 16 | 2024-02-09 17:02:21.313603+00 | 35cf3f25-1f48-4484-8085-de85ec9b61a7 | 2024-02-09 17:03:49.009715+00 | 2024-02-09 17:10:13.603491+00
hubris-build-oxide-rot-1-d6641735-2a09-4472-a818-cccdbf8d69cc | 965abd21-2c7b-4520-bd95-e9bde7b721aa | 16 | 2024-02-09 17:23:18.538819+00 | 9dcbf3ab-3425-47e4-974f-32183a18c675 | 2024-02-09 17:23:23.917451+00 | 2024-02-09 17:29:40.576401+00
hubris-build-oxide-rot-1-4614ea2a-796e-4b9a-8a56-bd57bf4dbe2d | b9aaad2f-4b74-4acc-af66-89b8c49c6d76 | 16 | 2024-02-09 17:36:35.588431+00 | a07cf1db-cf60-476e-8527-d2326b461e56 | 2024-02-09 17:36:40.387528+00 | 2024-02-09 17:43:02.420445+00
hubris-build-lpc55xpresso-4614ea2a-796e-4b9a-8a56-bd57bf4dbe2d | 9da15f1e-f6b6-4e64-9f47-f844d584f4cc | 16 | 2024-02-09 17:36:35.58672+00 | 9e02eb13-1969-4d26-8aa9-42795f6d3c3f | 2024-02-09 17:36:44.766089+00 | 2024-02-09 17:43:06.557992+00
hubris-build-rot-carrier-4614ea2a-796e-4b9a-8a56-bd57bf4dbe2d | 3a9ac063-f379-4ace-ab6d-38fd21a5c617 | 16 | 2024-02-09 17:36:35.587577+00 | 80c29597-1b7f-4ed8-b990-241a0c65201e | 2024-02-09 17:36:41.794754+00 | 2024-02-09 17:43:10.394102+00
hubris-build-oxide-rot-1-77d9c487-f724-4e95-ad05-43f4010ab928 | 3debbe6e-b82d-4bb2-8b1e-bbac5f769141 | 16 | 2024-02-09 17:41:49.341982+00 | 9b76e3f6-276e-421a-a6f2-4e2e4aa65051 | 2024-02-09 17:41:54.011303+00 | 2024-02-09 17:43:28.197523+00
hubris-build-oxide-rot-1-77d9c487-f724-4e95-ad05-43f4010ab928 | 3debbe6e-b82d-4bb2-8b1e-bbac5f769141 | 16 | 2024-02-09 17:41:49.341982+00 | 2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf | 2024-02-09 17:43:28.534384+00 | 2024-02-09 17:49:42.51934+00
hubris-build-oxide-rot-1-1f7dd907-d792-494a-9b0b-7a56aac00151 | 00a5bb2f-c292-486f-a13b-81603db574cb | 16 | 2024-02-09 17:44:20.115873+00 | 644f012e-c78b-459d-b08a-baf3c55f352f | 2024-02-09 17:44:29.106687+00 | 2024-02-09 17:50:44.126764+00
hubris-build-rot-carrier-1f7dd907-d792-494a-9b0b-7a56aac00151 | 1b28b618-7e22-4a8a-a62f-ec3a6bbb547b | 16 | 2024-02-09 17:44:20.115487+00 | 6b0be948-7ec1-48d1-9f3b-d602916e6bdd | 2024-02-09 17:44:25.632522+00 | 2024-02-09 17:50:46.486401+00
hubris-build-lpc55xpresso-1f7dd907-d792-494a-9b0b-7a56aac00151 | 3b0e9b33-2646-4613-8932-aad2b86ed6ce | 16 | 2024-02-09 17:44:20.117503+00 | 047470f6-cc10-49df-af4b-941a9c3e7711 | 2024-02-09 17:44:26.757357+00 | 2024-02-09 17:50:54.98489+00
testing | 3e031726-f523-4a7f-a6c8-afc0e4f41030 | 8 | 2024-02-09 21:05:26.412537+00 | 1cea44bb-320a-4770-b80f-1999795a4417 | 2024-02-09 21:05:27.507978+00 | 2024-02-09 21:05:29.08211+00
Before VMM a07cf1db is started, there are no active VMMs in this project; after it's started, there's at least one VMM until 3b0e9b33 exits.
Rummaging through the Nexus logs looking for state updates from these VMMs reveals some interesting facts. First, we can see where the underflow actually occurs: the last VMM to be destroyed (3b0e9b33) fails to update its state with an underflow error, then sends another update that doesn't hit that error:
3b0e9b33-cleanup.txt
73:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:58.418763158Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-02-09T17:50:54.984890356Z }","propolis_id":"047470f6-cc10-49df-af4b-941a9c3e7711","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:50:54.984890356Z }","instance_id":"3b0e9b33-2646-4613-8932-aad2b86ed6ce"}
// gjc: note "value is too small for a byte count" error message--underflow occurred here
126:{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:59.232031708Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"uri":"/instances/3b0e9b33-2646-4613-8932-aad2b86ed6ce","method":"PUT","req_id":"b63a020f-09b7-4612-8490-cad8c58119ca","remote_addr":"[fd00:1122:3344:102::1]:62162","local_addr":"[fd00:1122:3344:104::3]:12221","component":"dropshot_internal","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"/home/levon/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837","error_message_external":"Internal Server Error","error_message_internal":"unexpected database error: value is too small for a byte count","latency_us":813315,"response_code":"500"}
// gjc: the virtual provisioning resource was deleted previously (remember, the error occurs after the query executes), so this time no collection rows are returned and no bad conversion occurs
127:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:59.397307515Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-02-09T17:50:54.984890356Z }","propolis_id":"047470f6-cc10-49df-af4b-941a9c3e7711","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:50:54.984890356Z }","instance_id":"3b0e9b33-2646-4613-8932-aad2b86ed6ce"}
// gjc: unrelated timeout talking to Oximeter
185:{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:51:15.00813115Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"uri":"/instances/3b0e9b33-2646-4613-8932-aad2b86ed6ce","method":"PUT","req_id":"4da7fa35-f2f6-4667-9c3b-42de7c246c35","remote_addr":"[fd00:1122:3344:102::1]:62162","local_addr":"[fd00:1122:3344:104::3]:12221","component":"dropshot_internal","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"/home/levon/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837","error_message_external":"Internal Server Error","error_message_internal":"failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3b0e9b33-2646-4613-8932-aad2b86ed6ce): operation timed out","latency_us":15610126,"response_code":"500"}
// gjc: timeout resolves itself and we soldier on
186:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:51:15.729916994Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-02-09T17:50:54.984890356Z }","propolis_id":"047470f6-cc10-49df-af4b-941a9c3e7711","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:50:54.984890356Z }","instance_id":"3b0e9b33-2646-4613-8932-aad2b86ed6ce"}
239:{"msg":"vmm is terminated, cleaning up resources","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:51:16.33712833Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1585","propolis_id":"047470f6-cc10-49df-af4b-941a9c3e7711","instance_id":"3b0e9b33-2646-4613-8932-aad2b86ed6ce"}
This instance may only be a victim, though: it was the last one to be cleaned up, so it's possible the accounting error occurred with a different VM and was only noticed here. Looking over the Nexus logs for the other VMs shows one that looks a little different from the others:
3debbe6e-cleanup.txt
// gjc: rather odd: Propolis shouldn't go straight from Starting to Destroyed, so why is the VMM generation 2 here?
1:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:43:27.411487763Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(2), time_updated: 2024-02-09T17:43:27.396779524Z }","propolis_id":"9b76e3f6-276e-421a-a6f2-4e2e4aa65051","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:43:27.396779524Z }","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
// gjc: oximeter timeout again
59:{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:43:43.260631876Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"uri":"/instances/3debbe6e-b82d-4bb2-8b1e-bbac5f769141","method":"PUT","req_id":"498f4377-b563-416d-93a4-57e398af8ef5","remote_addr":"[fd00:1122:3344:103::1]:39410","local_addr":"[fd00:1122:3344:104::3]:12221","component":"dropshot_internal","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"/home/levon/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837","error_message_external":"Internal Server Error","error_message_internal":"failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3debbe6e-b82d-4bb2-8b1e-bbac5f769141): operation timed out","latency_us":15848392,"response_code":"500"}
// gjc: new attempt to publish the same state
60:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:43:43.424297694Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(2), time_updated: 2024-02-09T17:43:27.396779524Z }","propolis_id":"9b76e3f6-276e-421a-a6f2-4e2e4aa65051","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:43:27.396779524Z }","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
// gjc: this one worked
78:{"msg":"vmm is terminated, cleaning up resources","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:43:43.459435646Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1585","propolis_id":"9b76e3f6-276e-421a-a6f2-4e2e4aa65051","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
// gjc: then the VMM gets started and stopped again (normally)
157:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:49:45.859126539Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-02-09T17:49:42.519340611Z }","propolis_id":"2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(7), time_updated: 2024-02-09T17:49:42.519340611Z }","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
215:{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:01.4832182Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"uri":"/instances/3debbe6e-b82d-4bb2-8b1e-bbac5f769141","method":"PUT","req_id":"8639f2ce-acf7-470f-9551-e8acbc9d4030","remote_addr":"[fd00:1122:3344:106::1]:61489","local_addr":"[fd00:1122:3344:104::3]:12221","component":"dropshot_internal","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"/home/levon/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837","error_message_external":"Internal Server Error","error_message_internal":"failed to delete producer from collector: Communication Error: error sending request for url (http://[fd00:1122:3344:10a::3]:12223/producers/3debbe6e-b82d-4bb2-8b1e-bbac5f769141): operation timed out","latency_us":15623404,"response_code":"500"}
216:{"msg":"received new runtime state from sled agent","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:01.689966318Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1472","vmm_state":"VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-02-09T17:49:42.519340611Z }","propolis_id":"2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf","instance_state":"InstanceRuntimeState { propolis_id: None, dst_propolis_id: None, migration_id: None, gen: Generation(7), time_updated: 2024-02-09T17:49:42.519340611Z }","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
269:{"msg":"vmm is terminated, cleaning up resources","v":0,"name":"nexus","level":30,"time":"2024-02-09T17:50:03.181895019Z","hostname":"oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a","pid":28990,"component":"nexus","component":"ServerContext","name":"20b100d0-84c3-4119-aa9b-0c632b0b6a3a","file":"nexus/src/app/instance.rs:1585","propolis_id":"2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf","instance_id":"3debbe6e-b82d-4bb2-8b1e-bbac5f769141"}
The "starting to destroyed" transition suggests that something went amiss with starting 3debbe6e the first time: Propolis doesn't do that, so this is a sign that something may have gone wrong in the instance startup path. And, sure enough, we find that a start attempt for this instance failed and tried to unwind:
17:42:55.038Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga: unregistering instance from sled
file = nexus/src/app/sagas/instance_start.rs:550
instance_id = 3debbe6e-b82d-4bb2-8b1e-bbac5f769141
saga_id = f7278691-37f4-435f-9196-157afa67999b
saga_name = instance-start
sled_id = 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
Not too longer after this we see the instance being registered again as part of a new attempt to start:
17:43:29.414Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga: ensuring instance is registered on sled
file = nexus/src/app/sagas/instance_start.rs:512
instance_id = 3debbe6e-b82d-4bb2-8b1e-bbac5f769141
saga_id = 415e11aa-cc85-49c7-beb2-cdd545708144
saga_name = instance-start
sled_id = b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
Followed by a successful attempt to start it:
17:43:29.666Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga: ensuring instance is running
file = nexus/src/app/sagas/instance_start.rs:661
instance_id = 3debbe6e-b82d-4bb2-8b1e-bbac5f769141
saga_id = 415e11aa-cc85-49c7-beb2-cdd545708144
saga_name = instance-start
sled_id = b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
17:43:29.686Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): authorize result
action = Modify
actor = Some(Actor::SiloUser { silo_user_id: 45ca902b-ac16-4927-a33b-529493f911af, silo_id: 7bd7623a-68ed-4636-8ecb-b59e3b068787, .. })
actor_id = 45ca902b-ac16-4927-a33b-529493f911af
authenticated = true
resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 7bd7623a-68ed-4636-8ecb-b59e3b068787, lookup_type: ById(7bd7623a-68ed-4636-8ecb-b59e3b068787) }, key: ea038803-a7cf-4ba8-96cb-4b5bb60f7259, lookup_type: ById(ea038803-a7cf-4ba8-96cb-4b5bb60f7259) }, key: 3debbe6e-b82d-4bb2-8b1e-bbac5f769141, lookup_type: ById(3debbe6e-b82d-4bb2-8b1e-bbac5f769141) }
result = Ok(())
saga_id = 415e11aa-cc85-49c7-beb2-cdd545708144
saga_name = instance-start
saga_node = EnsureRunning
17:43:29.701Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): authorize result
action = Modify
actor = Some(Actor::SiloUser { silo_user_id: 45ca902b-ac16-4927-a33b-529493f911af, silo_id: 7bd7623a-68ed-4636-8ecb-b59e3b068787, .. })
actor_id = 45ca902b-ac16-4927-a33b-529493f911af
authenticated = true
resource = Instance { parent: Project { parent: Silo { parent: Fleet, key: 7bd7623a-68ed-4636-8ecb-b59e3b068787, lookup_type: ById(7bd7623a-68ed-4636-8ecb-b59e3b068787) }, key: ea038803-a7cf-4ba8-96cb-4b5bb60f7259, lookup_type: ById(ea038803-a7cf-4ba8-96cb-4b5bb60f7259) }, key: 3debbe6e-b82d-4bb2-8b1e-bbac5f769141, lookup_type: ById(3debbe6e-b82d-4bb2-8b1e-bbac5f769141) }
result = Ok(())
saga_id = 415e11aa-cc85-49c7-beb2-cdd545708144
saga_name = instance-start
saga_node = EnsureRunning
17:43:29.743Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client request
SledAgent = b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
body = Some(Body)
method = PUT
uri = http://[fd00:1122:3344:106::1]:12345/instances/3debbe6e-b82d-4bb2-8b1e-bbac5f769141/state
17:44:26.787Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response
SledAgent = b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:106::1)), port: Some(12345), path: "/instances/3debbe6e-b82d-4bb2-8b1e-bbac5f769141/state", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "a044e8bb-5347-4703-8f80-07ed1cb4f2bc", "content-length": "331", "date": "Fri, 09 Feb 2024 17:44:26 GMT"} })
17:44:26.787Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): writing instance state returned from sled agent
instance_id = 3debbe6e-b82d-4bb2-8b1e-bbac5f769141
new_state = Some(SledInstanceState { instance_state: InstanceRuntimeState { propolis_id: Some(2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf), dst_propolis_id: None, migration_id: None, gen: Generation(5), time_updated: 2024-02-09T17:43:28.563370514Z }, propolis_id: 2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf, vmm_state: VmmRuntimeState { state: Starting, gen: Generation(1), time_updated: 2024-02-09T17:43:28.534384Z } })
17:44:26.792Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): attempted to write instance state from sled agent
instance_id = 3debbe6e-b82d-4bb2-8b1e-bbac5f769141
propolis_id = 2ba1f42e-d7b5-4fb9-a69b-a5bb89da8ccf
result = Ok((false, false))
The second start attempt's virtual provisioning counters must have been acquired by 17:43:29.666Z (when the attempt to run the instance is logged). The retried attempt to send state from the old VMM lands at 17:43:43.424, which is after this but before the instance has actually started. This send has an obsolete instance generation, but (as described above) that's not enough to prevent the counters from being deleted. And there you have it!
Summary
There appears to be an extremely subtle race between the instance start saga, its unwind path, and the instance cleanup logic in sled agent that can cause an instance's virtual provisioning counters to underflow. It goes something like this:
setup_propolis_locked) and telling it to create a VM (ensure_propolis_and_tasks)ensure_unregisteredcall to S1Destroyed, then returns to NexusInstanceInner::observe_stateto apply its state observation from step 12. This is a no-op because the instance's zone was destroyed in step 13. However, the monitoring task still notifies Nexus of the state change!Instance::notify_instance_updatedprocesses this update; it callsvirtual_provisioning_collection_delete_instanceagainvirtual_provisioning_resourceentry for the instance (it was created in step 16), but does not delete it, because the notify call supplied instance generation 3 and the instance was already advanced to generation 5. However, this doesn't suffice to keep thevirtual_provisioning_collectioncounters from being modified, for reasons explained below.And there you have it: the same VMM's provisioning counters were removed twice (step 9 and step 20), causing underflow when the project's last instance is stopped.
Analysis, or, "how the heck did you come up with this?"
A dogfood user reported that creating an instance had failed with a mysterious "value is too small for a byte count" error message emanating from a saga node whose output was named "no_result". A quick look through the code reveals that this is the output name for the "allocate disk space" node in the disk create saga, which touches the virtual provisioning tables. As a quick recap:
virtual_provisioning_resourcetable has one row for every "thing" in the system (running instance, disk, control plane service, etc.) that is consuming some resources (CPU, swap charges, reservoir, disk space) on some sled.virtual_provisioning_collectiontable has one row for the fleet, each silo, and each project. Each of these rows tracks the total usage of the provisioning resources that belong to that fleet/silo/project.Most of the queries that touch the
virtual_provisioning_resourcetables return aVec<VirtualProvisioningCollection>. This type includes aByteCount(for memory and disk) that implsTryFrom<i64>. The conversion fails if the byte count is negative. Looking at this table in dogfood reveals that the user's project has a problem:(Note that this failure occurs after the query has executed--it is a failure to convert the values returned from the database into Rust types, not a failure to execute the query itself!)
How did this underflow occur?
virtual_provisioning_collection_delete_instanceis a complex CTE that has three relevant parts:virtual_provisioning_collectiontable entries. This is not caller-provided; it is executed whenever the first subquery returns "true," even if the second subquery decides that the request to delete the provisioning resource is too old and shouldn't be executed.This is a bug in
virtual_provisioning_collection_delete_instance--both the removal of the resource and the update of the collections should be conditioned on the generation number check, not just the resource removal. Unfortunately the provisioning queries are not quite written to allow this (the caller can supply extra filters to subquery #2, but not to subquery #3).To show that this is in fact the problem, we can look at the history of the instances in this project, most of which are short-lived VMs that are used for Hubris builds. Here's a table showing VMMs that belonged to these instances, along with the instance creation times and the VMMs' creation and destruction times:
Before VMM
a07cf1dbis started, there are no active VMMs in this project; after it's started, there's at least one VMM until3b0e9b33exits.Rummaging through the Nexus logs looking for state updates from these VMMs reveals some interesting facts. First, we can see where the underflow actually occurs: the last VMM to be destroyed (
3b0e9b33) fails to update its state with an underflow error, then sends another update that doesn't hit that error:This instance may only be a victim, though: it was the last one to be cleaned up, so it's possible the accounting error occurred with a different VM and was only noticed here. Looking over the Nexus logs for the other VMs shows one that looks a little different from the others:
The "starting to destroyed" transition suggests that something went amiss with starting
3debbe6ethe first time: Propolis doesn't do that, so this is a sign that something may have gone wrong in the instance startup path. And, sure enough, we find that a start attempt for this instance failed and tried to unwind:Not too longer after this we see the instance being registered again as part of a new attempt to start:
Followed by a successful attempt to start it:
The second start attempt's virtual provisioning counters must have been acquired by 17:43:29.666Z (when the attempt to run the instance is logged). The retried attempt to send state from the old VMM lands at 17:43:43.424, which is after this but before the instance has actually started. This send has an obsolete instance generation, but (as described above) that's not enough to prevent the counters from being deleted. And there you have it!