Fix TOCTOU when adding network interfaces to an instance#1222
Conversation
- Fixes a TOCTOU bug, where we checked if an instance is stopped and then issue a separate query to actually create/attach a network interface to it. Done by adding an additional subquery which verifies the instance state in an additional CTE, at both create and delete time. - Adds additional tests for checking the new instance verification - Improves instance integration test coverage by using the endpoint to list interfaces both when the expected list is empty and non-empty
|
Resolves #1134 |
| // NOTE: We need to lookup the VPC and VPC Subnet, since we need both | ||
| // IDs for creating the network interface. | ||
| // | ||
| // TODO-correctness: There are additional races here. The VPC and VPC |
There was a problem hiding this comment.
If it isn't one, it's another, eh?
(Semi-related to this PR, I really want better testing / analysis to help us suss these issues out. DB-related raciness is an area with a lot of potential for unsafety)
There was a problem hiding this comment.
If it isn't one, it's another, eh?
Indeed :)
I really want better testing / analysis to help us suss these issues out
Yeah, that'd be nice. I'm not sure how to force ourselves to hit this, since the flow would be: fetch the VPC Subnet successfully, delete the VPC Subnet, then try to run this query. How would we do that? Maybe part of a set of stress tests?
There was a problem hiding this comment.
I'm not sure. I can think of a view options:
- Random-walk stress testing: ideally, having a "set of operations" and "set of objects/names" that can all be acted upon, and doing some random-walk style operations would be neat.
- Permutation testing: we could try to integrate something like https://github.com/tokio-rs/loom to invasively test the DB-related ops. There would still be reliance on CRDB, however, so these tests would probably not run as a part of the CI check - they might be delegated to "once daily" jobs.
- Model checking: we could attempt to model the DB interactions externally (e.g., through TLA+) and try to prove safety properties about them. I'm skeptical about the cost/benefit of this approach; it seems like a more expensive option.
There was a problem hiding this comment.
I like the second approach. The first is probably easiest, but I'd be worried about baking in our assumptions about how things should be used. I agree the formal verification approach is likely to be too heavyweight.
- Reduce const visibility - SQL formatting - Refactor tests for specific behavior
Propolis changes: Add `IntrPin::import_state` and migrate LPC UART pin states (#669) Attempt to set WCE for raw file backends Fix clippy/lint nits for rust 1.77.0 Crucible changes: Correctly (and robustly) count bytes (#1237) test-replay.sh fix name of DTrace script (#1235) BlockReq -> BlockOp (#1234) Simplify `BlockReq` (#1218) DTrace, cmon, cleanup, retry downstairs connections at 10 seconds. (#1231) Remove `MAX_ACTIVE_COUNT` flow control system (#1217) Crucible changes that were in Omicron but not in Propolis before this commit. Return *410 Gone* if volume is inactive (#1232) Update Rust crate opentelemetry to 0.22.0 (#1224) Update Rust crate base64 to 0.22.0 (#1222) Update Rust crate async-recursion to 1.1.0 (#1221) Minor cleanups to extent implementations (#1230) Update Rust crate http to 0.2.12 (#1220) Update Rust crate reedline to 0.30.0 (#1227) Update Rust crate rayon to 1.9.0 (#1226) Update Rust crate nix to 0.28 (#1223) Update Rust crate async-trait to 0.1.78 (#1219) Various buffer optimizations (#1211) Add low-level test for message encoding (#1214) Don't let df failures ruin the buildomat tests (#1213) Activate the NBD server's psuedo file (#1209) --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
then issue a separate query to actually create/attach a network
interface to it. Done by adding an additional subquery which verifies
the instance state in an additional CTE, at both create and delete
time.
list interfaces both when the expected list is empty and non-empty