Conversation
… explicit refresh
| "add", | ||
| &format!("vm-{}", inner.propolis_id()), | ||
| &smf_instance_name, | ||
| "refresh", |
There was a problem hiding this comment.
We had talked about putting an additional check here, that the config/server_addr property has actually taken for the instance. Do you think that's overly paranoid?
There was a problem hiding this comment.
I think I'm okay omitting it, based on the data collected (500+ iterations without the server_addr missing), but I'd be happy to add it later if we see any issues?
| ) | ||
| .await?; | ||
|
|
||
| info!(inner.log, "Adding service property group"); |
There was a problem hiding this comment.
Maybe add the service name to these log messages? The actual address might also be nice, in the svccfg setprop call below.
There was a problem hiding this comment.
I'll add more context. These all have the instance ID as part of the context already, so the log messages look like:
[2022-05-27T00:08:00.58389195Z] INFO: SledAgent/InstanceManager/6692 on thelio: Adding service property group (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.617921961Z] INFO: SledAgent/InstanceManager/6692 on thelio: Setting server address property (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.649773537Z] INFO: SledAgent/InstanceManager/6692 on thelio: Refreshing instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:00.675333382Z] INFO: SledAgent/InstanceManager/6692 on thelio: Enabling instance (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
This should make it easy to disambiguate if multiple instances are being allocated simultaneously
bnaecker
left a comment
There was a problem hiding this comment.
Thanks for taking this on! Looks good, and sounds like you've not hit any similar races after a while through the instance creation loop. I have a couple of comments, but otherwise 👍 .
|
For reference, this should close #1115. |
Crucible changes: Remove unused fields in IOop (#1149) New downstairs clone subcommand. (#1129) Simplify the do_work_task loop (#1150) Move `Guest` stuff into a module (#1125) Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite` work to a separate task (#1145) Use fewer borrows in ExtentInner API (#1147) Update Rust crate reedline to 0.28.0 (#1141) Update Rust crate tokio to 1.36 (#1143) Update Rust crate slog-bunyan to 2.5.0 (#1139) Update Rust crate rayon to 1.8.1 (#1138) Update Rust crate itertools to 0.12.1 (#1137) Update Rust crate byte-unit to 5.1.4 (#1136) Update Rust crate base64 to 0.21.7 (#1135) Update Rust crate async-trait to 0.1.77 (#1134) Discard deferred msgs (#1131) Minor Downstairs cleanup (#1127) Update test_fail_live_repair to support pstop (#1128) Ignore client messages after stopping the IO task (#1126) Move client IO task into a struct (#1124) Bump Rust to 1.75 and fix new Clippy lints (#1123) Propolis changes: PHD: convert to async (#633) PHD: assume specialized Windows images (#636) propolis-standalone-config needn't be a crate standalone: Use tar for snapshot/restore phd: use latest "lab-2.0-opte" target, not a specific version (#637) PHD: add tests for migration of running processes (#623) PHD: fix `cargo xtask phd` tidy not doing anything (#630) PHD: add documentation for `cargo xtask phd` (#629) standalone: improve virtual device creation errors (#632) phd: add Windows Server 2019 guest adapter (#627) PHD: add `cargo xtask phd` to make using PHD nicer (#619)
Crucible changes: Remove unused fields in IOop (#1149) New downstairs clone subcommand. (#1129) Simplify the do_work_task loop (#1150) Move `Guest` stuff into a module (#1125) Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite` work to a separate task (#1145) Use fewer borrows in ExtentInner API (#1147) Update Rust crate reedline to 0.28.0 (#1141) Update Rust crate tokio to 1.36 (#1143) Update Rust crate slog-bunyan to 2.5.0 (#1139) Update Rust crate rayon to 1.8.1 (#1138) Update Rust crate itertools to 0.12.1 (#1137) Update Rust crate byte-unit to 5.1.4 (#1136) Update Rust crate base64 to 0.21.7 (#1135) Update Rust crate async-trait to 0.1.77 (#1134) Discard deferred msgs (#1131) Minor Downstairs cleanup (#1127) Update test_fail_live_repair to support pstop (#1128) Ignore client messages after stopping the IO task (#1126) Move client IO task into a struct (#1124) Bump Rust to 1.75 and fix new Clippy lints (#1123) Propolis changes: PHD: convert to async (#633) PHD: assume specialized Windows images (#636) propolis-standalone-config needn't be a crate standalone: Use tar for snapshot/restore phd: use latest "lab-2.0-opte" target, not a specific version (#637) PHD: add tests for migration of running processes (#623) PHD: fix `cargo xtask phd` tidy not doing anything (#630) PHD: add documentation for `cargo xtask phd` (#629) standalone: improve virtual device creation errors (#632) phd: add Windows Server 2019 guest adapter (#627) PHD: add `cargo xtask phd` to make using PHD nicer (#619) Co-authored-by: Alan Hanson <alan@oxide.computer>
Fixes #1115 , hopefully.
Previously
I used the following script:
After < 10 iterations, I'd typically see an SMF failure, either due to a missing property, or due to some issue on import.
After discussion with @bnaecker and @rmustacc , we confirmed that the Propolis Server zone automatically imports the manifest, and that it may be racing with an explicit import.
This PR
This PR performs the following actions:
svccfg: Pattern 'svc:/system/illumos/propolis- server' doesn't match any instances or services), but succeeds once the built-inconfigdservice has finished importing the propolis manifest.The aforementioned "create an instance, stop it, destroy it" script has been running on my machine for nearly 100 iterations without a single failure using this PR, which is a major improvement.