Report physical disks through the inventory system#5145
Conversation
1ffd4d1 to
b962606
Compare
jgallagher
left a comment
There was a problem hiding this comment.
LGTM; just two tiny nits picking at memory usage that probably doesn't matter - feel free to ignore either or both.
| @@ -0,0 +1,13 @@ | |||
| CREATE TABLE IF NOT EXISTS omicron.public.inv_physical_disk ( | |||
There was a problem hiding this comment.
Can/should we include the slot number here? (if so, can/should we use that for pagination/uniqueness?)
There was a problem hiding this comment.
Sure, we'll probably want that out of inventory anyway. Also, I agree, this seems like a better PK ("sled id" + "slot") than the reported values of serial/model/vendor.
Updated to plumb through the slot from the sled agent, and using as pagination key in this PR.
| .context("reservoir_size")?, | ||
| disks: self | ||
| .storage | ||
| .lock() |
There was a problem hiding this comment.
Are there any situations where this lock might be held for a long time (thinking tens of seconds or minutes)?
There was a problem hiding this comment.
I don't believe so, especially not since this is the simulated sled agent. At a cursory glance, i think it's mostly just getting locked to populate / grab contents from a group of hashmaps.
| use db::schema::inv_physical_disk::dsl; | ||
|
|
||
| let mut iter = | ||
| disks.chunks(SQL_BATCH_SIZE.get().try_into().unwrap()); |
There was a problem hiding this comment.
In general I like the idea of inserting in batches. But this is all going into one transaction so I'm not sure it's any better in that case.
There was a problem hiding this comment.
The batch size is still fairly large 1,000, so it's unlikely we'll be looping much regardless on a single rack or two. But I could see this blowing past the size of a "single SQL statement" if we consider it for larger multi-rack cases?
|
Thank you @jgallagher and @davepacheco for the feedback! Apologies for the delay on responding, I've been building off this PR in #5172 and trying to make sure it'll work -- seems like yes. I believe I've responded to all feedback -- I know y'all have me the +1 to merge, but I'll wait another ~half day if there are any other comments. |
|
|
| InvPhysicalDisk::new(collection_id, *sled_id, disk.clone()) | ||
| }) | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Tiny nit: I think you can drop this collect() (and the corresponding : Vec<_> type hint), make this mut, and then remove the disks.into_iter() call on line 659.
There was a problem hiding this comment.
This causes the following compilation error:
error: implementation of `FnOnce` is not general enough
--> nexus/src/app/background/inventory_collection.rs:58:9
|
58 | / async {
59 | | match inventory_activate(
60 | | opctx,
61 | | &self.datastore,
... |
88 | | }
89 | | .boxed()
| |________________^ implementation of `FnOnce` is not general enough
|
= note: closure with signature `fn(&'0 nexus_types::inventory::PhysicalDisk) -> InvPhysicalDisk` must implement `FnOnce<(&'1 nexus_types::inventory::PhysicalDisk,)>`, for any two lifetimes `'0` and `'1`...
= note: ...but it actually implements `FnOnce<(&nexus_types::inventory::PhysicalDisk,)>`
I believe the usage of disks.into_iter() is called within a transaction, which is where the FnOnce comes from?
I think I'm going to keep this pattern of "collect it into Vec, pass that Vec to the FnOnce" for the moment, because the lifetime complexity doesn't seem worth trying to optimize right now
There was a problem hiding this comment.
Grrrr, this is very annoying. I checked that diff against nexus-db-queries and it was fine, but didn't check whether it cause compilation errors down the line. That feels like a bug - if my function signature hasn't changed, why can implementation changes cause client breakage?
There was a problem hiding this comment.
Thanks to @cbiffle who found we're hitting rust-lang/rust#99492 here; this ought to work. I found a couple weird looking workarounds using hints from that issue:
@@ -127,15 +127,10 @@ impl DataStore {
})
.collect::<Result<Vec<_>, Error>>()?;
// Pull disks out of all sled agents
- let disks: Vec<_> = collection
- .sled_agents
- .iter()
- .flat_map(|(sled_id, sled_agent)| {
- sled_agent.disks.iter().map(|disk| {
- InvPhysicalDisk::new(collection_id, *sled_id, disk.clone())
- })
- })
- .collect();
+ let mut disks =
+ collection.sled_agents.iter().flat_map(|(sled_id, sled_agent)| {
+ std::iter::repeat(sled_id).zip(sled_agent.disks.iter())
+ });
// Partition the sled agents into those with an associated baseboard id
// and those without one. We handle these pretty differently.
@@ -656,10 +651,18 @@ impl DataStore {
use db::schema::inv_physical_disk::dsl;
let batch_size = SQL_BATCH_SIZE.get().try_into().unwrap();
- let mut disks = disks.into_iter();
loop {
- let some_disks =
- disks.by_ref().take(batch_size).collect::<Vec<_>>();
+ let some_disks = disks
+ .by_ref()
+ .take(batch_size)
+ .map(|(&sled_id, disk)| {
+ InvPhysicalDisk::new(
+ collection_id,
+ sled_id,
+ disk.clone(),
+ )
+ })
+ .collect::<Vec<_>>();
if some_disks.is_empty() {
break;
}or if we don't like collection_id floating around, it can go in the repeat too:
@@ -127,15 +127,11 @@ impl DataStore {
})
.collect::<Result<Vec<_>, Error>>()?;
// Pull disks out of all sled agents
- let disks: Vec<_> = collection
- .sled_agents
- .iter()
- .flat_map(|(sled_id, sled_agent)| {
- sled_agent.disks.iter().map(|disk| {
- InvPhysicalDisk::new(collection_id, *sled_id, disk.clone())
- })
- })
- .collect();
+ let mut disks =
+ collection.sled_agents.iter().flat_map(|(sled_id, sled_agent)| {
+ std::iter::repeat((collection_id, *sled_id))
+ .zip(sled_agent.disks.iter())
+ });
// Partition the sled agents into those with an associated baseboard id
// and those without one. We handle these pretty differently.
@@ -656,10 +652,18 @@ impl DataStore {
use db::schema::inv_physical_disk::dsl;
let batch_size = SQL_BATCH_SIZE.get().try_into().unwrap();
- let mut disks = disks.into_iter();
loop {
- let some_disks =
- disks.by_ref().take(batch_size).collect::<Vec<_>>();
+ let some_disks = disks
+ .by_ref()
+ .take(batch_size)
+ .map(|((collection_id, sled_id), disk)| {
+ InvPhysicalDisk::new(
+ collection_id,
+ sled_id,
+ disk.clone(),
+ )
+ })
+ .collect::<Vec<_>>();
if some_disks.is_empty() {
break;
}I don't know that either of these are worth saving the intermediate .collect(), but at least we (kind of) understand it.
There was a problem hiding this comment.
Yeah I ran into this yesterday. It's a pretty bad bug in rustc.
| @@ -0,0 +1,828 @@ | |||
| { | |||
There was a problem hiding this comment.
Is this file used? I recently ran into something where I considered updating the RSS on-disk plan format, and found a way to not do that to avoid the hassle.
There was a problem hiding this comment.
It is used, and is stored, but it's not used particularly frequently.
- We always try to "load a plan if one already exists" before creating a new one:
omicron/sled-agent/src/rack_setup/service.rs
Lines 959 to 971 in 145f25c
- This tries to load the plan with the latest version: , and in other cases, at least it tries to report a better error
omicron/sled-agent/src/rack_setup/plan/service.rs
Lines 137 to 172 in 145f25c
In the limit: I think we may get more mileage out of "wiping everything and forcing RSS to restart" in the case where we restart, which would make these intermediate files removable.
There was a problem hiding this comment.
But those bits of code are still referencing the "rss-service-plan-v2.json" path, right? Should we bump that to v3? (I don't see anywhere where the schema json is checked, either? Sorry if I'm just missing it.)
There was a problem hiding this comment.
Ugh, thank you, good catch, I'm using this in #5172 , but not here yet. I'll pull this into the appropriate PR.
The purpose of that file is to exercise the CLI with data saved from a real system. That seemed useful because part of the goal was to be able to import data from real systems. But it could be more trouble than it's worth while we're still evolving the blueprint format. Also, it doesn't have to be any particular system. The test will dynamically pick an inventory collection and blueprints that it finds. |
I logged onto a machine set up by a4x2, configured my ssh keys to be able to copy files into and out of that machine, used zlogin to access, the switch zone, and ran I can spin up this environment and take another lap - to clarify, do I need to do |
|
Okay, I ran:
|
inv_physical_disktableFixes #5150