After #8912, when Nexus starts up and determines that it's part of the set of Nexus instances currently in control, it will perform any needed database schema update. In the normal self-service update case, this is necessary and correct. However, if we had a normally running system after #8912 and someone were to MUPdate a sled that has a Nexus on it to a new version of Nexus with a new database schema, that Nexus would find that it's still in the active set and do a schema update that would pull the rug out from under the other Nexus instances (violating our runtime constraint that every Nexus instance only ever uses a database running the schema version that it knows about).
The assumption is that this would probably have been an operator / support mistake. It's always going to be dangerous to MUPdate a sled that's part of the control plane while the control plane is running, both because of problems like this (and analogous problems with Crucible or Oximeter, which have their own on-disk formats and schema versions) and because you could be invalidating inter-API version dependencies, etc. But this Nexus case is particularly bad since in principle it could lead to permanent control plane database corruption. If it's easy to prevent, that seems worthwhile. On today's watercooler call we discussed preventing this by having the `db_metadata_nexus` records contain the Nexus image id, having Nexus know its own image id, and then having it stop on startup if it finds that its image doesn't match the one in its record. (We'd presumably also want it to expose this status somewhere so that an operator or support can see what's happened.)
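
To make the proposed check concrete, here is a minimal sketch of the startup comparison. Everything in it is hypothetical: the `DbMetadataNexusRecord` struct, its `image_id` field, the `StartupDecision` enum, and the function names are illustrative stand-ins, not the actual Nexus types or schema, and image ids are modeled as plain strings.

```rust
/// Hypothetical stand-in for one row of `db_metadata_nexus`, extended with
/// the image id that was recorded for this Nexus instance.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DbMetadataNexusRecord {
    pub nexus_id: u32,
    pub image_id: String,
}

/// Outcome of the startup check (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StartupDecision {
    /// The running image matches the record; normal startup may continue,
    /// including any needed schema update.
    Proceed,
    /// The running image differs from the record (for example, after a
    /// MUPdate); stop before touching the database schema.
    Halt { recorded_image_id: String, running_image_id: String },
}

/// Compare the image id stored in this Nexus's record with the image id
/// of the Nexus that is actually starting up.
pub fn check_image_on_startup(
    record: &DbMetadataNexusRecord,
    running_image_id: &str,
) -> StartupDecision {
    if record.image_id == running_image_id {
        StartupDecision::Proceed
    } else {
        StartupDecision::Halt {
            recorded_image_id: record.image_id.clone(),
            running_image_id: running_image_id.to_string(),
        }
    }
}

/// Render the decision as a message that could be surfaced to an operator
/// or support (where it would be exposed is left open here).
pub fn describe(decision: &StartupDecision) -> String {
    match decision {
        StartupDecision::Proceed => "image matches record".to_string(),
        StartupDecision::Halt { recorded_image_id, running_image_id } => format!(
            "image mismatch: record has {}, running {}",
            recorded_image_id, running_image_id
        ),
    }
}
```

The important property is ordering: the comparison would have to run before the schema-update step, so that a mismatched Nexus halts without modifying the database the other instances depend on.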