Skip to content

core/device: fix state transition on switching root with persistent database#23437

Closed
yuwata wants to merge 4 commits intosystemd:mainfrom
yuwata:core-device-switching-root-persistent-database
Closed

core/device: fix state transition on switching root with persistent database#23437
yuwata wants to merge 4 commits intosystemd:mainfrom
yuwata:core-device-switching-root-persistent-database

Conversation

@yuwata
Copy link
Copy Markdown
Member

@yuwata yuwata commented May 19, 2022

Fixes #23429.

@yuwata yuwata added the pid1 label May 19, 2022
@yuwata yuwata added this to the v251 milestone May 19, 2022
@yuwata
Copy link
Copy Markdown
Member Author

yuwata commented May 19, 2022

cc @mwilck and @fbuihuu. Also, @bluca and @keszybz. PTAL.

* device_catchup() or uevents. */

if (MANAGER_IS_SWITCHING_ROOT(m)) {
if (MANAGER_IS_SWITCHING_ROOT(m) && !FLAGS_SET(d->enumerated_found, DEVICE_FOUND_UDEV)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks similar to the condition I use in #23436, which fixes #23429 too, so I guess this should do the trick.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are able to reproduce the issue, can you confirm whether this fixes it for you?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I'll give it a test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing these "dead->plugged" transitions anymore right after switching root for devices with the "db_persist" option set.

Copy link
Copy Markdown
Contributor

@mwilck mwilck May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

After looking at the dm-crypt example, I have to say this statement of mine was wrong. dm-crypt devices apparently need to be in "plugged" state early on. We can't wait for udev coldplug to establish this state, because cryptsetup@,service may need to be activated before udev after switching root, and that wouldn't happen if the device was only in tentative state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if db_persist is used it's clearly an indication that the device must remain plugged, see the doc. If SYSTEMD_READY=0 is used in the host then it seems actually expected that the device goes through the plugged->dead transition when the new set of rules from the host is processed by udev.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one possible intepretation of the text in udev(7). But you're just inventing it :-)

db_persist dates back to 2011, before the systemd/udev merge. The man page paragraph was added much later. I don't think the relationship of db_persist and systemd's state attribute has ever been clearly defined.

db_persist was created in the past for LVM, because LVM uses "change" events for device activation rather than "add" events, and needs special treatment for coldplug "add" events for devices that are already active (see 10-dm.rules). LVM doesn't care about systemd's state, so as far as LVM is concerned, it doesn't matter whether we translate db_persist to DEVICE_READY or DEVICE_TENTATIVE.

I still think DEVICE_TENTATIVE is safer. It doesn't do harm, because the actual state will be resolved sooner or later either way, and it avoids premature activation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

@mwilck The host PID1 has used and honored the persistent database generated in initrd at least more than two years. Why do you want to change the behavior? Of course, we have not documented the behavior, but IIRC, there is no issue related to that.
As reported in #23429, after PR #23218, the device with persistent database has unnecessary state transition. It may be harmless in most cases, but such state transition did not occur previously. If we can avoid that, why not to do that?

Well if db_persist is used it's clearly an indication that the device must remain plugged, see the doc. If SYSTEMD_READY=0 is used in the host then it seems actually expected that the device goes through the plugged->dead transition when the new set of rules from the host is processed by udev.

@fbuihuu I agree with you. Even if the behavior is not documented, we have done so for a while. There is no reason to change it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwilck The host PID1 has used and honored the persistent database generated in initrd at least more than two years. Why do you want to change the behavior?

@yuwata, you're right that as a matter of fact, systemd kept DEVICE_READY state over switching root. It did this for all devices though, which was wrong, and was fixed in #23218. #23218 did indeed "change the behavior". I hope we agree it did so for the better.

The open question is how to deal with db_persist. We need to realize that we have two conflicting semantic rules:

  1. udev rules are allowed to be different between initrd and root fs, and thus need to be re-parsed and re-executed after switching root before switching device states and forwarding them to dependent units.
  2. db_persist devices keep state over switching root.

@fbuihuu and you think that rule 2) defines an exception to rule 1). I disagree. 1) is a general principle. Nothing prevents udev rules from changing between initrd and root FS. Thus if we run device_set_state(DEVICE_READY) on these devices, we'd open up new ways for the race described in #23208 to occur, which may cause boot failure.

db_persist is only used by dm and md. After switching root, the udev rules for these subsystems need to distinguish between "activation" events and other events, which requires certain udev properties to survive the cleanup (see e.g. this dracut commit). This distinction is made in udev rules and relies 100% on properties in the udev db. It has nothing to do with systemd's device activation. Thus for the current consumers of db_persist, it doesn't matter if we set the state to "tentative" rather than "plugged" here.

It does matter for the further processing of dependent devices, though. The processing of dm and md devices during boot relies heavily on (udev) coldplug. It's not in the interest of these subsystems to have dependent units activated before the (udev) coldplug events have been processed. Strictly speaking, after switching root, we should never call device_set_state(DEVICE_READY), except from device_dispatch_io().

However, I realize that this hunk is not the point. With this hunk, we'll set DEVICE_READY in device_coldplug() for db_persist devices, whereas without it, we set it in device_catchup(). If my argument above is correct, both calls would be premature. @fbuihuu, with this observation I don't understand how this patch fixes your issue. The extra "dead→plugged" transition should still observed, just a little bit earlier.

* device_catchup() or uevents. */

if (MANAGER_IS_SWITCHING_ROOT(m)) {
if (MANAGER_IS_SWITCHING_ROOT(m) && !FLAGS_SET(d->enumerated_found, DEVICE_FOUND_UDEV)) {
Copy link
Copy Markdown
Contributor

@mwilck mwilck May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

After looking at the dm-crypt example, I have to say this statement of mine was wrong. dm-crypt devices apparently need to be in "plugged" state early on. We can't wait for udev coldplug to establish this state, because cryptsetup@,service may need to be activated before udev after switching root, and that wouldn't happen if the device was only in tentative state.

@yuwata yuwata force-pushed the core-device-switching-root-persistent-database branch from 588ca27 to 1ec046e Compare May 19, 2022 19:29
@mwilck
Copy link
Copy Markdown
Contributor

mwilck commented May 20, 2022

Whatever we eventually do here, can we please separate the MANAGER_IS_SWITCHING_ROOT change from the bug fix?

@mwilck
Copy link
Copy Markdown
Contributor

mwilck commented May 20, 2022

TL;DR: IMO this is no regression. The dead→plugged debug message has just been suppressed before, and isn't anymore after #23218.

Long story

@fbuihuu, this patch simply moves the spurious "dead→plugged" transition from device_catchup() to device_coldplug(). I think your test observation is a deception. My suspicion is that you were tricked (like myself this morning) by the fact that debug messages are likely to be suppressed in coldplug state after switching root. IOW, you saw the "dead→plugged" message in catchup(), but you didn't see it in coldplug().

During the coldplug stage, journald isn't running. That means that log messages are sent to /dev/kmsg (to be copied back into the journal by journald later). Rate limiting is applied for /dev/kmsg both on the kernel level and on the systemd level. I can override the kernel rate limiting with printk.devkmsg=on log_buf_size=4M, but for systemd, I had to patch write_to_kmsg(). After doing that, I'd see the messages from device_coldplug() that I didn't see before. The same doesn't hold for device_catchup(), which is called significantly later, after various services including journald:

[    6.868156] hellboy systemd[1]: Switching root.
[    7.297649] hellboy systemd[1]: Invoking unit coldplug() handlers…
[   12.113406] hellboy systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[   12.116364] hellboy systemd[1]: Reached target Local Encrypted Volumes.
[   12.120129] hellboy systemd[1]: Reached target Remote File Systems.
[   12.121027] hellboy systemd[1]: Reached target Slice Units.
[   12.121876] hellboy systemd[1]: Reached target Local Verity Protected Volumes.
[   12.140989] hellboy systemd[1]: Starting Create List of Static Device Nodes...
[   12.142845] hellboy systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[   12.146504] hellboy systemd[1]: Starting Load Kernel Module configfs...
[   12.148307] hellboy systemd[1]: Starting Load Kernel Module drm...
[   12.150010] hellboy systemd[1]: Starting Load Kernel Module fuse...
[   12.155541] hellboy systemd[1]: Starting Journal Service...
[   12.157488] hellboy systemd[1]: Starting Load Kernel Modules...
[   12.159231] hellboy systemd[1]: Starting Remount Root and Kernel File Systems...
[   12.161172] hellboy systemd[1]: Starting Coldplug All udev Devices...
[   22.242575] hellboy systemd[1]: Reached target Swaps.
[   22.143867] hellboy systemd[1]: Invoking unit catchup() handlers…
[   22.301878] hellboy systemd[1]: Started Journal Service.

Now I see the following with the code from #23218:

[    4.551797] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
...
[    6.868156] hellboy systemd[1]: Switching root.
[    7.257259] hellboy systemd[1]: Deserializing state...
[    7.297649] hellboy systemd[1]: Invoking unit coldplug() handlers…
[   22.143867] hellboy systemd[1]: Invoking unit catchup() handlers…
[   22.175241] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged

With a5dedd6 applied, it becomes:

[    4.558180] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
...
[    6.851002] hellboy systemd[1]: Switching root.
[    7.269579] hellboy systemd[1]: Deserializing state...
[    7.309934] hellboy systemd[1]: Invoking unit coldplug() handlers…
[    7.310348] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
[   18.828799] hellboy systemd[1]: Invoking unit catchup() handlers…

You can see that the dead->plugged is printed in device_catchup() with the patch, and in device_coldplug() with the patch.

@yuwata
Copy link
Copy Markdown
Member Author

yuwata commented May 20, 2022

@mwilck Thank you for the investigation. Still I think this PR is OK, but let's postpone to v252.

@yuwata yuwata removed this from the v251 milestone May 20, 2022
@yuwata
Copy link
Copy Markdown
Member Author

yuwata commented May 20, 2022

@mwilck
#23218 (comment)

@yuwata. recall this discussion we had previously? If we removed the lines

+                if (found == DEVICE_NOT_FOUND)
+                        state = DEVICE_DEAD; /* If nobody sees the device, downgrade more */

from 75d7b59, we'd still see transition messages but only "tentative→plugged", not the confusing "dead→plugged". I don't think this can be completely avoided. We apply the result from enumeration after the result from deserialization on purpose.

#23437 (comment)

I still think DEVICE_TENTATIVE is safer. It doesn't do harm, because the actual state will be resolved sooner or later either way, and it avoids premature activation.

I hope the last commit in this PR addresses that. WDYT?

@mwilck
Copy link
Copy Markdown
Contributor

mwilck commented May 20, 2022

I hope the last commit in this PR addresses that. WDYT?

I guess so, but I haven't taken a closer look. I've tested these patches one-by-one.
One thing that I don't like so much about this PR is that it mixes enumerated_found into the device_coldplug() logic. Previouly, device_coldplug() would handle deserialized_found and either device_enumerate or device_catchup() would handle enumerated_found. Now, enumerated_found is part of the logic of both coldplug and catchup.

I've been wondering if it'd be possible (in the switching-root case at least) to postpone the call to device_set_state() , and perhaps the entire state calculation logic, to device_catchup(). Why do we need to call device_set_state() twice?

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 23, 2022

You can see that the dead->plugged is printed in device_catchup() with the patch, and in device_coldplug() with the patch.

I see, thanks for the investigation, I've been fooled by the rate limiting in write_to_kmsg() :(.

Then issue #23429 is not a regression.

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 23, 2022

I've been wondering if it'd be possible (in the switching-root case at least) to postpone the call to device_set_state() , and perhaps the entire state calculation logic, to device_catchup(). Why do we need to call device_set_state() twice?

This is basically what the last commit of #23436 is all about.

@mwilck
Copy link
Copy Markdown
Contributor

mwilck commented May 24, 2022

Then issue #23429 is not a regression.

Well unfortunately it seems to be a regression for systems using dm-crypt for the root FS. Another example for the general theorem that whatever we do in this area, we overlook something 😭

I also have to revoke my statement above that "tentative" is always better than "plugged". I still think that we have a conflict between two principles here, as stated above, but at least the way things are set up today, we have to carry over "plugged" state from the initrd for db_persist devices. I can think about other approaches for dm-crypt, but let's not overcomplicate matters at this stage.

fbuihuu and others added 4 commits May 25, 2022 13:34
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.
…as a swap

Otherwise, the device state will changed as plugged -> dead -> tentative,
and the corresponding .mount or .swap units will be stopped.

Fixes systemd#23429 (comment).
@yuwata yuwata force-pushed the core-device-switching-root-persistent-database branch from 1ec046e to fb4a249 Compare May 25, 2022 04:41
yuwata added a commit to mwilck/systemd that referenced this pull request May 25, 2022
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]
mrc0mmand pushed a commit to mrc0mmand/systemd that referenced this pull request May 26, 2022
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]
mrc0mmand pushed a commit to mrc0mmand/systemd that referenced this pull request May 26, 2022
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]
mrc0mmand pushed a commit to mrc0mmand/systemd that referenced this pull request May 26, 2022
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]
@yuwata
Copy link
Copy Markdown
Member Author

yuwata commented May 26, 2022

Superseded by #23517. I will pick the cleanup commits by @fbuihuu in another PR. Closing.

@yuwata yuwata closed this May 26, 2022
@yuwata yuwata deleted the core-device-switching-root-persistent-database branch May 26, 2022 18:27
tewarid pushed a commit to tewarid/systemd that referenced this pull request Aug 23, 2022
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]

(cherry picked from commit 4fc69e8)
bluca pushed a commit to bluca/systemd that referenced this pull request Jan 27, 2023
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]

(cherry picked from commit 4fc69e8)
(cherry picked from commit 131206d)
Werkov pushed a commit to Werkov/systemd that referenced this pull request Nov 1, 2023
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]

(cherry picked from commit 4fc69e8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

PR#23218 introduced a regression for devices with db_persist option

4 participants