core/device: fix state transition on switching root with persistent database by yuwata · Pull Request #23437 · systemd/systemd

yuwata · 2022-05-19T12:29:43Z

yuwata · 2022-05-19T12:30:21Z

cc @mwilck and @fbuihuu. Also, @bluca and @keszybz. PTAL.

fbuihuu · 2022-05-19T13:00:56Z

src/core/device.c

         *   device_catchup() or uevents. */

-        if (MANAGER_IS_SWITCHING_ROOT(m)) {
+        if (MANAGER_IS_SWITCHING_ROOT(m) && !FLAGS_SET(d->enumerated_found, DEVICE_FOUND_UDEV)) {


It looks similar to the condition I use in #23436, which fixes #23429 too, so I guess this should do the trick.

if you are able to reproduce the issue, can you confirm whether this fixes it for you?

yep I'll give it a test.

I'm not seeing these "dead->plugged" transitions anymore right after switching root for devices with the "db_persist" option set.

I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

After looking at the dm-crypt example, I have to say this statement of mine was wrong. dm-crypt devices apparently need to be in "plugged" state early on. We can't wait for udev coldplug to establish this state, because cryptsetup@,service may need to be activated before udev after switching root, and that wouldn't happen if the device was only in tentative state.

Well if db_persist is used it's clearly an indication that the device must remain plugged, see the doc. If SYSTEMD_READY=0 is used in the host then it seems actually expected that the device goes through the plugged->dead transition when the new set of rules from the host is processed by udev.

That's one possible intepretation of the text in udev(7). But you're just inventing it :-)

db_persist dates back to 2011, before the systemd/udev merge. The man page paragraph was added much later. I don't think the relationship of db_persist and systemd's state attribute has ever been clearly defined.

db_persist was created in the past for LVM, because LVM uses "change" events for device activation rather than "add" events, and needs special treatment for coldplug "add" events for devices that are already active (see 10-dm.rules). LVM doesn't care about systemd's state, so as far as LVM is concerned, it doesn't matter whether we translate db_persist to DEVICE_READY or DEVICE_TENTATIVE.

I still think DEVICE_TENTATIVE is safer. It doesn't do harm, because the actual state will be resolved sooner or later either way, and it avoids premature activation.

I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

@mwilck The host PID1 has used and honored the persistent database generated in initrd at least more than two years. Why do you want to change the behavior? Of course, we have not documented the behavior, but IIRC, there is no issue related to that.
As reported in #23429, after PR #23218, the device with persistent database has unnecessary state transition. It may be harmless in most cases, but such state transition did not occur previously. If we can avoid that, why not to do that?

Well if db_persist is used it's clearly an indication that the device must remain plugged, see the doc. If SYSTEMD_READY=0 is used in the host then it seems actually expected that the device goes through the plugged->dead transition when the new set of rules from the host is processed by udev.

@fbuihuu I agree with you. Even if the behavior is not documented, we have done so for a while. There is no reason to change it.

@mwilck The host PID1 has used and honored the persistent database generated in initrd at least more than two years. Why do you want to change the behavior?

@yuwata, you're right that as a matter of fact, systemd kept DEVICE_READY state over switching root. It did this for all devices though, which was wrong, and was fixed in #23218. #23218 did indeed "change the behavior". I hope we agree it did so for the better.

The open question is how to deal with db_persist. We need to realize that we have two conflicting semantic rules:

udev rules are allowed to be different between initrd and root fs, and thus need to be re-parsed and re-executed after switching root before switching device states and forwarding them to dependent units.

db_persist devices keep state over switching root.

@fbuihuu and you think that rule 2) defines an exception to rule 1). I disagree. 1) is a general principle. Nothing prevents udev rules from changing between initrd and root FS. Thus if we run device_set_state(DEVICE_READY) on these devices, we'd open up new ways for the race described in #23208 to occur, which may cause boot failure.

db_persist is only used by dm and md. After switching root, the udev rules for these subsystems need to distinguish between "activation" events and other events, which requires certain udev properties to survive the cleanup (see e.g. this dracut commit). This distinction is made in udev rules and relies 100% on properties in the udev db. It has nothing to do with systemd's device activation. Thus for the current consumers of db_persist, it doesn't matter if we set the state to "tentative" rather than "plugged" here.

It does matter for the further processing of dependent devices, though. The processing of dm and md devices during boot relies heavily on (udev) coldplug. It's not in the interest of these subsystems to have dependent units activated before the (udev) coldplug events have been processed. Strictly speaking, after switching root, we should never call device_set_state(DEVICE_READY), except from device_dispatch_io().

However, I realize that this hunk is not the point. With this hunk, we'll set DEVICE_READY in device_coldplug() for db_persist devices, whereas without it, we set it in device_catchup(). If my argument above is correct, both calls would be premature. @fbuihuu, with this observation I don't understand how this patch fixes your issue. The extra "dead→plugged" transition should still observed, just a little bit earlier.

mwilck · 2022-05-19T15:22:22Z

src/core/device.c

         *   device_catchup() or uevents. */

-        if (MANAGER_IS_SWITCHING_ROOT(m)) {
+        if (MANAGER_IS_SWITCHING_ROOT(m) && !FLAGS_SET(d->enumerated_found, DEVICE_FOUND_UDEV)) {


I think this is wrong. We must reset the state to DEVICE_TENTATIVE to prevent premature device activation in the switching root case. The argument is the same as in #23208. Even with db_persist, the udev rules may have changed while switching root, and a device which was "plugged" in the initramfs may not be any more in the mounted root (SYSTEMD_READY="0" may be set on it). IMO DEVICE_READY should never be set after switching root until an actual "add" uevent for the device has been processed.

After looking at the dm-crypt example, I have to say this statement of mine was wrong. dm-crypt devices apparently need to be in "plugged" state early on. We can't wait for udev coldplug to establish this state, because cryptsetup@,service may need to be activated before udev after switching root, and that wouldn't happen if the device was only in tentative state.

src/core/manager-serialize.c

mwilck · 2022-05-20T07:45:26Z

Whatever we eventually do here, can we please separate the MANAGER_IS_SWITCHING_ROOT change from the bug fix?

mwilck · 2022-05-20T16:08:06Z

TL;DR: IMO this is no regression. The dead→plugged debug message has just been suppressed before, and isn't anymore after #23218.

Long story

@fbuihuu, this patch simply moves the spurious "dead→plugged" transition from device_catchup() to device_coldplug(). I think your test observation is a deception. My suspicion is that you were tricked (like myself this morning) by the fact that debug messages are likely to be suppressed in coldplug state after switching root. IOW, you saw the "dead→plugged" message in catchup(), but you didn't see it in coldplug().

During the coldplug stage, journald isn't running. That means that log messages are sent to /dev/kmsg (to be copied back into the journal by journald later). Rate limiting is applied for /dev/kmsg both on the kernel level and on the systemd level. I can override the kernel rate limiting with printk.devkmsg=on log_buf_size=4M, but for systemd, I had to patch write_to_kmsg(). After doing that, I'd see the messages from device_coldplug() that I didn't see before. The same doesn't hold for device_catchup(), which is called significantly later, after various services including journald:

[    6.868156] hellboy systemd[1]: Switching root.
[    7.297649] hellboy systemd[1]: Invoking unit coldplug() handlers…
[   12.113406] hellboy systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[   12.116364] hellboy systemd[1]: Reached target Local Encrypted Volumes.
[   12.120129] hellboy systemd[1]: Reached target Remote File Systems.
[   12.121027] hellboy systemd[1]: Reached target Slice Units.
[   12.121876] hellboy systemd[1]: Reached target Local Verity Protected Volumes.
[   12.140989] hellboy systemd[1]: Starting Create List of Static Device Nodes...
[   12.142845] hellboy systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[   12.146504] hellboy systemd[1]: Starting Load Kernel Module configfs...
[   12.148307] hellboy systemd[1]: Starting Load Kernel Module drm...
[   12.150010] hellboy systemd[1]: Starting Load Kernel Module fuse...
[   12.155541] hellboy systemd[1]: Starting Journal Service...
[   12.157488] hellboy systemd[1]: Starting Load Kernel Modules...
[   12.159231] hellboy systemd[1]: Starting Remount Root and Kernel File Systems...
[   12.161172] hellboy systemd[1]: Starting Coldplug All udev Devices...
[   22.242575] hellboy systemd[1]: Reached target Swaps.
[   22.143867] hellboy systemd[1]: Invoking unit catchup() handlers…
[   22.301878] hellboy systemd[1]: Started Journal Service.

Now I see the following with the code from #23218:

[    4.551797] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
...
[    6.868156] hellboy systemd[1]: Switching root.
[    7.257259] hellboy systemd[1]: Deserializing state...
[    7.297649] hellboy systemd[1]: Invoking unit coldplug() handlers…
[   22.143867] hellboy systemd[1]: Invoking unit catchup() handlers…
[   22.175241] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged

With a5dedd6 applied, it becomes:

[    4.558180] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
...
[    6.851002] hellboy systemd[1]: Switching root.
[    7.269579] hellboy systemd[1]: Deserializing state...
[    7.309934] hellboy systemd[1]: Invoking unit coldplug() handlers…
[    7.310348] hellboy systemd[1]: sys-devices-virtual-block-dm\x2d0.device: Changed dead -> plugged
[   18.828799] hellboy systemd[1]: Invoking unit catchup() handlers…

You can see that the dead->plugged is printed in device_catchup() with the patch, and in device_coldplug() with the patch.

yuwata · 2022-05-20T18:12:12Z

@mwilck Thank you for the investigation. Still I think this PR is OK, but let's postpone to v252.

yuwata · 2022-05-20T18:38:43Z

@mwilck
#23218 (comment)

@yuwata. recall this discussion we had previously? If we removed the lines
+                if (found == DEVICE_NOT_FOUND)
+                        state = DEVICE_DEAD; /* If nobody sees the device, downgrade more */
from 75d7b59, we'd still see transition messages but only "tentative→plugged", not the confusing "dead→plugged". I don't think this can be completely avoided. We apply the result from enumeration after the result from deserialization on purpose.

#23437 (comment)

I still think DEVICE_TENTATIVE is safer. It doesn't do harm, because the actual state will be resolved sooner or later either way, and it avoids premature activation.

I hope the last commit in this PR addresses that. WDYT?

mwilck · 2022-05-20T20:49:09Z

I hope the last commit in this PR addresses that. WDYT?

I guess so, but I haven't taken a closer look. I've tested these patches one-by-one.
One thing that I don't like so much about this PR is that it mixes enumerated_found into the device_coldplug() logic. Previouly, device_coldplug() would handle deserialized_found and either device_enumerate or device_catchup() would handle enumerated_found. Now, enumerated_found is part of the logic of both coldplug and catchup.

I've been wondering if it'd be possible (in the switching-root case at least) to postpone the call to device_set_state() , and perhaps the entire state calculation logic, to device_catchup(). Why do we need to call device_set_state() twice?

fbuihuu · 2022-05-23T09:43:25Z

You can see that the dead->plugged is printed in device_catchup() with the patch, and in device_coldplug() with the patch.

I see, thanks for the investigation, I've been fooled by the rate limiting in write_to_kmsg() :(.

Then issue #23429 is not a regression.

fbuihuu · 2022-05-23T09:45:25Z

I've been wondering if it'd be possible (in the switching-root case at least) to postpone the call to device_set_state() , and perhaps the entire state calculation logic, to device_catchup(). Why do we need to call device_set_state() twice?

This is basically what the last commit of #23436 is all about.

mwilck · 2022-05-24T09:50:02Z

Then issue #23429 is not a regression.

Well unfortunately it seems to be a regression for systems using dm-crypt for the root FS. Another example for the general theorem that whatever we do in this area, we overlook something 😭

I also have to revoke my statement above that "tentative" is always better than "plugged". I still think that we have a conflict between two principles here, as stated above, but at least the way things are set up today, we have to carry over "plugged" state from the initrd for db_persist devices. I can think about other approaches for dm-crypt, but let's not overcomplicate matters at this stage.

Will be used by the following commit.

…OOT()

On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429.

…as a swap Otherwise, the device state will changed as plugged -> dead -> tentative, and the corresponding .mount or .swap units will be stopped. Fixes systemd#23429 (comment).

On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429. [mwilck: cherry-picked from systemd#23437]

yuwata · 2022-05-26T18:27:46Z

Superseded by #23517. I will pick the cleanup commits by @fbuihuu in another PR. Closing.

On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429. [mwilck: cherry-picked from systemd#23437] (cherry picked from commit 4fc69e8)

On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429. [mwilck: cherry-picked from systemd#23437] (cherry picked from commit 4fc69e8) (cherry picked from commit 131206d)

On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429. [mwilck: cherry-picked from systemd#23437] (cherry picked from commit 4fc69e8)

yuwata added the pid1 label May 19, 2022

yuwata added this to the v251 milestone May 19, 2022

fbuihuu reviewed May 19, 2022

View reviewed changes

mwilck suggested changes May 19, 2022

View reviewed changes

yuwata force-pushed the core-device-switching-root-persistent-database branch from 588ca27 to 1ec046e Compare May 19, 2022 19:29

mwilck reviewed May 20, 2022

View reviewed changes

src/core/manager-serialize.c Show resolved Hide resolved

mwilck mentioned this pull request May 20, 2022

PR#23218 introduced a regression for devices with db_persist option #23429

Closed

yuwata removed this from the v251 milestone May 20, 2022

mwilck mentioned this pull request May 24, 2022

Attempted fix for regression with LUKS devices introduced by PR#23218 #23489

Closed

fbuihuu and others added 4 commits May 25, 2022 13:34

core: introduce MANAGER_IS_SWITCHING_ROOT() helper function

e585555

Will be used by the following commit.

core: replace m->honor_device_enumeration with MANAGER_IS_SWITCHING_R…

6bd698a

…OOT()

core/device: do not downgrade to dead state if it is mounted or used …

fb4a249

…as a swap Otherwise, the device state will changed as plugged -> dead -> tentative, and the corresponding .mount or .swap units will be stopped. Fixes systemd#23429 (comment).

yuwata force-pushed the core-device-switching-root-persistent-database branch from 1ec046e to fb4a249 Compare May 25, 2022 04:41

yuwata mentioned this pull request May 25, 2022

core/device: honor deserialized state and check device existence on switching-root #23507

Closed

mwilck mentioned this pull request May 25, 2022

Make it possible to override rate limiting in write_to_kmsg() #23515

Closed

yuwata closed this May 26, 2022

yuwata deleted the core-device-switching-root-persistent-database branch May 26, 2022 18:27

Uh oh!

Conversation

yuwata commented May 19, 2022

Uh oh!

yuwata commented May 19, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mwilck May 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mwilck May 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mwilck commented May 20, 2022

Uh oh!

mwilck commented May 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Long story

Uh oh!

yuwata commented May 20, 2022

Uh oh!

yuwata commented May 20, 2022

Uh oh!

mwilck commented May 20, 2022

Uh oh!

fbuihuu commented May 23, 2022

Uh oh!

fbuihuu commented May 23, 2022

Uh oh!

mwilck commented May 24, 2022

Uh oh!

yuwata commented May 26, 2022

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

4 participants

mwilck May 19, 2022 •

edited

Loading

mwilck May 19, 2022 •

edited

Loading

mwilck commented May 20, 2022 •

edited

Loading