Boot to emergency mode if devices are switched to SYSTEMD_READY=0 (multipath scenario) #23208

@mwilck

systemd version the issue has been seen with

All versions since (at least) 246

Used distribution

openSUSE Tumbleweed, openSUSE Leap

Linux kernel version used (uname -a)

5.14, 5.17 (any)

CPU architecture issue was seen on

x86_64, s390x (any)

Expected behaviour you didn't see

System should boot cleanly

Unexpected behaviour you saw

System ends up in emergency mode with messages like

[    5.741385] m8315021 systemd-fsck[744]: /dev/sdd1 is in use.
[    5.743787] m8315021 mount[745]: mount: /mnt: /dev/sdd1 already mounted or mount point busy.

Steps to reproduce the problem

Prerequisites:

  • system with multipath hardware or (at least) multipathd.service enabled / started
  • system boots from normal (non-multipath) disk
  • multipath module not included in initial ramdisk
  • low-level (SCSI) drivers for the multipathed disks are included in the initial ramdisk (a common situation, because the "hostonly" mode in recent dracut versions defaults to including all SCSI modules that were loaded when dracut built the initrd).

With these prerequisites, the problem is still timing-dependent, but not difficult to reproduce.

Additional program output to the terminal or log subsystem illustrating the issue

In the case at hand, we look at a SCSI multipath device. The device to be mounted is /dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459, which is on a device-mapper multipath device. The low-level SCSI devices with this UUID are /dev/sdd1 and /dev/sdh1. The desired mount point is /mnt. The following is a journalctl excerpt.

During initramfs processing, udev sees the SCSI devices and systemd considers them "plugged". Note that multipathd.service is disabled in the initramfs, thus SYSTEMD_READY=0 is not set on these devices.

[    3.332761] m8315021 systemd-udevd[300]: sdd1: Creating symlink '/dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459' to '../../sdd1'
[    3.338510] m8315021 systemd[1]: dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.device: Changed dead -> plugged
[    3.376903] m8315021 systemd-udevd[306]: sdh1: Atomically replace '/dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459'
[    3.380014] m8315021 systemd[1]: dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.device: Device dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.device appeared twice with different sysfs paths /sys/devices/css0/0.0.0004/0.0.1915/host1/rport-1:0-0/target1:0:0/1:0:0:1076052021/block/sdd/sdd1 and /sys/devices/css0/0.0.0005/0.0.1955/host0/rport-0:0-0/target0:0:0/0:0:0:1076052021/block/sdh/sdh1, ignoring the latter.

Side note: the above shows an inconsistency between udev's and systemd's handling of two different devices that have the same alias. While udev replaces the by-uuid symlink, which now points to sdh1 rather than sdd1, systemd keeps the previous mapping to sdd1 and emits a warning. This is not the cause of the problem, but it is worth mentioning.
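
To illustrate the side note, here is a toy model of the two resolution policies as they appear in the log excerpt. It is not systemd or udev code: the class and method names are hypothetical, and real udev also takes symlink priorities into account, which the model ignores.

```python
# Toy model of how one alias claimed by two devices is resolved.
# Hypothetical names; this only reproduces what the log excerpt shows.

class UdevSymlinks:
    """The later claimant replaces the symlink (as seen in the log)."""

    def __init__(self):
        self.links = {}

    def claim(self, alias, device):
        self.links[alias] = device


class SystemdDeviceUnits:
    """The first claimant is kept; a different later one is ignored with a warning."""

    def __init__(self):
        self.units = {}
        self.warnings = []

    def claim(self, alias, device):
        known = self.units.setdefault(alias, device)
        if known != device:
            self.warnings.append(f"{alias} appeared twice, ignoring {device}")


alias = "/dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459"
udev, pid1 = UdevSymlinks(), SystemdDeviceUnits()
for dev in ("sdd1", "sdh1"):  # order of the uevents in the log
    udev.claim(alias, dev)
    pid1.claim(alias, dev)

print(udev.links[alias])  # sdh1
print(pid1.units[alias])  # sdd1
```

After both events, the two views of the same alias disagree, which is the inconsistency noted above.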

[    3.690821] m8315021 systemd[1]: Switching root.

systemd reads the deserialized "state" and "found" attributes for this device, which are "plugged" and DEVICE_FOUND_UDEV, respectively. Device enumeration doesn't find the device in the udev db, which has been cleared. The device is also not mounted, so it isn't found via a mount point, either.

After switching root, multipathd.service is active, and thus the SCSI devices now get SYSTEMD_READY=0 set, as indicated by the messages below. Note that the partitions sdd1 and sdh1 inherit the DM_MULTIPATH_DEVICE_PATH=1 property from sdd and sdh, respectively, and that DM_MULTIPATH_DEVICE_PATH=1 implies SYSTEMD_READY=0.

[    5.063229] m8315021 systemd-udevd[526]: sdh: '/sbin/multipath -u sdh'(out) 'DM_MULTIPATH_DEVICE_PATH="1"'
[    5.083264] m8315021 systemd-udevd[532]: sdd: '/sbin/multipath -u sdd'(out) 'DM_MULTIPATH_DEVICE_PATH="1"'
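
The property flow described above can be sketched as follows. This is a simplified model, not the actual udev rules: both function names are hypothetical, and the boolean argument merely stands in for the result of `multipath -u`.

```python
# Simplified model of how SYSTEMD_READY=0 reaches the partitions.
# Hypothetical helpers; the real logic lives in udev rules.

def disk_properties(is_multipath_path):
    """Properties of a whole disk such as sdd or sdh."""
    props = {}
    if is_multipath_path:  # stands in for the output of `multipath -u <dev>`
        props["DM_MULTIPATH_DEVICE_PATH"] = "1"
        props["SYSTEMD_READY"] = "0"  # implied by the path flag
    return props


def partition_properties(parent_props):
    """Properties of a partition such as sdd1, derived from its parent disk."""
    props = {}
    if parent_props.get("DM_MULTIPATH_DEVICE_PATH") == "1":
        props["DM_MULTIPATH_DEVICE_PATH"] = "1"  # inherited from the disk
        props["SYSTEMD_READY"] = "0"
    return props


# Initramfs: multipath is not consulted, so nothing marks the device as not ready.
print(partition_properties(disk_properties(False)))
# After switch root: the partition is marked as not ready.
print(partition_properties(disk_properties(True)))
```

The first call corresponds to the initramfs situation, in which the absence of SYSTEMD_READY=0 lets systemd treat sdd1 as usable.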

However, device_dispatch_io() does nothing for ADD events with devices that aren't ready. In particular, the plugged state, which had been deserialized, is not reset. Hence, as soon as local-fs-pre.target is reached, systemd tries to fsck and mount the device. But in the meantime, multipathd has seen the devices too and created a multipath map from them, causing the fsck and mount to fail:

[    5.626369] m8315021 multipathd[707]: 3600507630bffc3200000000000003523: addmap [0 41943040 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 2 1 8:48 1 8:112 1]
[    5.728621] m8315021 systemd[740]: systemd-fsck@dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.service: Executing: /usr/lib/systemd/systemd-fsck /dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459
[    5.741385] m8315021 systemd-fsck[744]: /dev/sdd1 is in use.
[    5.741385] m8315021 systemd-fsck[744]: e2fsck: Cannot continue, aborting.
[    5.742124] m8315021 systemd-fsck[740]: fsck failed with exit status 8.
[    5.742195] m8315021 systemd-fsck[740]: Ignoring error.
[    5.742379] m8315021 systemd[1]: systemd-fsck@dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.service: Job 164 systemd-fsck@dev-disk-by\x2duuid-56c5fcd8\x2d35ad\x2d48f3\x2d8b9c\x2d8fc7a89c5459.service/start finished, result=done
[    5.743046] m8315021 systemd[745]: mnt.mount: Executing: /usr/bin/mount /dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459 /mnt -t ext2 -o acl,user_xattr,noatime
[    5.743787] m8315021 mount[745]: mount: /mnt: /dev/sdd1 already mounted or mount point busy.
[    5.744679] m8315021 systemd[1]: mnt.mount: Mount process exited, code=exited, status=32/n/a
[    5.746814] m8315021 systemd[1]: mnt.mount: Unit entered failed state.

Thus emergency mode is entered. Shortly afterwards, the uevent that re-creates the by-uuid device (now pointing to the correct device-mapper device) arrives, but this has no effect on systemd, because it already considered this by-uuid device "plugged".

[    5.763330] m8315021 systemd-udevd[544]: dm-4: Atomically replace '/dev/disk/by-uuid/56c5fcd8-35ad-48f3-8b9c-8fc7a89c5459'
[    5.803581] m8315021 systemd[1]: dev-disk-by\x2did-dm\x2duuid\x2dpart1\x2dmpath\x2d3600507630bffc3200000000000003523.device: Changed dead -> plugged
[    5.803658] m8315021 systemd[1]: dev-disk-by\x2did-dm\x2dname\x2d3600507630bffc3200000000000003523\x2dpart1.device: Changed dead -> plugged
[    5.803888] m8315021 systemd[1]: dev-dm\x2d4.device: Changed dead -> plugged

(Note that various device aliases reach the "plugged" state here, but the by-uuid alias is not among them.)
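
The sequence described in this section can be condensed into a small state model. It is only a sketch of the behaviour as reported here, not the logic of device_dispatch_io() itself. The DeviceUnit class is hypothetical, and only "add" and "remove" events are modelled.

```python
# Toy model of the stale "plugged" state across switch root.
# DeviceUnit is hypothetical and covers only what the report describes.
from dataclasses import dataclass


@dataclass
class DeviceUnit:
    state: str = "dead"

    def on_uevent(self, action, props):
        ready = props.get("SYSTEMD_READY", "1") != "0"
        if action == "remove":
            self.state = "dead"
        elif action == "add" and ready:
            self.state = "plugged"
        # "add" for a device that is not ready: ignored, state left untouched


# 1. Initramfs: sdd1 shows up without SYSTEMD_READY=0.
by_uuid = DeviceUnit()
by_uuid.on_uevent("add", {})

# 2. Switch root: the state is serialized and deserialized unchanged.
by_uuid = DeviceUnit(state=by_uuid.state)

# 3. Coldplug in the real root: the same device now carries SYSTEMD_READY=0.
by_uuid.on_uevent("add", {"SYSTEMD_READY": "0"})

# 4. The unit still looks usable, so fsck and mount are started on sdd1.
print(by_uuid.state)  # plugged
```

Step 3 is where the report expects the state to change, and step 4 is the point at which fsck and mount run against a device that multipathd has already claimed.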

Suggested solution (tentative)

This problem is loosely related to #12953, even though it looks like the opposite problem (while in #12953 we want to keep the "plugged" state, we want to drop it / replace it with "dead" here). The difference is that #12953 occurred after a "simple" reload operation, whereas this occurs while switching root.

When we switch root, we know that udev will re-run all uevents during udev coldplug (systemd-udev-trigger.service). They have to be re-run because rules and thus device properties may have changed. In the problem case, even the crucial SYSTEMD_READY property changes. We know from the discussion in #12953 and its predecessors (#8675, #8832, #11997) that switching a "plugged" device back to "dead" is potentially dangerous, especially if the device in question might be already mounted or otherwise used. A possible solution has to avoid this. My proposed solution goes on top of the fix I proposed in #12953 (comment).

The idea is that if we serialize/deserialize device state and found flags while switching root, we shouldn't set the "plugged" state or the DEVICE_FOUND_UDEV flag. The udev db entries are gone when systemd reexecutes. Unused devices can just be "forgotten"; they'll be rediscovered by udev. Devices that are mounted should keep their DEVICE_FOUND_MOUNT or DEVICE_FOUND_SWAP flags, but their state should be reset from DEVICE_PLUGGED to DEVICE_TENTATIVE (the same state that systemd uses in other similar situations when a device is encountered that seems to be in use, but is not in the udev db). DEVICE_TENTATIVE has the effect that systemd will wait for these devices to switch to DEVICE_PLUGGED before attempting to mount them. The switch to DEVICE_PLUGGED will happen when the dm device created by multipathd is first seen by systemd.

In order to detect the "switching root" state, I use the already existing honor_device_enumeration flag, which is cleared during deserialization if and only if systemd executes a switch root operation.
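
A minimal sketch of the proposed deserialization rule follows. It is an illustration of the idea in Python, not the code on the issue23208 branch. The Found enum and the deserialize() function are hypothetical stand-ins for the DEVICE_FOUND_* flags and the device deserialization step, and mapping a "forgotten" device to "dead" is an assumption of the model.

```python
# Sketch of the proposed rule; hypothetical names, not systemd code.
from enum import Flag, auto


class Found(Flag):
    NONE = 0
    UDEV = auto()
    MOUNT = auto()
    SWAP = auto()


def deserialize(state, found, honor_device_enumeration):
    """Return the (state, found) pair a device unit starts with after reexec."""
    if honor_device_enumeration:
        # Plain reload/reexec: keep what was serialized.
        return state, found
    # Switch root: the udev db is gone, so the UDEV flag is dropped.
    kept = found & (Found.MOUNT | Found.SWAP)
    if kept:
        # In use but unknown to udev: wait until udev reports the device.
        return "tentative", kept
    # Unused: forget the device; udev coldplug will rediscover it.
    return "dead", kept


print(deserialize("plugged", Found.UDEV, False))
print(deserialize("plugged", Found.UDEV | Found.MOUNT, False))
print(deserialize("plugged", Found.UDEV, True))
```

In the scenario from this report, the by-uuid device is unused at switch root, so it would start out as "dead" and only become "plugged" once the dm device appears.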

The code can be inspected on my issue23208 branch.

Labels: bug, pid1, udev