Attempted fix for regression with LUKS devices introduced by PR#23218#23489
Attempted fix for regression with LUKS devices introduced by PR#23218#23489mwilck wants to merge 4 commits intosystemd:mainfrom
Conversation
|
well, if there's a regression slipped in v251, we need to fix it with some urgency |
|
I think we should try to reproduce the prob instead based on the description given in #23429 (comment) |
I proposed a minimal fix. That's my understanding how a regression should be addressed. Both other PRs contain additional unrelated changes. I wrote "hopefully" because the fix hadn't been verified, and I didn't want to boast of something I wasn't 100% certain of.
I'm unsure what you mean. I don't deny that some of my statements were wrong. But I don't think I'm the only one. At least I have tried to analyze the issues in depth. I haven't realized the LUKS issue. But nobody else did, either. |
|
Sorry I wasn't explained myself well enough. We all made some mistakes because it's hard to foresee the consequences of any changes in this part of the code even if we put a lot of thoughts. But I think my suggestion to reproduce the regression first and then verify that the fix actually is fixing it is reasonable instead of running for a fix that we're not again 100% sure. #23437 is already an attempt to fix this problem and has the same change that this PR is introducing. Indeed it has 2 small additional commits but those should be considered right now anyway as they're likely to be committed soon or later. We would avoid testing this stuff twice. I'm re-opening this PR and let @yuwata decides if he prefers dealing with this PR rather than his. |
|
My thinking was that if this PR actually fixes the issue(s) reported, it might be easier to merge this minimal PR than #23437. But I'm not familiar with how this is handled in systemd. If #23437 can be merged, I have no objections. Most importantly, we need tests by the affected users. @tanriol, as a Gentoo user, may be able to test this rather easily. @keszybz seems to have ways to distribute fixes to Fedora users quickly, too. That should cover most regressions reported so far. |
|
Sorry for late response. Will take a look. |
|
Sorry, applying this patch on top of systemd 251 did not fix the issue on my system. |
|
If you can reproduce it, can you add a test case too please to TEST-24-CRYPTSETUP? Otherwise we'll never know if another regression happens |
What I meant is I was able to set up a box with LUKS encrypted root, saw the same issue, and was able to fix it. I don't know how to transform this into a unit test. I can have a look later, but I know almost nothing about the systemd unit tests, except for the udev test which I've been working on in the past. |
|
If we can get the |
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd#23429 * systemd#23218 * systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
I've already pinged @keszybz, since we might need to revert https://src.fedoraproject.org/rpms/systemd/c/9a48377e0adb6af26f8e7a89dd0c186cc951efa8?branch=rawhide once again, as the builders are still having some issues. |
I've opened #23517 which should add some coverage for this. Feel free to cherry-pick the commit into this branch, as it should make it easier to test the changes with the test included. |
On switching root, a device may have a persistent databse. In that case, Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not necessary to downgrade the Device.deserialized_found and Device.deserialized_state. Otherwise, the state of the device unit may be changed plugged -> dead -> plugged, if the device has not been mounted. Fixes systemd#23429. [mwilck: cherry-picked from systemd#23437]
dm-crypt device units generated by systemd-cryptsetup-generator habe BindsTo= dependencies on their backend devices. The dm-crypt devices have the db_persist flag set, and thus survive the udev db cleanup while switching root. But backend devices usually don't survive. These devices are neither mounted nor used for swap, thus they will seen as DEVICE_NOT_FOUND after switching root. The BindsTo dependency will cause systemd to schedule a stop job for the dm-crypt device, breaking boot: [ 68.929457] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Unit is stopped because bound to inactive unit dev-disk-by\x2duuid-3bf91f73\x2d1ee8\x2d4cfc\x2d9048\x2d93ba349b786d.device. [ 68.945660] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Trying to enqueue job systemd-cryptsetup@cr_root.service/stop/replace [ 69.473459] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Installed new job systemd-cryptsetup@cr_root.service/stop as 343 Avoid this by not setting the state of the backend devices to DEVICE_DEAD. Fixes the LUKS setup issue reported in systemd#23429.
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd#23429 * systemd#23218 * systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
|
@mwilck Thank you for the analysis of the issue. LGTM. I cherry-picked @mrc0mmand's commit about test update. |
|
Will test this tomorrow, sorry for delay. |
|
The two main commits are in PR #23517. Closing. |
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd/systemd#23429 * systemd/systemd#23218 * systemd/systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225 (cherry picked from commit 1fb7f8e)
This should cover cases regarding devices with `OPTIONS+="db_persist"` during initrd->sysroot transition. See: * systemd/systemd#23429 * systemd/systemd#23218 * systemd/systemd#23489 * https://bugzilla.redhat.com/show_bug.cgi?id=2087225 (cherry picked from commit 1fb7f8e)
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.
This is #23437, just reduced to a minimal bug fix.
Fixes #23429 (hopefully)