Skip to content

Attempted fix for regression with LUKS devices introduced by PR#23218#23489

Closed
mwilck wants to merge 4 commits intosystemd:mainfrom
mwilck:fix-23218
Closed

Attempted fix for regression with LUKS devices introduced by PR#23218#23489
mwilck wants to merge 4 commits intosystemd:mainfrom
mwilck:fix-23218

Conversation

@mwilck
Copy link
Copy Markdown
Contributor

@mwilck mwilck commented May 24, 2022

On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

This is #23437, just reduced to a minimal bug fix.
Fixes #23429 (hopefully)

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 24, 2022

I don't see the point of this PR. We already have #23437 and PR #23436, this one is adding only more confusion.

Again there is no rush so let's discuss the regression in on of the aforementioned PRs.

@fbuihuu fbuihuu closed this May 24, 2022
@bluca
Copy link
Copy Markdown
Member

bluca commented May 24, 2022

well, if there's a regression slipped in v251, we need to fix it with some urgency

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 24, 2022

As I wrote we already have 2 PRs. If we need a quick fix, working on #23437 should be enough I think. And "Fixes #23429 (hopefully)" doesn't make me confident at all.

See comments in #23429 where this wasn't supposed to be a regression and wasn't supposed to introduce any issue.

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 24, 2022

I think we should try to reproduce the prob instead based on the description given in #23429 (comment)

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented May 24, 2022

As I wrote we already have 2 PRs. If we need a quick fix, working on #23437 should be enough I think. And "Fixes #23429 (hopefully)" doesn't make me confident at all.

I proposed a minimal fix. That's my understanding how a regression should be addressed. Both other PRs contain additional unrelated changes. I wrote "hopefully" because the fix hadn't been verified, and I didn't want to boast of something I wasn't 100% certain of.

See comments in #23429 where this wasn't supposed to be a regression and wasn't supposed to introduce any issue.

I'm unsure what you mean. I don't deny that some of my statements were wrong. But I don't think I'm the only one. At least I have tried to analyze the issues in depth. I haven't realized the LUKS issue. But nobody else did, either.

@fbuihuu
Copy link
Copy Markdown
Contributor

fbuihuu commented May 24, 2022

Sorry I wasn't explained myself well enough. We all made some mistakes because it's hard to foresee the consequences of any changes in this part of the code even if we put a lot of thoughts. But I think my suggestion to reproduce the regression first and then verify that the fix actually is fixing it is reasonable instead of running for a fix that we're not again 100% sure.

#23437 is already an attempt to fix this problem and has the same change that this PR is introducing. Indeed it has 2 small additional commits but those should be considered right now anyway as they're likely to be committed soon or later. We would avoid testing this stuff twice.

I'm re-opening this PR and let @yuwata decides if he prefers dealing with this PR rather than his.

@fbuihuu fbuihuu reopened this May 24, 2022
@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented May 24, 2022

My thinking was that if this PR actually fixes the issue(s) reported, it might be easier to merge this minimal PR than #23437. But I'm not familiar with how this is handled in systemd. If #23437 can be merged, I have no objections. Most importantly, we need tests by the affected users. @tanriol, as a Gentoo user, may be able to test this rather easily. @keszybz seems to have ways to distribute fixes to Fedora users quickly, too. That should cover most regressions reported so far.

@yuwata yuwata added the pid1 label May 24, 2022
@yuwata
Copy link
Copy Markdown
Member

yuwata commented May 24, 2022

Sorry for late response. Will take a look.

@tanriol
Copy link
Copy Markdown

tanriol commented May 24, 2022

Sorry, applying this patch on top of systemd 251 did not fix the issue on my system.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented May 25, 2022

@tanriol, I've pushed another patch which has fixed the issue for me (I was able to reproduce your problem). I'd be very grateful if you could apply 0543ebd on top and test again.

@bluca
Copy link
Copy Markdown
Member

bluca commented May 25, 2022

If you can reproduce it, can you add a test case too please to TEST-24-CRYPTSETUP? Otherwise we'll never know if another regression happens

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented May 25, 2022

If you can reproduce it, can you add a test case too please to TEST-24-CRYPTSETUP?

What I meant is I was able to set up a box with LUKS encrypted root, saw the same issue, and was able to fix it. I don't know how to transform this into a unit test. I can have a look later, but I know almost nothing about the systemd unit tests, except for the udev test which I've been working on in the past.

@dustymabe
Copy link
Copy Markdown
Contributor

If we can get the rpm-build:fedora-rawhide-x86_64 action to succeed I can take the resulting RPM and run it through our Fedora CoreOS test suite.

mrc0mmand added a commit to mrc0mmand/systemd that referenced this pull request May 25, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd#23429
  * systemd#23218
  * systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
@mrc0mmand
Copy link
Copy Markdown
Member

If we can get the rpm-build:fedora-rawhide-x86_64 action to succeed I can take the resulting RPM and run it through our Fedora CoreOS test suite.

I've already pinged @keszybz, since we might need to revert https://src.fedoraproject.org/rpms/systemd/c/9a48377e0adb6af26f8e7a89dd0c186cc951efa8?branch=rawhide once again, as the builders are still having some issues.

@mrc0mmand
Copy link
Copy Markdown
Member

mrc0mmand commented May 25, 2022

If you can reproduce it, can you add a test case too please to TEST-24-CRYPTSETUP?

What I meant is I was able to set up a box with LUKS encrypted root, saw the same issue, and was able to fix it. I don't know how to transform this into a unit test. I can have a look later, but I know almost nothing about the systemd unit tests, except for the udev test which I've been working on in the past.

I've opened #23517 which should add some coverage for this. Feel free to cherry-pick the commit into this branch, as it should make it easier to test the changes with the test included.

yuwata and others added 3 commits May 26, 2022 08:39
On switching root, a device may have a persistent databse. In that case,
Device.enumerated_found may have DEVICE_FOUND_UDEV flag, and it is not
necessary to downgrade the Device.deserialized_found and
Device.deserialized_state. Otherwise, the state of the device unit may
be changed plugged -> dead -> plugged, if the device has not been mounted.

Fixes systemd#23429.

[mwilck: cherry-picked from systemd#23437]
dm-crypt device units generated by systemd-cryptsetup-generator
habe BindsTo= dependencies on their backend devices. The dm-crypt
devices have the db_persist flag set, and thus survive the udev db
cleanup while switching root. But backend devices usually don't survive.
These devices are neither mounted nor used for swap, thus they will
seen as DEVICE_NOT_FOUND after switching root.

The BindsTo dependency will cause systemd to schedule a stop
job for the dm-crypt device, breaking boot:

[   68.929457] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Unit is stopped because bound to inactive unit dev-disk-by\x2duuid-3bf91f73\x2d1ee8\x2d4cfc\x2d9048\x2d93ba349b786d.device.
[   68.945660] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Trying to enqueue job systemd-cryptsetup@cr_root.service/stop/replace
[   69.473459] krypton systemd[1]: systemd-cryptsetup@cr_root.service: Installed new job systemd-cryptsetup@cr_root.service/stop as 343

Avoid this by not setting the state of the backend devices to
DEVICE_DEAD.

Fixes the LUKS setup issue reported in systemd#23429.
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd#23429
  * systemd#23218
  * systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
@yuwata
Copy link
Copy Markdown
Member

yuwata commented May 25, 2022

@mwilck Thank you for the analysis of the issue. LGTM. I cherry-picked @mrc0mmand's commit about test update.

@yuwata yuwata added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label May 25, 2022
@tanriol
Copy link
Copy Markdown

tanriol commented May 25, 2022

Will test this tomorrow, sorry for delay.

@yuwata yuwata removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label May 26, 2022
@tanriol
Copy link
Copy Markdown

tanriol commented May 26, 2022

systemd-251 with 2659b04 and 0543ebd on top works for me. Do I need to retest on newer commits in this PR?

@yuwata
Copy link
Copy Markdown
Member

yuwata commented May 26, 2022

systemd-251 with 2659b04 and 0543ebd on top works for me. Do I need to retest on newer commits in this PR?

Thanks. I think retest is not necessary, as they are just rebased.

@yuwata
Copy link
Copy Markdown
Member

yuwata commented May 26, 2022

The two main commits are in PR #23517. Closing.

@yuwata yuwata closed this May 26, 2022
yuwata pushed a commit to yuwata/systemd-stable that referenced this pull request May 26, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd/systemd#23429
  * systemd/systemd#23218
  * systemd/systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
(cherry picked from commit 1fb7f8e)
yuwata pushed a commit to systemd/systemd-stable that referenced this pull request May 27, 2022
This should cover cases regarding devices with `OPTIONS+="db_persist"`
during initrd->sysroot transition.

See:
  * systemd/systemd#23429
  * systemd/systemd#23218
  * systemd/systemd#23489
  * https://bugzilla.redhat.com/show_bug.cgi?id=2087225
(cherry picked from commit 1fb7f8e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

PR#23218 introduced a regression for devices with db_persist option

7 participants