
disk-uuid: improve logic for UUID randomization#17

Merged
margamanterola merged 1 commit into flatcar-master from marga-kinvolk/randomize-disk
Aug 4, 2020

@margamanterola
Contributor

Move UUID randomization code to its own script

We're removing the code in GRUB that detects whether the disk needs to be randomized or not (flatcar/scripts#82), so we now need to detect in the initramfs whether the change needs to happen.

Previously, ignition-generator tried to detect whether the disk-uuid unit should run. Due to ordering and timing issues, detecting from ignition-generator whether there's a disk with UUID 00000000-0000-0000-0000-000000000001 doesn't work. So instead, this moves the logic to a separate script that checks whether anything needs to be done, and executes the unit unconditionally. The unit verifies whether it needs to randomize the UUID and only does so when necessary.
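A minimal sketch of what such a check-then-act script could look like. It assumes an unrandomized disk shows up as /dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001 (the path in the disk-uuid unit name in the Azure log quoted later in this thread). The function names and the BY_DISKUUID_DIR override are hypothetical and not taken from the PR; the command chain is the one the unit already runs.

```shell
#!/bin/bash
# Hypothetical check-then-randomize script (sketch, not the PR's actual file).
# Assumption: an unrandomized disk is visible as
# /dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001.
DEFAULT_UUID="00000000-0000-0000-0000-000000000001"
# Hypothetical override, only so the check can be exercised outside an initramfs.
BY_DISKUUID_DIR="${BY_DISKUUID_DIR:-/dev/disk/by-diskuuid}"

# Returns 0 only while the default GUID entry is still present.
needs_randomization() {
    local dir="${1:-${BY_DISKUUID_DIR}}"
    [[ -e "${dir}/${DEFAULT_UUID}" ]]
}

main() {
    local device="${BY_DISKUUID_DIR}/${DEFAULT_UUID}"
    if ! needs_randomization; then
        echo "disk GUID already randomized, nothing to do"
        return 0
    fi
    # Same command chain as the existing unit.
    /usr/bin/cgpt repair "${device}" &&
        /usr/sbin/sgdisk --disk-guid=R "${device}" &&
        /usr/bin/udevadm settle
}

# Run main only when executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main
fi
```

With the check inside the script, the unit itself can be started unconditionally: on a disk whose GUID was already randomized it is a no-op.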

How to use / Testing done

Building an image with this change plus the change in flatcar/scripts#82 leads to GRUB booting successfully on a c3.medium.x86 machine followed by the disk UUID getting randomized.

WIP notice: Due to issues with the current flatcar-master-alpha, I haven't yet been able to fully test this change on all platforms; I only tested it manually on Packet. Once alpha is fixed, I'll test again on all platforms.

if [[ -e "${DEVICE}" ]]; then
    /usr/bin/cgpt repair "${DEVICE}" && \
        /usr/sbin/sgdisk --disk-guid=R "${DEVICE}" && \
        /usr/bin/udevadm settle
fi
Member

If the goal is to have the /dev/disk/by-diskuuid/000… entry go away and a new /dev/disk/by-diskuuid/ entry appear, it would make sense to wait for this to happen. The udevadm settle command is more like a sleep 1 than a reliable waiter. Maybe it's not needed, in which case it could be removed; but if it is needed, I would prefer a real waiter.

Contributor Author

I don't know what "a real waiter" would look like. But this part of the code comes from the previous unit and I'd rather not change that in this commit, since that part was working well enough already.

Member

It was there, but because of the reordering it now also runs at a different time, which could uncover a race condition.
The waiter could first count the number of entries in /dev/disk/by-diskuuid/, and then wait until /dev/disk/by-diskuuid/000… no longer exists and the number of entries is at least the same as before.
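The waiter described here could be sketched roughly as follows. This is a hypothetical illustration: the function names, the 30-second default timeout and the 0.1-second poll interval are made up, and it assumes udev removes the old symlink and creates the new one asynchronously after sgdisk rewrites the GUID.

```shell
#!/bin/bash
# Hypothetical waiter for the approach described above (sketch only).
DEFAULT_UUID="00000000-0000-0000-0000-000000000001"

# Print the number of entries in a directory (0 if empty or missing).
# The body runs in a subshell so nullglob does not leak to the caller.
count_entries() (
    shopt -s nullglob
    entries=("$1"/*)
    echo "${#entries[@]}"
)

# Usage: wait_for_new_uuid DIR BEFORE_COUNT [TIMEOUT_SECONDS]
# Returns 0 once DIR/DEFAULT_UUID is gone and DIR again holds at least
# BEFORE_COUNT entries; returns 1 after the (assumed) timeout.
wait_for_new_uuid() {
    local dir="$1" before="$2" timeout="${3:-30}"
    local deadline=$((SECONDS + timeout))
    while (( SECONDS < deadline )); do
        if [[ ! -e "${dir}/${DEFAULT_UUID}" && ! -L "${dir}/${DEFAULT_UUID}" ]] &&
            (( $(count_entries "${dir}") >= before )); then
            return 0
        fi
        sleep 0.1
    done
    echo "timed out waiting for ${dir}/${DEFAULT_UUID} to be replaced" >&2
    return 1
}

# Intended use around the existing commands (illustration):
#   before=$(count_entries /dev/disk/by-diskuuid)
#   /usr/sbin/sgdisk --disk-guid=R "${DEVICE}"
#   wait_for_new_uuid /dev/disk/by-diskuuid "${before}"
```

Counting entries means the waiter does not need to know the new random GUID, only that the old entry disappeared and a replacement showed up.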

Member

Let's leave it as long as it works, but I think it would be cleaner to remove it and see if it still works; if it doesn't, we'll know which condition we want to wait for.

@margamanterola margamanterola force-pushed the marga-kinvolk/randomize-disk branch from e9d654e to 1ca175e on July 31, 2020 09:48
if [[ -e "${DEVICE}" ]]; then
    /usr/bin/cgpt repair "${DEVICE}" && \
        /usr/sbin/sgdisk --disk-guid=R "${DEVICE}" && \
        /usr/bin/udevadm settle
fi
Member

The service failed on Azure with exit code 1. Logs are here.
I don't see why, but I suggest making the script more robust and reporting errors. I guess the udevadm settle command failed due to some unrelated timeout? We can either try removing the call or log the error and just continue.
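A minimal sketch of the second option (log the error and continue), using a hypothetical run_or_warn helper that is not part of the PR. As a side note, the timestamps in the Azure log quoted later in this thread are consistent with a settle timeout: sgdisk reports success at 12:54:40 and the unit fails at 12:56:40, two minutes later, which matches the 120-second default timeout documented for udevadm settle (adjustable with --timeout=SECONDS). That is an inference from the timestamps, not something the log states.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): run a command and, if it fails,
# log its exit status to stderr instead of failing the whole unit.
run_or_warn() {
    local status=0
    "$@" || status=$?
    if (( status != 0 )); then
        echo "warning: '$*' exited with status ${status}, continuing" >&2
    fi
    return 0
}

# Illustration of how the unit's command chain could use it; only the
# udevadm step becomes non-fatal, cgpt and sgdisk failures still propagate:
#   /usr/bin/cgpt repair "${DEVICE}" &&
#       /usr/sbin/sgdisk --disk-guid=R "${DEVICE}" &&
#       run_or_warn /usr/bin/udevadm settle
```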

Member

No tests failed on any of the other platforms. I guess the Azure machine was very slow and also had some other problem: in the log, udev hangs for some time.

Member

It seems that this udevadm settle failure on Azure is nothing new and already happens sometimes. Here is the output of an Azure test failing on the main branch. We can see that sgdisk ran ("The operation has completed successfully"), which means that the failing process was udevadm settle:

Jul 31 12:54:38.845227 systemd[1]: Starting Generate new UUID for disk GPT dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001...
Jul 31 12:54:38.870575 cgpt[405]: Primary Header is updated.
Jul 31 12:54:38.870575 cgpt[405]: Secondary Entries is updated.
Jul 31 12:54:38.870575 cgpt[405]: Secondary Header is updated.
Jul 31 12:54:38.891154 systemd[1]: Found device Virtual_Disk OEM.
Jul 31 12:54:38.921469 systemd[1]: Found device Virtual_Disk EFI-SYSTEM.
Jul 31 12:54:38.929559 kernel: random: crng init done
Jul 31 12:54:39.018151 systemd[1]: Found device Virtual_Disk ROOT.
Jul 31 12:54:39.021250 systemd[1]: Reached target Initrd Root Device.
Jul 31 12:54:39.044066 systemd[1]: Found device Virtual_Disk USR-A.
Jul 31 12:54:40.020420 sgdisk[407]: The operation has completed successfully.
Jul 31 12:54:40.024299 kernel:  sda: sda1 sda2 sda3 sda4 sda6 sda7 sda9
Jul 31 12:55:38.791337 systemd-udevd[257]: eth0: Worker [304] processing SEQNUM=1346 is taking a long time
Jul 31 12:56:38.102290 systemd[1]: Finished dracut initqueue hook.
Jul 31 12:56:38.105000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.108621 systemd[1]: Reached target Remote File Systems (Pre).
Jul 31 12:56:38.128010 kernel: audit: type=1130 audit(1596200198.105:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.121946 systemd[1]: Reached target Remote File Systems.
Jul 31 12:56:38.123074 systemd[1]: Starting dracut pre-mount hook...
Jul 31 12:56:38.147712 kernel: audit: type=1130 audit(1596200198.134:11): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-mount comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.134000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-mount comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.133674 systemd[1]: Finished dracut pre-mount hook.
Jul 31 12:56:40.162062 systemd[1]: disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001.service: Main process exited, code=exited, status=1/FAILURE
Jul 31 12:56:40.197052 kernel: audit: type=1130 audit(1596200200.163:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jul 31 12:56:40.163000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jul 31 12:56:40.162240 systemd[1]: disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001.service: Failed with result 'exit-code'.
Jul 31 12:56:40.162544 systemd[1]: Failed to start Generate new UUID for disk GPT dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001.
Jul 31 12:56:40.165358 systemd[1]: Dependency failed for Initrd Default Target.
Jul 31 12:56:40.166221 systemd[1]: initrd.target: Job initrd.target/start failed with result 'dependency'.

Full log here for the test on the main branch (well, almost: I also tested a networkd unit to exclude weave devices, but that is totally unrelated).

Contributor Author

Ok, I'm applying your changes. I'm not sure ignoring a udev failure is the way to go, though.

Instead of detecting whether the disk-uuid unit should be executed or
not in the ignition-generator, move the logic to a separate script and
execute the unit as long as it's not PXE booting.
@margamanterola margamanterola force-pushed the marga-kinvolk/randomize-disk branch from 023449b to 784eca0 on August 3, 2020 09:07
Member

@pothos pothos left a comment

Tests are passing on all platforms, with just a few provisioning failures.
Booting after installation on Packet c3.medium was also successful, yet the tests didn't run because I was connected to the console, which made kola think that the GRUB output wasn't present.

@margamanterola margamanterola changed the title WIP: disk-uuid: improve logic for UUID randomization disk-uuid: improve logic for UUID randomization Aug 4, 2020
@margamanterola margamanterola merged commit 45a62e8 into flatcar-master Aug 4, 2020
@margamanterola margamanterola deleted the marga-kinvolk/randomize-disk branch August 4, 2020 09:41
2 participants