disk-uuid: improve logic for UUID randomization #17
margamanterola merged 1 commit into flatcar-master
dracut/30disk-uuid/disk-uuid.sh
Outdated
    if [[ -e "${DEVICE}" ]]; then
        /usr/bin/cgpt repair ${DEVICE} && \
        /usr/sbin/sgdisk --disk-guid=R ${DEVICE} && \
        /usr/bin/udevadm settle
If the goal is to have the /dev/disk/by-diskuuid/000… entry go away and a new /dev/disk/by-diskuuid/ entry appear, it would make sense to wait for exactly that to happen. The udevadm settle command behaves more like a sleep 1 than a reliable waiter. If it isn't needed, it could be removed; if it is needed, I would prefer a real waiter.
I don't know what "a real waiter" would look like. But this part of the code comes from the previous unit and I'd rather not change that in this commit, since that part was working well enough already.
It was there before, but because of the reordering it now runs at a different time, which could uncover a race condition.
The waiter could first count the number of entries in /dev/disk/by-diskuuid/, and then wait until /dev/disk/by-diskuuid/000… no longer exists and the number of entries is at least the same as before.
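The waiter described above could be sketched roughly as follows. This is only an illustration, not code from this PR: the function names, the default timeout, and the polling interval are all assumptions, and the directory and entry name are passed in as arguments rather than hard-coded.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names exist in disk-uuid.sh.

# Print the number of entries in a directory (0 if it is missing or empty).
count_entries() {
    local dir="$1" n=0 entry
    for entry in "${dir}"/*; do
        if [[ -e "${entry}" || -L "${entry}" ]]; then
            n=$((n + 1))
        fi
    done
    echo "${n}"
}

# Usage: wait_for_uuid_change DIR OLD_NAME BEFORE_COUNT [TIMEOUT_SECONDS]
# Returns 0 once DIR/OLD_NAME is gone and DIR holds at least BEFORE_COUNT
# entries; returns 1 if that has not happened before the timeout.
wait_for_uuid_change() {
    local dir="$1" old="$2" before="$3" timeout="${4:-30}"
    local deadline=$((SECONDS + timeout))
    while (( SECONDS < deadline )); do
        if [[ ! -e "${dir}/${old}" && ! -L "${dir}/${old}" ]] &&
           (( $(count_entries "${dir}") >= before )); then
            return 0
        fi
        sleep 0.1  # assumes a sleep that accepts fractional seconds
    done
    return 1
}
```

A caller would record the count with count_entries before running sgdisk and then call wait_for_uuid_change afterwards in place of udevadm settle. Whether the initramfs provides a sleep that accepts fractional seconds is not something this thread establishes.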
Let's leave it as long as it works but I think it would be cleaner to remove it and see if it still works, and if not we know the condition that we want to wait for.
e9d654e to 1ca175e
dracut/30disk-uuid/disk-uuid.sh
Outdated
    if [[ -e "${DEVICE}" ]]; then
        /usr/bin/cgpt repair ${DEVICE} && \
        /usr/sbin/sgdisk --disk-guid=R ${DEVICE} && \
        /usr/bin/udevadm settle
No tests failed on any other platform. I guess the Azure machine was very slow and also had some other problem: in the log, udev hangs for some time.
It seems that this udevadm settle failure on Azure is nothing new and already happens sometimes. Here is the output of an Azure test failing on the main branch. We can see that sgdisk ran to completion (The operation has completed successfully), which means that the failing process was udevadm settle:
Jul 31 12:54:38.845227 systemd[1]: Starting Generate new UUID for disk GPT dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001...
Jul 31 12:54:38.870575 cgpt[405]: Primary Header is updated.
Jul 31 12:54:38.870575 cgpt[405]: Secondary Entries is updated.
Jul 31 12:54:38.870575 cgpt[405]: Secondary Header is updated.
Jul 31 12:54:38.891154 systemd[1]: Found device Virtual_Disk OEM.
Jul 31 12:54:38.921469 systemd[1]: Found device Virtual_Disk EFI-SYSTEM.
Jul 31 12:54:38.929559 kernel: random: crng init done
Jul 31 12:54:39.018151 systemd[1]: Found device Virtual_Disk ROOT.
Jul 31 12:54:39.021250 systemd[1]: Reached target Initrd Root Device.
Jul 31 12:54:39.044066 systemd[1]: Found device Virtual_Disk USR-A.
Jul 31 12:54:40.020420 sgdisk[407]: The operation has completed successfully.
Jul 31 12:54:40.024299 kernel: sda: sda1 sda2 sda3 sda4 sda6 sda7 sda9
Jul 31 12:55:38.791337 systemd-udevd[257]: eth0: Worker [304] processing SEQNUM=1346 is taking a long time
Jul 31 12:56:38.102290 systemd[1]: Finished dracut initqueue hook.
Jul 31 12:56:38.105000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.108621 systemd[1]: Reached target Remote File Systems (Pre).
Jul 31 12:56:38.128010 kernel: audit: type=1130 audit(1596200198.105:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.121946 systemd[1]: Reached target Remote File Systems.
Jul 31 12:56:38.123074 systemd[1]: Starting dracut pre-mount hook...
Jul 31 12:56:38.147712 kernel: audit: type=1130 audit(1596200198.134:11): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-mount comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.134000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-mount comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 12:56:38.133674 systemd[1]: Finished dracut pre-mount hook.
Jul 31 12:56:40.162062 systemd[1]: disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001.service: Main process exited, code=exited, status=1/FAILURE
Jul 31 12:56:40.197052 kernel: audit: type=1130 audit(1596200200.163:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jul 31 12:56:40.163000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001 comm="systemd" exe="/usr/lib64/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jul 31 12:56:40.162240 systemd[1]: disk-uuid@dev-disk-by\x2ddiskuuid-00000000\x2d0000\x2d0000\x2d0000\x2d000000000001.service: Failed with result 'exit-code'.
Jul 31 12:56:40.162544 systemd[1]: Failed to start Generate new UUID for disk GPT dev/disk/by-diskuuid/00000000-0000-0000-0000-000000000001.
Jul 31 12:56:40.165358 systemd[1]: Dependency failed for Initrd Default Target.
Jul 31 12:56:40.166221 systemd[1]: initrd.target: Job initrd.target/start failed with result 'dependency'.
Full log here for the test on the main branch (well almost, I tested a networkd unit to exclude weave devices but this is totally unrelated).
Ok, I'm applying your changes. I'm not sure ignoring a udev failure is the way to go, though.
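For illustration, one way to tolerate a udevadm settle failure without hiding it completely is to log a warning and carry on. The wrapper below is a hypothetical sketch, not the change that was applied here; the function name and the message format are assumptions.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command and, if it exits non-zero, log a
# warning to stderr instead of propagating the failure.
warn_on_failure() {
    local status=0
    "$@" || status=$?
    if (( status != 0 )); then
        echo "warning: '$*' exited with status ${status}, continuing" >&2
    fi
    return 0
}

# Example call (commented out so the sketch has no side effects):
# warn_on_failure /usr/bin/udevadm settle --timeout=30
```

The trade-off is the one raised in the comment above: the unit no longer fails when udev is slow, but a genuine udev problem then only shows up as a line in the journal.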
Instead of detecting whether the disk-uuid unit should be executed or not in the ignition-generator, move the logic to a separate script and execute the unit as long as it's not PXE booting.
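As a rough sketch of what "check first, then act" can look like in a standalone script: the function names below are hypothetical, and reading the disk GUID through lsblk's PTUUID column is an assumption, not necessarily how the merged disk-uuid.sh does it. The cgpt and sgdisk invocations are the ones quoted in the diff above.

```shell
#!/bin/bash
# Sketch only; function names are hypothetical, not the merged disk-uuid.sh.
DEFAULT_GUID="00000000-0000-0000-0000-000000000001"

# True when the given disk GUID equals the default one (case-insensitive).
needs_randomization() {
    [[ "${1,,}" == "${DEFAULT_GUID}" ]]
}

# Assumption: lsblk's PTUUID column reports the GPT disk GUID of the device.
read_disk_guid() {
    lsblk --nodeps --noheadings --output PTUUID "$1"
}

# Randomize the disk GUID only if the device exists and still carries the
# default GUID; otherwise do nothing and report success.
randomize_if_needed() {
    local device="$1" guid
    [[ -e "${device}" ]] || return 0
    guid="$(read_disk_guid "${device}")" || return 1
    guid="${guid//[[:space:]]/}"  # strip padding and newlines
    if needs_randomization "${guid}"; then
        /usr/bin/cgpt repair "${device}" &&
            /usr/sbin/sgdisk --disk-guid=R "${device}"
    fi
}
```

Because the check lives inside the script, the unit can run unconditionally and simply does nothing when the GUID has already been randomized.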
023449b to 784eca0
Tests are passing on all platforms, with just a few provisioning failures.
Booting after installation on Packet c3.medium was also successful, yet the tests didn't run because I was connected to the console, which made kola think that GRUB output wasn't present.
Move UUID randomization code to its own script
We're removing the code in GRUB that detects whether the disk needs to be randomized or not (flatcar/scripts#82). So, we now need to detect whether the change needs to happen during initramfs.
Before, ignition-generator was trying to detect whether the disk-uuid unit should run or not. Due to ordering and timing issues, detecting whether there's a disk with UUID 00000000-0000-0000-0000-000000000001 in ignition-generator doesn't work. So, instead, move the logic to a separate script that checks whether something needs to be done or not, and execute the unit unconditionally. The unit verifies whether it needs to randomize the UUID and only does so when necessary.
How to use / Testing done
Building an image with this change plus the change in flatcar/scripts#82 leads to GRUB booting successfully on a c3.medium.x86 machine followed by the disk UUID getting randomized.
WIP notice: Due to issues with the current flatcar-master-alpha, I wasn't yet able to fully test this change on all platforms; I only tested it manually on Packet. Once alpha is fixed, I'll test again on all platforms.