Fixes & improvments for using homed-luks on 4k disks #35776

scarlet-storm · 2024-12-28T08:23:25Z

Mostly consists of fixes to

use the same sector_size as the fdisk context we are using, when converting between sectors returned by libfdisk to bytes. Fixes homectl create --image-path= does not work on 4Kn drives #30394 , Fixes Cannot login with user created using --luks-sector-size=4096 #30393
Use the explicit sector size if specified in the home record when are probing the image file using libblkid. Fixes Cannot login with user created using --luks-sector-size=4096 #30393

Also contains some other improvements with using physical block devices.

Automatically probe sector size of physical block device, if user does not pass luks-sector-size explicitly.
Assign partitions to 1 MiB boundaries, as it is the standard practice followed by all tools, fdisk, gptfdisk, gnu parted etc.
Avoid stacking of loop device on top of physical block device in home_create_luks as it leads to degradation of discard operations, and mkfs getting stuck.

src/home/homework-luks.c

poettering · 2025-01-03T11:41:26Z

src/home/homework-luks.c

-                return log_error_errno(errno, "Failed to wipe partition table: %m");
-        if ((size_t) n != ssz * 2)
-                return log_error_errno(SYNTHETIC_ERRNO(EIO), "Short write while wiping partition table.");
-


why remove this? i added this explicitly because libfdisk got confused by the old partition table, and its pmbr.

I did not face an issue with writing out the new partition table. Also I was not able to understand the exact issue that the pmbr is causing from the earlier commit message. I'm not sure why fdisk wouldn't like the existing partition table.
Could you explain the issue in detail, and I will try reproducing it.

Yeah, I think that it should be fine to drop this, if current libfdisk works without it. Maybe it was a bug in an old version, maybe it was a misunderstanding. If the issue pops up again, we'll revert the removal.

src/home/homework-luks.c

poettering · 2025-01-03T12:04:28Z

hmm, i am tempted to say we should use 4k sectors everywhere by default, unless specified in the user record otherwiwse. (and of course the gpt partition table should use the native sector size if we are talking about raw block devices).

afaik there's now downside of using larger sector sizes on the luks and fs level, but it makes our images more portable...

scarlet-storm · 2025-01-04T08:37:55Z

hmm, i am tempted to say we should use 4k sectors everywhere by default, unless specified in the user record otherwiwse. (and of course the gpt partition table should use the native sector size if we are talking about raw block devices).

afaik there's now downside of using larger sector sizes on the luks and fs level, but it makes our images more portable...

Hi, thanks for the review.
I was also thinking about using a 4k image size by default. Most of the filesystems use 4k sector size by default anyways even when the underlying sector size is 512. There should be downsides to using 4k by default for file based images.
The concerns might be

Small filesystems or lot of small files with sizes < 4k. xfs by default still uses a 512 default sector size, if the underlying image supports it and ext4 uses 512 sector size if filesystem < 2G iirc.
With respect to backwards compatibility we will have to always probe image files we are opening assuming a sector size of 512 & 4k, as the image might have been created with an old version of systemd-homework. This should be fine to implement and not an issue.

Also my thoughts were that these functions can be rewritten to use an internal systemd-repart library in the future. I think all the partition, luks & fs creation / resizing operations can be handled by the repart library. It will require creation of an repart api and add an additional dependency for homed

GreyXor · 2025-02-26T13:31:10Z

Hi @scarlet-storm
I have a 4k ssd and I'm interested in this patch. Do you need any tests? or do we miss anything?

scarlet-storm · 2025-03-03T07:37:27Z

I will work on cleaning up some of the code style issues mentioned in this PR.
I think the other changes can be follow up patches.

GreyXor · 2025-04-29T10:20:52Z

Hi @scarlet-storm,
Just wanted to check in regarding the code style cleanup and the follow-up patches you mentioned. Did you had a chance to make any progress or if there’s anything I can do to help move things forward.
Thanks!

systemd/systemd#35776

GreyXor · 2025-11-17T09:54:13Z

I will work on cleaning up some of the code style issues mentioned in this PR. I think the other changes can be follow up patches.

Hello, gentle ping :) any news about it ? thank you

scarlet-storm · 2025-11-18T07:47:07Z

I have addressed the code style issues. For the open comments

zeroing of PMBR, I don't understand why this is needed. Could you provide a simple example to reproduce this ? I did not face any issue with allowing fdisk to write out the partition table, without zeroing.
Changing default sector size to 4k for file based images. I think this can be taken up as part of a future PR. As I mentioned earlier, I think ideally we could rewrite the image creation to use systemd-repart

keszybz

Those changes look generally reasonable, but I don't think I grok all the implications. Some small comments on style.

keszybz · 2025-11-20T09:07:01Z

src/home/homework-luks.c

+/* Round down to the nearest 1 MiB size. Given that most tools generally align partitions to 1 MiB boundaries, let's align our
+ * partitions to that too. In the worst case we'll waste 1 MiB per partition that way, but I think I can live


Why do you want to do this? The commit describes what is being done, but not really why. Doing something the same as others is not enough motivation, unless the difference is causing problems.

After those changes, the comment doesn't quite make sense. The sentence about wasting 4k bytes was partially a joke. If we're wasting 1MB, a better explanation is needed.

For standardization. the code does not explicitly set the alignment when using fdisk as 4k, so by default fdisk is already using 1MiB alignment for initial partition creation. So the comment says 4k, but the created partition is already aligned to 1MiB boundary. Moreover during resize, fdisk_partition_size_explicit is called with enable=1 . This means that fdisk no longer will align the partitions. So we have a mix of partitions where the initial partition is aligned to 1MiB & if any resize is applied, the partition is aligned to 4KiB.

As for the reason why 1MiB is used a standard partition size by all linux tools (fdisk, gparted, gptfdisk), it's probably because it is a multiple of most common sector sizes. Note that even for 4K disk, the sector size of 4k is applicable for read/write operations. Other operations like discard might have much higher granularity like 8k / 16k. Also for nand cells probably write amplification also plays a part.

wasting 1 MiB is still a joke imo.

src/home/homework-luks.c

scarlet-storm · 2025-11-20T12:26:04Z

Those changes look generally reasonable, but I don't think I grok all the implications. Some small comments on style.

Thank you for the review. If there is anything particular you want to discuss in detail, let me know.

src/home/homework-luks.c

yuwata · 2026-01-04T19:44:46Z

Sorry for late review, but now this needs to be rebased.

Ensure we use the same sector size as used in the created fdisk context when converting between sectors and bytes.

Remove zeroing of PMBR before writing out the new partition table. There is no reason, to do this explicitly ?

Auto-probe the luks sector size, if not explicitly specified in the home record

Align partitions to 1MiB for consistency with regular partition tools which use 1MiB alignment by default

If there is an explicit sector size specified in the user record, use the same when probing the file using libblkid. The default is 512 bytes, which will not be able to find the signatures, if the partition table on regular file was created assuming 4k sectors

Ensure we don't create a loop device on top of a physical block device. This leads to huge performance degradation of discard operations if the physical device does not support discard_on_zeroes. - loop device historical semantics dictates that when the device is discarded, it needs to return zero data on read. This can be implemented easily on a filesystem. since fallocate zero-range would return immediately & the holes are handled at the filesystem level to return zero data on read. - For a raw block device, the feature (discard_zeroes_data) depends on the capabilities of the physical device that is exposed to the block layer by the driver. This means that to guarantee that the loop device stacked on a block device returns zero on discarded data, it needs to convert discarded range into write_zero op on the block device. https://github.com/torvalds/linux/blob/63676eefb7a026d04b51dcb7aaf54f358517a2ec/drivers/block/loop.c#L773 For example on one of my local nvme I can see the following: cat /sys/class/block/nvme1n1/queue/write_zeroes_max_bytes 131072 cat /sys/class/block/nvme0n1/queue/discard_max_hw_bytes 2199023255040 This means maximum size of a write_zero operation can be 128KiB & maximum size of discard operation can be 2TiB on the block device. So discarding for example 1TB of data, which would be a single block device operation, gets split into 8.3 million block device operations when issued on top of stacked loop device.

Fix for specific case systemd#30393 where 4k sector luks container is created on a 512b device. In this case the partition table needs to be 512b, else the kernel will not be able to find the partition, and we will have to create a loop device to translate the partition table to 4k.

github-actions bot added the please-review PR is ready for (re-)review by a maintainer label Dec 28, 2024