Skip to content

Conversation

@scarlet-storm
Copy link

Mostly consists of fixes to

Also contains some other improvements with using physical block devices.

  • Automatically probe sector size of physical block device, if user does not pass luks-sector-size explicitly.
  • Assign partitions to 1 MiB boundaries, as it is the standard practice followed by all tools, fdisk, gptfdisk, gnu parted etc.
  • Avoid stacking of loop device on top of physical block device in home_create_luks as it leads to degradation of discard operations, and mkfs getting stuck.

@github-actions github-actions bot added the please-review PR is ready for (re-)review by a maintainer label Dec 28, 2024
return log_error_errno(errno, "Failed to wipe partition table: %m");
if ((size_t) n != ssz * 2)
return log_error_errno(SYNTHETIC_ERRNO(EIO), "Short write while wiping partition table.");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this? i added this explicitly because libfdisk got confused by the old partition table, and its pmbr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not face an issue with writing out the new partition table. Also I was not able to understand the exact issue that the pmbr is causing from the earlier commit message. I'm not sure why fdisk wouldn't like the existing partition table.
Could you explain the issue in detail, and I will try reproducing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that it should be fine to drop this, if current libfdisk works without it. Maybe it was a bug in an old version, maybe it was a misunderstanding. If the issue pops up again, we'll revert the removal.

@poettering
Copy link
Member

hmm, i am tempted to say we should use 4k sectors everywhere by default, unless specified in the user record otherwiwse. (and of course the gpt partition table should use the native sector size if we are talking about raw block devices).

afaik there's now downside of using larger sector sizes on the luks and fs level, but it makes our images more portable...

@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks homed homed, homectl, pam_homed and removed please-review PR is ready for (re-)review by a maintainer labels Jan 3, 2025
@scarlet-storm
Copy link
Author

hmm, i am tempted to say we should use 4k sectors everywhere by default, unless specified in the user record otherwiwse. (and of course the gpt partition table should use the native sector size if we are talking about raw block devices).

afaik there's now downside of using larger sector sizes on the luks and fs level, but it makes our images more portable...

Hi, thanks for the review.
I was also thinking about using a 4k image size by default. Most of the filesystems use 4k sector size by default anyways even when the underlying sector size is 512. There should be downsides to using 4k by default for file based images.
The concerns might be

  • Small filesystems or lot of small files with sizes < 4k. xfs by default still uses a 512 default sector size, if the underlying image supports it and ext4 uses 512 sector size if filesystem < 2G iirc.
  • With respect to backwards compatibility we will have to always probe image files we are opening assuming a sector size of 512 & 4k, as the image might have been created with an old version of systemd-homework. This should be fine to implement and not an issue.

Also my thoughts were that these functions can be rewritten to use an internal systemd-repart library in the future. I think all the partition, luks & fs creation / resizing operations can be handled by the repart library. It will require creation of an repart api and add an additional dependency for homed

@GreyXor
Copy link

GreyXor commented Feb 26, 2025

Hi @scarlet-storm
I have a 4k ssd and I'm interested in this patch. Do you need any tests? or do we miss anything?

@scarlet-storm
Copy link
Author

I will work on cleaning up some of the code style issues mentioned in this PR.
I think the other changes can be follow up patches.

@GreyXor
Copy link

GreyXor commented Apr 29, 2025

Hi @scarlet-storm,
Just wanted to check in regarding the code style cleanup and the follow-up patches you mentioned. Did you had a chance to make any progress or if there’s anything I can do to help move things forward.
Thanks!

scarlet-storm added a commit to scarlet-storm/flake that referenced this pull request Aug 9, 2025
@GreyXor
Copy link

GreyXor commented Nov 17, 2025

I will work on cleaning up some of the code style issues mentioned in this PR. I think the other changes can be follow up patches.

Hello, gentle ping :) any news about it ? thank you

@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Nov 18, 2025
@scarlet-storm scarlet-storm marked this pull request as draft November 18, 2025 05:43
@scarlet-storm scarlet-storm marked this pull request as ready for review November 18, 2025 07:33
@scarlet-storm
Copy link
Author

I have addressed the code style issues. For the open comments

  • zeroing of PMBR, I don't understand why this is needed. Could you provide a simple example to reproduce this ? I did not face any issue with allowing fdisk to write out the partition table, without zeroing.
  • Changing default sector size to 4k for file based images. I think this can be taken up as part of a future PR. As I mentioned earlier, I think ideally we could rewrite the image creation to use systemd-repart

@yuwata yuwata added this to the v259 milestone Nov 19, 2025
Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those changes look generally reasonable, but I don't think I grok all the implications. Some small comments on style.

Comment on lines +68 to +70
/* Round down to the nearest 1 MiB size. Given that most tools generally align partitions to 1 MiB boundaries, let's align our
* partitions to that too. In the worst case we'll waste 1 MiB per partition that way, but I think I can live
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to do this? The commit describes what is being done, but not really why. Doing something the same as others is not enough motivation, unless the difference is causing problems.

After those changes, the comment doesn't quite make sense. The sentence about wasting 4k bytes was partially a joke. If we're wasting 1MB, a better explanation is needed.

Copy link
Author

@scarlet-storm scarlet-storm Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For standardization. the code does not explicitly set the alignment when using fdisk as 4k, so by default fdisk is already using 1MiB alignment for initial partition creation. So the comment says 4k, but the created partition is already aligned to 1MiB boundary. Moreover during resize, fdisk_partition_size_explicit is called with enable=1 . This means that fdisk no longer will align the partitions. So we have a mix of partitions where the initial partition is aligned to 1MiB & if any resize is applied, the partition is aligned to 4KiB.
  • As for the reason why 1MiB is used a standard partition size by all linux tools (fdisk, gparted, gptfdisk), it's probably because it is a multiple of most common sector sizes. Note that even for 4K disk, the sector size of 4k is applicable for read/write operations. Other operations like discard might have much higher granularity like 8k / 16k. Also for nand cells probably write amplification also plays a part.
  • wasting 1 MiB is still a joke imo.

@scarlet-storm
Copy link
Author

scarlet-storm commented Nov 20, 2025

Those changes look generally reasonable, but I don't think I grok all the implications. Some small comments on style.

Thank you for the review. If there is anything particular you want to discuss in detail, let me know.

@bluca bluca modified the milestones: v259, v260 Nov 25, 2025
@scarlet-storm scarlet-storm force-pushed the homework-4k-disk branch 2 times, most recently from 6de4f5d to 329ec52 Compare November 29, 2025 06:39
@yuwata
Copy link
Member

yuwata commented Jan 4, 2026

Sorry for late review, but now this needs to be rebased.

@yuwata yuwata added needs-rebase and removed please-review PR is ready for (re-)review by a maintainer labels Jan 4, 2026
Ensure we use the same sector size as used in the created
fdisk context when converting between sectors and bytes.
Remove zeroing of PMBR before writing out the new partition table.
There is no reason, to do this explicitly ?
Auto-probe the luks sector size, if not explicitly
specified in the home record
Align partitions to 1MiB for consistency with regular partition tools
which use 1MiB alignment by default
If there is an explicit sector size specified in the user record,
use the same when probing the file using libblkid. The default
is 512 bytes, which will not be able to find the signatures, if the
partition table on regular file was created assuming 4k sectors
Ensure we don't create a loop device on top of a physical block device.
This leads to huge performance degradation of discard operations if the
physical device does not support discard_on_zeroes.

- loop device historical semantics dictates that when the device is
  discarded, it needs to return zero data on read. This can be
  implemented easily on a filesystem. since fallocate zero-range
  would return immediately & the holes are handled at the filesystem
  level to return zero data on read.
- For a raw block device, the feature (discard_zeroes_data) depends on
  the capabilities of the physical device that is exposed to the
  block layer by the driver. This means that to guarantee that the loop
  device stacked on a block device returns zero on discarded data,
  it needs to convert discarded range into write_zero op on the block device.
  https://github.com/torvalds/linux/blob/63676eefb7a026d04b51dcb7aaf54f358517a2ec/drivers/block/loop.c#L773

For example on one of my local nvme I can see the following:
cat /sys/class/block/nvme1n1/queue/write_zeroes_max_bytes
131072
cat /sys/class/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

This means maximum size of a write_zero operation can be 128KiB &
maximum size of discard operation can be 2TiB on the block device.
So discarding for example 1TB of data, which would be a single block
device operation, gets split into 8.3 million block device operations
when issued on top of stacked loop device.
Fix for specific case systemd#30393 where 4k sector luks container is created
on a 512b device. In this case the partition table needs to be 512b,
else the kernel will not be able to find the partition, and we will
have to create a loop device to translate the partition table to 4k.
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed needs-rebase labels Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

homed homed, homectl, pam_homed please-review PR is ready for (re-)review by a maintainer

Development

Successfully merging this pull request may close these issues.

homectl create --image-path= does not work on 4Kn drives Cannot login with user created using --luks-sector-size=4096

6 participants