extend-filesystems.service fails sometimes #235

@akunszt

Hello!

We have been hunting an elusive issue for a long time. We noticed that our Flatcar Linux instances sometimes fail to boot properly. The basic system is up and running, but the systemd services we configure through Ignition fail to start at all; they are all masked. It happens very rarely, so it's hard to debug, but we noticed that when it happens the root filesystem has not been resized properly.

This is a non-working node.

ip-10-19-36-22 ~ # df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p9  2.0G   33M  1.9G   2% /
ip-10-19-36-22 ~ # journalctl -u extend-filesystems
-- Logs begin at Thu 2020-11-12 16:23:53 UTC, end at Thu 2020-11-12 16:50:10 UTC. --
Nov 12 16:25:43 localhost systemd[1]: Starting Extend Filesystems...
Nov 12 16:25:43 localhost systemd[1]: extend-filesystems.service: Succeeded.
Nov 12 16:25:43 localhost systemd[1]: Started Extend Filesystems.

You can see that extend-filesystems didn't do anything. This is what it looks like on a working node.

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p9  122G  4.0G  113G   4% /
ip-10-19-100-248 ~ # journalctl -u extend-filesystems
-- Logs begin at Thu 2020-11-12 16:21:18 UTC, end at Thu 2020-11-12 17:01:57 UTC. --
Nov 12 16:21:30 localhost systemd[1]: Starting Extend Filesystems...
Nov 12 16:21:30 localhost extend-filesystems[1678]: resize2fs 1.44.5 (15-Dec-2018)
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: Filesystem at /dev/nvme0n1p9 is mounted on /; on-line resizing required
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: old_desc_blocks = 1, new_desc_blocks = 16
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: The filesystem on /dev/nvme0n1p9 is now 32947195 (4k) blocks long.
Nov 12 16:21:31 ip-10-19-100-248 systemd[1]: extend-filesystems.service: Succeeded.
Nov 12 16:21:31 ip-10-19-100-248 systemd[1]: Started Extend Filesystems.

Based on the /usr/lib/flatcar/extend-filesystems script, this line could be the issue.

mapfile DEV_LIST < <(lsblk -P -o NAME,PARTTYPE,FSTYPE,MOUNTPOINT)

There might be a race condition between this line and the information about the block devices being filled in, but this is just a wild guess. Maybe the output is totally empty, maybe the UUID is missing (the script searches for that UUID later), or maybe the FSTYPE is empty. We couldn't tell.
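To make that guess checkable on a live node, something like the following could be run early in boot. It is only a sketch: `report_unprobed_partitions` is a hypothetical helper, not part of the Flatcar script, and it assumes the `NAME,PARTTYPE,FSTYPE,MOUNTPOINT` column order from the `lsblk` call above.

```shell
# Hypothetical debugging helper, not part of the Flatcar script.
# Reads `lsblk -P -o NAME,PARTTYPE,FSTYPE,MOUNTPOINT` lines on stdin and
# prints one line for every entry that has a PARTTYPE but an empty FSTYPE.
report_unprobed_partitions() {
    local line name
    while IFS= read -r line; do
        case $line in
            *' PARTTYPE="" '*)
                # No partition type at all (e.g. the whole disk): ignore.
                ;;
            *' FSTYPE="" '*)
                # Extract the value of the leading NAME="..." field.
                name=${line#NAME=\"}
                name=${name%%\"*}
                echo "${name}: PARTTYPE is set but FSTYPE is empty"
                ;;
        esac
    done
}

# Possible use on a node:
#   lsblk -P -o NAME,PARTTYPE,FSTYPE,MOUNTPOINT | report_unprobed_partitions
```

Partitions that legitimately carry no filesystem would be reported too, so the output would only be a hint, not proof of a race.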

Our other idea is that cgpt resize "${device}" doesn't do anything because the partition is already at the extended size. On the failed node, blockdev --getsz /dev/nvme0n1p9 returned the correct, extended value. So when we reran the extend-filesystems script it didn't do anything because the old and new sizes were the same, but the filesystem was still only 2 GiB. Unfortunately, on this node we ran the script before we checked the blockdev output, so we can't tell for sure what the state was right after the boot finished.
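The mismatch described here (partition already grown, filesystem still small) can be expressed as a simple size comparison. The sketch below is our own illustration, not what the script does: `fs_smaller_than_partition` is a hypothetical name, the sector count would come from `blockdev --getsz` (512-byte sectors), and the block count and block size would come from `dumpe2fs -h` on an ext4 filesystem. The example numbers are illustrative; only the 32947195-block figure is taken from the working node's log.

```shell
# Hypothetical check, not part of the Flatcar script.
#   $1: partition size in 512-byte sectors (as printed by `blockdev --getsz`)
#   $2: filesystem block count
#   $3: filesystem block size in bytes
# Returns 0 when the partition has room for at least one more filesystem block.
fs_smaller_than_partition() {
    local part_bytes=$(( $1 * 512 ))
    local fs_bytes=$(( $2 * $3 ))
    [ $(( part_bytes - fs_bytes )) -ge "$3" ]
}

# Illustrative numbers: 524288 blocks of 4 KiB is a 2 GiB filesystem, and
# 263577560 sectors is exactly the size of a 32947195-block (4k) filesystem.
if fs_smaller_than_partition 263577560 524288 4096; then
    echo "filesystem is smaller than its partition, resize2fs is still needed"
fi
```

A check like this does not depend on whether the partition size changed during the current run, which is the case we suspect is being missed.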

We're using Flatcar Container Linux by Kinvolk stable (2345.3.0) on this specific node, but we have been seeing this behaviour for a long time, even back in the old CoreOS days. We're running EBS-backed EC2 instances in AWS.

Have you seen this before? Has anyone else reported this or something similar? Would an upgrade solve it? Can you add a logging line to show what is in DEV_LIST for future debugging sessions? Could you also add an echo to show the device, old_size and new_size, again for future debugging? Would it be possible for us to open a PR for this?
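As a rough idea of the logging we have in mind, here is a minimal sketch. `log_resize_state` is a hypothetical helper and the message format is our invention; only the names device, old_size and new_size come from the script as described above. Anything echoed by the service should show up under journalctl -u extend-filesystems.

```shell
# Hypothetical logging helper, not part of the Flatcar script.
#   $1: device, $2: size before the resize attempt, $3: size after it
log_resize_state() {
    local device="$1" old_size="$2" new_size="$3"
    if [ "$old_size" = "$new_size" ]; then
        echo "${device}: partition size unchanged (old_size=${old_size} new_size=${new_size})"
    else
        echo "${device}: partition size changed (old_size=${old_size} new_size=${new_size})"
    fi
}

# Example call with made-up sizes:
log_resize_state /dev/nvme0n1p9 4194304 263577560
```

With a line like this in the journal, a failed node would show directly whether the script saw an unchanged partition size.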

Thanks!
