extend-filesystems.service fails sometimes #235

Hello!

We've been hunting an elusive issue for a long time. We noticed that our Flatcar Linux instances sometimes fail to boot properly. The basic system is up and running, but our systemd services configured through Ignition fail to start at all. They are all masked. It happens very rarely, so it's hard to debug, but we noticed that when it happens the root filesystem isn't resized properly.

This is a non-working node.
```
ip-10-19-36-22 ~ # df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p9  2.0G   33M  1.9G   2% /
ip-10-19-36-22 ~ # journalctl -u extend-filesystems
-- Logs begin at Thu 2020-11-12 16:23:53 UTC, end at Thu 2020-11-12 16:50:10 UTC. --
Nov 12 16:25:43 localhost systemd[1]: Starting Extend Filesystems...
Nov 12 16:25:43 localhost systemd[1]: extend-filesystems.service: Succeeded.
Nov 12 16:25:43 localhost systemd[1]: Started Extend Filesystems.
```

You can see that `extend-filesystems` didn't do anything. This is what it looks like on a working node.

```
ip-10-19-100-248 ~ # df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p9  122G  4.0G  113G   4% /
ip-10-19-100-248 ~ # journalctl -u extend-filesystems
-- Logs begin at Thu 2020-11-12 16:21:18 UTC, end at Thu 2020-11-12 17:01:57 UTC. --
Nov 12 16:21:30 localhost systemd[1]: Starting Extend Filesystems...
Nov 12 16:21:30 localhost extend-filesystems[1678]: resize2fs 1.44.5 (15-Dec-2018)
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: Filesystem at /dev/nvme0n1p9 is mounted on /; on-line resizing required
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: old_desc_blocks = 1, new_desc_blocks = 16
Nov 12 16:21:31 ip-10-19-100-248 extend-filesystems[1678]: The filesystem on /dev/nvme0n1p9 is now 32947195 (4k) blocks long.
Nov 12 16:21:31 ip-10-19-100-248 systemd[1]: extend-filesystems.service: Succeeded.
Nov 12 16:21:31 ip-10-19-100-248 systemd[1]: Started Extend Filesystems.
```

Based on the `/usr/lib/flatcar/extend-filesystems` script, this line could be the issue.
```
mapfile DEV_LIST < <(lsblk -P -o NAME,PARTTYPE,FSTYPE,MOUNTPOINT)
```

There might be a race condition between this line and the block device information being filled in, but this is just a wild guess. Maybe the output is totally empty, maybe the UUID is missing (the script searches for that UUID later), or maybe the FSTYPE is empty. We couldn't tell.
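To narrow this down, something like the following could be run against the same `lsblk` output. It is only a sketch: `check_root_entry` is a hypothetical helper name, not part of the shipped script, and it only looks at the entry mounted on `/`, flagging an empty `PARTTYPE` or `FSTYPE` there.

```shell
#!/bin/bash
# Hypothetical debugging helper (requires bash), not part of extend-filesystems.
# Reads `lsblk -P` lines on stdin and inspects the entry mounted on "/".
check_root_entry() {
    local line field found=0
    while IFS= read -r line; do
        # Skip every entry that is not the root mount.
        if [[ "${line}" != *'MOUNTPOINT="/"'* ]]; then
            continue
        fi
        found=1
        echo "root entry: ${line}"
        for field in PARTTYPE FSTYPE; do
            # An empty value shows up as FIELD="" in lsblk's pair format.
            if [[ "${line}" == *"${field}=\"\""* ]]; then
                echo "warning: ${field} is empty"
            fi
        done
    done
    if (( found == 0 )); then
        echo "warning: no entry with MOUNTPOINT=\"/\""
    fi
}
```

It would be fed with `lsblk -P -o NAME,PARTTYPE,FSTYPE,MOUNTPOINT | check_root_entry`, ideally as early in the boot as the service itself runs.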

Our other idea is that `cgpt resize "${device}"` doesn't do anything because the partition is already at the extended size. On the failed node, `blockdev --getsz /dev/nvme0n1p9` returned the correct, extended value. So when we reran the `extend-filesystems` script it didn't do anything because the old and the new size were the same, but the filesystem was still only 2GiB. Unfortunately, on this node we ran the script before we checked the `blockdev` output, so we can't tell for sure what the state was right after boot.
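Next time we hit a broken node, a check along these lines could capture that state before anything is rerun. This is a sketch under assumptions: `fs_lags_partition` is a hypothetical name, and the 90% default threshold is an arbitrary choice to leave room for filesystem overhead.

```shell
#!/bin/bash
# Hypothetical check (requires bash): succeeds when the filesystem is much
# smaller than the partition that holds it.
#   $1: partition size in 512-byte sectors, as printed by `blockdev --getsz`
#   $2: filesystem size in bytes
#   $3: optional threshold in percent (default 90, an arbitrary assumption)
fs_lags_partition() {
    local part_bytes=$(( $1 * 512 ))
    local fs_bytes=$2
    local threshold=${3:-90}
    # True when the filesystem covers less than threshold% of the partition.
    (( fs_bytes * 100 < part_bytes * threshold ))
}

# Example usage on a live node (blockdev needs root):
#   part="$(blockdev --getsz /dev/nvme0n1p9)"
#   fs="$(df -B1 --output=size / | tail -n 1)"
#   if fs_lags_partition "${part}" "${fs}"; then
#       echo "partition is extended but the filesystem is not"
#   fi
```

If this reports a lagging filesystem right after boot, the partition was already extended and only the filesystem resize was skipped.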

We're using `Flatcar Container Linux by Kinvolk stable (2345.3.0)` on this specific node, but we have been seeing this behaviour for a long time, even back in the old `CoreOS` days. We're running EBS-backed EC2 instances in AWS.

Have you seen this before? Has anyone else reported this or something similar? Would an upgrade solve it? Could you add a logging line to show what is in `DEV_LIST` for future debugging sessions? Could you also add an echo that shows the device, old_size and new_size, again for future debugging? Would it be possible for us to open a PR for this?
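The logging we have in mind could look roughly like this. It is a sketch only: `log_dev_list` and `log_resize` are hypothetical helper names, and where exactly they would be called from inside the script is an assumption on our side.

```shell
#!/bin/bash
# Hypothetical logging helpers (require bash), meant to be called from
# extend-filesystems so the output lands in the journal.

# Print every entry captured by `mapfile DEV_LIST < <(lsblk -P ...)`.
log_dev_list() {
    local entry
    echo "lsblk returned ${#DEV_LIST[@]} entries"
    for entry in "${DEV_LIST[@]}"; do
        # mapfile without -t keeps the trailing newline, so strip it here.
        printf 'DEV_LIST entry: %s\n' "${entry%$'\n'}"
    done
}

# Print the device and its size before and after the partition resize.
#   $1: device, $2: old size, $3: new size
log_resize() {
    local device=$1 old_size=$2 new_size=$3
    echo "device=${device} old_size=${old_size} new_size=${new_size}"
    if [[ "${old_size}" == "${new_size}" ]]; then
        echo "partition size unchanged for ${device}"
    fi
}
```

With these in place, the journal of a failed node would show whether the `lsblk` output was empty or incomplete and whether the sizes were already equal.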

Thanks!
