
Stable 3227.2.2 randomly causes processes to hang on I/O related operations #847

@Nuckal777

Description


We've seen multiple nodes (different regions and environments) stall completely (unresponsive kubelet and containerd, failing journald and timesyncd units, no login possible, ...) on the 3227.2.2 release. This seems to happen mostly on the reboot after the update, but we have also seen it occur at random.

Impact

This causes Kubernetes nodes to become NotReady before they can be drained, which means volumes cannot be moved off them and leads to service interruptions.

Environment and steps to reproduce

  1. Set-up: Flatcar Stable 3227.2.2, OpenStack, ESXi hypervisors
  2. Task: on node start, but also during normal operation
  3. Action(s): no clue if anything specific causes this
  4. Error:
  • we get "task blocked for more than 120 seconds" errors and related call stacks, see the screenshot (a sketch for finding these messages in the journal follows this list)
  • Screenshot 2022-09-07 at 14 49 03
  • CPU usage is very high when the issue occurs
  • the journal gets corrupted when this issue occurs
  • rebooting once more brings the node back
  • once the node is up again, partitions and filesystems appear healthy
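To confirm the hung-task symptom across nodes after the fact, something like the following minimal sketch could scan the journal of the previous boot for the kernel hung task detector's warnings. The `journalctl -b -1 -k` invocation and the "blocked for more than N seconds" message pattern are standard, but the script itself is only an illustration, not part of our tooling:

```python
#!/usr/bin/env python3
"""Count kernel hung-task warnings from the previous boot (illustrative sketch)."""
import re
import subprocess

# Kernel messages (-k) from the previous boot (-b -1).
out = subprocess.run(
    ["journalctl", "-b", "-1", "-k", "--no-pager", "-o", "short-iso"],
    capture_output=True, text=True, check=False,
).stdout

# Matches e.g. "INFO: task kubelet:1234 blocked for more than 120 seconds."
pattern = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")

hung = {}
for line in out.splitlines():
    m = pattern.search(line)
    if m:
        task = m.group(1)
        hung[task] = hung.get(task, 0) + 1

for task, count in sorted(hung.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {count} hung-task warning(s)")
```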

Expected behavior

The nodes do not stall completely.

Additional information

We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have any ideas on how we can diagnose this further? Thanks.
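One thing we could try on the next occurrence (our own assumption, not something prescribed by the Flatcar docs) is to make the hung task detector report every blocked task and to request a blocked-task stack dump via sysrq, so we can attach full call stacks instead of a screenshot. A sketch using only standard kernel /proc interfaces, to be run as root on an affected node while it is still partially responsive:

```python
#!/usr/bin/env python3
"""Collect extra hung-task diagnostics (sketch, run as root on an affected node)."""
from pathlib import Path

def write(path: str, value: str) -> None:
    """Write a sysctl/sysrq value, skipping files absent on this kernel."""
    p = Path(path)
    if not p.exists():
        print(f"skipping {path}: not present on this kernel")
        return
    p.write_text(value)
    print(f"{path} <- {value}")

# Report every blocked task, not just the first few (-1 = unlimited warnings).
write("/proc/sys/kernel/hung_task_warnings", "-1")
# Check for hung tasks more often than the default 120 s.
write("/proc/sys/kernel/hung_task_timeout_secs", "30")
# Enable all sysrq functions, then dump stacks of blocked (D-state) tasks.
write("/proc/sys/kernel/sysrq", "1")
write("/proc/sysrq-trigger", "w")
```

The sysrq "w" dump and any hung-task warnings land in the kernel log, so they should be retrievable with journalctl after the next reboot unless the journal corruption hits first.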

cc @databus23 @defo89 @jknipper


Labels: area/kernel, kind/bug
