
Stable 3227.2.2 randomly causes processes to hang on I/O related operations #847

@Nuckal777

Description


We've seen multiple nodes (different regions and environments) stall completely (unresponsive kubelet and containerd, failing journald and timesyncd units, no login possible, ...) on the 3227.2.2 release. This seems to happen mostly on the reboot after the update, but we have also seen it occur at random.

Impact

This causes Kubernetes nodes to become NotReady before they can be drained, which means volumes cannot be moved off them and leads to service interruptions.

Environment and steps to reproduce

  1. Set-up: Flatcar Stable 3227.2.2, OpenStack, ESXi hypervisors
  2. Task: on node start, but also during normal operation
  3. Action(s): no clue if anything specific causes this
  4. Error:
  • we get "task blocked for more than 120 seconds" errors and related call stacks, see the screenshot (a sketch for finding these messages in the journal follows this list)
  • Screenshot 2022-09-07 at 14 49 03
  • CPU usage is very high when the issue occurs
  • the journal gets corrupted when this issue occurs
  • rebooting once more brings the node back
  • once the node is up again, partitions and filesystems appear healthy
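To confirm the hung-task symptom across nodes after the fact, something like the following minimal sketch could scan the journal of the previous boot for the kernel hung task detector's warnings. The `journalctl -b -1 -k` invocation and the "blocked for more than N seconds" message pattern are standard, but the script itself is only an illustration, not part of our tooling:

```python
#!/usr/bin/env python3
"""Count kernel hung-task warnings from the previous boot (illustrative sketch)."""
import re
import subprocess

# Kernel messages (-k) from the previous boot (-b -1).
out = subprocess.run(
    ["journalctl", "-b", "-1", "-k", "--no-pager", "-o", "short-iso"],
    capture_output=True, text=True, check=False,
).stdout

# Matches e.g. "INFO: task kubelet:1234 blocked for more than 120 seconds."
pattern = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")

hung = {}
for line in out.splitlines():
    m = pattern.search(line)
    if m:
        task = m.group(1)
        hung[task] = hung.get(task, 0) + 1

for task, count in sorted(hung.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {count} hung-task warning(s)")
```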

Expected behavior

The nodes do not stall completely.

Additional information

We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have any ideas on how we can diagnose this further? Thanks.
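One thing we could try on the next occurrence (our own assumption, not something prescribed by the Flatcar docs) is to make the hung task detector report every blocked task and to request a blocked-task stack dump via sysrq, so we can attach full call stacks instead of a screenshot. A sketch using only standard kernel /proc interfaces, to be run as root on an affected node while it is still partially responsive:

```python
#!/usr/bin/env python3
"""Collect extra hung-task diagnostics (sketch, run as root on an affected node)."""
from pathlib import Path

def write(path: str, value: str) -> None:
    """Write a sysctl/sysrq value, skipping files absent on this kernel."""
    p = Path(path)
    if not p.exists():
        print(f"skipping {path}: not present on this kernel")
        return
    p.write_text(value)
    print(f"{path} <- {value}")

# Report every blocked task, not just the first few (-1 = unlimited warnings).
write("/proc/sys/kernel/hung_task_warnings", "-1")
# Check for hung tasks more often than the default 120 s.
write("/proc/sys/kernel/hung_task_timeout_secs", "30")
# Enable all sysrq functions, then dump stacks of blocked (D-state) tasks.
write("/proc/sys/kernel/sysrq", "1")
write("/proc/sysrq-trigger", "w")
```

The sysrq "w" dump and any hung-task warnings land in the kernel log, so they should be retrievable with journalctl after the next reboot unless the journal corruption hits first.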

cc @databus23 @defo89 @jknipper


Labels: area/kernel, kind/bug
