Stable 3227.2.2 randomly causes processes to hang on I/O related operations #847
Closed
Related: flatcar-archive/coreos-overlay#2315
Labels: area/kernel (Issues related to kernel), kind/bug (Something isn't working)
Description
We've seen multiple nodes (across different regions and environments) stall completely on the 3227.2.2 release: kubelet and containerd become unresponsive, the journald and timesyncd units fail, no login is possible, and so on. This seems to happen mostly on the reboot after the update, but it has also occurred at random during normal operation.
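While a node is still stalled (if a console or SSH session survives), the blocked-task state can be captured explicitly so the call traces land in dmesg/the journal. A minimal sketch using the standard Linux procfs interfaces, assuming root access and that magic SysRq is available on the Flatcar kernel:

```python
# Minimal sketch: dump the stacks of all tasks in uninterruptible (D) state
# via magic SysRq, so the hung-task call traces end up in dmesg/the journal.
# Assumes root; both paths are standard Linux procfs interfaces.
with open("/proc/sys/kernel/sysrq", "w") as f:
    f.write("1\n")  # enable all SysRq functions for this boot
with open("/proc/sysrq-trigger", "w") as f:
    f.write("w\n")  # 'w' = show blocked (uninterruptible) tasks
```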
Impact
This causes Kubernetes nodes to become NotReady before they can be drained, which means attached volumes cannot be moved to other nodes, resulting in service interruptions.
Environment and steps to reproduce
- Set-up: Flatcar Stable 3227.2.2, OpenStack, ESXi hypervisors
- Task: on node start, but also during normal operation
- Action(s): no clue if anything specific causes this
- Error:
  - we get "task blocked for more than 120 seconds" errors and related call stacks, see the screenshot
  - CPU usage is very high when the issue occurs
  - the journal gets corrupted when this issue occurs (see the verification sketch after this list)
  - rebooting once more brings the node back
  - once the node is up again, partitions and filesystems appear healthy
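Because the journal itself gets corrupted, a quick integrity check after recovery shows which journal files were damaged and which boots are still readable. A small sketch using standard journalctl options (nothing Flatcar-specific assumed):

```python
# Sketch: after the node recovers, check which journal files were corrupted
# and which boots remain readable. Both flags are standard journalctl options.
import subprocess

subprocess.run(["journalctl", "--verify"], check=False)      # reports PASS/FAIL per journal file
subprocess.run(["journalctl", "--list-boots"], check=False)  # lists the boots that survived
```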
Expected behavior
The nodes do not stall completely.
Additional information
We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have ideas on how we can diagnose this further? Thanks.
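To get more data out of the next occurrence, one option is to tighten the kernel's hung-task detector so a stall produces a crash dump instead of just log lines. A sketch assuming the standard Linux hung-task sysctls and, optionally, kdump configured to capture a vmcore on panic:

```python
# Sketch: make the hung-task detector fire sooner and turn a detected hang
# into a panic, so a configured kdump can capture a crash dump for analysis.
# These are standard Linux sysctls; the values here are illustrative.
SETTINGS = {
    "kernel.hung_task_timeout_secs": "60",  # default is 120 seconds
    "kernel.hung_task_panic": "1",          # panic instead of only logging the trace
}
for key, value in SETTINGS.items():
    path = "/proc/sys/" + key.replace(".", "/")
    with open(path, "w") as f:  # requires root
        f.write(value + "\n")
```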