Bug #51733

closed

offline host hangs serve loop for 15 mins

Added by Daniel Pivonka over 4 years ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: cephadm
Target version: -
% Done: 0%
Source:
Backport: quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Fixed In: v17.0.0-11228-g73f8d0fdcd
Released In: v18.2.0~2522
Upkeep Timestamp: 2025-07-14T17:44:35+00:00

Description

When a host in your cluster goes offline, the next time the serve loop starts, _refresh_hosts_and_daemons() is called and eventually _run_cephadm(gather-facts) is called, because cephadm doesn't know the host is offline yet.

In _run_cephadm(), _remote_connection() is called to get a connection to the host.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1166

_remote_connection() calls _get_connection(), which returns the current connection if it has one or otherwise opens a new connection. If it can't make a connection, it marks the host as offline.
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L1347
https://github.com/ceph/ceph/blob/64dbe17fdbb27abd89755c61ef01744da5d683cc/src/pybind/mgr/cephadm/module.py#L1301
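The get-or-open behavior described above can be sketched roughly as follows. This is a simplified, hypothetical model, not the cephadm code: `ConnectionCache`, `opener`, and the `offline` set are names invented for illustration. The point it shows is that only a failed attempt to open a connection marks a host offline, while a cached connection is handed back without any liveness check.

```python
from typing import Callable, Dict, Set


class ConnectionCache:
    """Hypothetical stand-in for the get-or-open logic described in this report."""

    def __init__(self, opener: Callable[[str], object]) -> None:
        self._opener = opener                 # function that opens a new connection
        self._cons: Dict[str, object] = {}    # host -> cached connection
        self.offline: Set[str] = set()        # hosts currently considered offline

    def get_connection(self, host: str) -> object:
        # A cached connection is returned as-is; nothing verifies it still works.
        if host in self._cons:
            return self._cons[host]
        try:
            conn = self._opener(host)
        except OSError:
            # Only a failed open marks the host offline.
            self.offline.add(host)
            raise
        self._cons[host] = conn
        self.offline.discard(host)
        return conn
```

In this model, a host that disappears after its connection was cached is never marked offline by `get_connection()` itself; the failure only surfaces later, when something tries to use the stale connection.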

Unfortunately, it returns an old existing connection to the host that is actually offline and tries to run gather-facts on the host through that connection.
It then takes 15 minutes to error out, because that connection cannot work while the host is offline. During that time the serve loop is stuck.

Once it errors out, the host is correctly marked as offline the next time the serve loop starts.
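One general way to keep a stale cached connection from blocking a loop like this is to put a deadline on each remote call and drop the connection when the deadline passes. The sketch below illustrates that idea with plain sockets. It is an assumption-laden example and not the change that was merged for this bug: `gather_facts`, the `cons` dict, the `offline` set, and the request bytes are all hypothetical.

```python
import socket
from typing import Dict, Optional, Set


def gather_facts(host: str,
                 cons: Dict[str, socket.socket],
                 offline: Set[str],
                 timeout_s: float = 5.0) -> Optional[bytes]:
    """Send a request over the cached connection, giving up after timeout_s."""
    conn = cons[host]
    conn.settimeout(timeout_s)   # bound every blocking socket operation
    try:
        conn.sendall(b"gather-facts\n")   # hypothetical request format
        return conn.recv(4096)
    except socket.timeout:
        # The peer never answered: discard the stale connection and
        # record the host as offline instead of waiting indefinitely.
        conn.close()
        del cons[host]
        offline.add(host)
        return None
```

With a peer that never replies (for example, one end of `socket.socketpair()` that is left idle), the call returns `None` after `timeout_s` seconds and the host ends up in `offline`, so the caller regains control quickly.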

I've attached a log of this happening. vm-03 is the offline host.


Files

offlinehostserveloophang.txt (8.86 KB) Daniel Pivonka, 07/19/2021 08:02 PM

Related issues 1 (0 open, 1 closed)

Related to Orchestrator - Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally (Resolved)

Actions #1

Updated by Sebastian Wagner over 4 years ago

  • Related to Bug #51736: mgr hung forever when execute multiprocessing.pool.ThreadPool accidentally added
Actions #2

Updated by Daniel Pivonka over 4 years ago

This only happens if the host is not gracefully shut down.

Actions #3

Updated by Adam King about 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Adam King
  • Pull request ID set to 45286
Actions #4

Updated by Adam King almost 4 years ago

  • Status changed from In Progress to Pending Backport
Actions #5

Updated by Redouane Kachach Elhicou almost 4 years ago

  • Backport set to quincy,pacific
Actions #6

Updated by Redouane Kachach Elhicou almost 4 years ago

  • Status changed from Pending Backport to Resolved
Actions #7

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 73f8d0fdcdb73cf4a910b21fc4ceace2b74dffc8
  • Fixed In set to v17.0.0-11228-g73f8d0fdcd
  • Released In set to v18.2.0~2522
  • Upkeep Timestamp set to 2025-07-14T17:44:35+00:00