Bug #69760
Status: Closed
Monitors crash largely due to the structure of pg-upmap-primary
Description
Hi.
We ran into an issue with "pg-upmap-primary", which resulted in our monitors crashing massively (around 1,000 crashes per day). According to [1], it should be possible to remove these pg-upmap-primaries. Unfortunately, we are unable to do so because these PGs, and therefore the pool, no longer exist.
root@app001.clX ~ # ceph osd dump | grep 'pg_upmap_primary' | grep 24.ffc
pg_upmap_primary 24.ffc 232
root@app001.clX ~ # ceph osd rm-pg-upmap-primary 24.ffc
Error ENOENT: pgid '24.ffc' does not exist
root@app001.clX ~ # ceph pg dump | grep "^24\."
dumped all
Is there any way to remove these structures?
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Does anyone have a solution or an idea? This is becoming quite a problem for us.
Below, I am attaching one of the many monitor crash logs.
Thank you very much for any advice!
Michal
[1] https://tracker.ceph.com/issues/61948#note-32
{
"assert_condition": "pg_upmap_primaries.empty()",
"assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc",
"assert_func": "void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const",
"assert_line": 3239,
"assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f94216a3640 time 2025-02-02T19:16:03.629964+0100\n/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: 3239: FAILED ceph_assert(pg_upmap_primaries.empty())\n",
"assert_thread_name": "ms_dispatch",
"backtrace": [
"/lib64/libc.so.6(+0x54db0) [0x7f9429054db0]",
"/lib64/libc.so.6(+0xa365c) [0x7f94290a365c]",
"raise()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f9429d630df]",
"/usr/lib64/ceph/libceph-common.so.2(+0x163243) [0x7f9429d63243]",
"/usr/lib64/ceph/libceph-common.so.2(+0x1a0f38) [0x7f9429da0f38]",
"(OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xe2) [0x55ca54957e22]",
"(OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1de) [0x55ca549596ae]",
"(OSDMonitor::build_latest_full(unsigned long)+0x2a3) [0x55ca549599a3]",
"(OSDMonitor::check_osdmap_sub(Subscription*)+0xc8) [0x55ca5495be98]",
"(Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xf04) [0x55ca54834dd4]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x6a6) [0x55ca548359c6]",
"(Monitor::_ms_dispatch(Message*)+0x779) [0x55ca54836d59]",
"/usr/bin/ceph-mon(+0x2f3dfe) [0x55ca547f1dfe]",
"(DispatchQueue::entry()+0x52a) [0x7f9429f5766a]",
"/usr/lib64/ceph/libceph-common.so.2(+0x3e7321) [0x7f9429fe7321]",
"/lib64/libc.so.6(+0xa1912) [0x7f94290a1912]",
"/lib64/libc.so.6(+0x3f450) [0x7f942903f450]"
],
"ceph_version": "18.2.1",
"crash_id": "2025-02-02T18:16:03.632571Z_f5516ed0-6df5-4267-bada-71f5d8d764ba",
"entity_name": "mon.mon001-clX",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "9",
"os_version_id": "9",
"process_name": "ceph-mon",
"stack_sig": "772ef523b041edc5147d1d9905926fb794d32b2635368a8199f6e2e4f2d688bf",
"timestamp": "2025-02-02T18:16:03.632571Z",
"utsname_hostname": "app001.clX",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-402.el9.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Dec 21 19:46:35 UTC 2023"
}
Updated by Michal Strnad about 1 year ago
Hi,
would anyone happen to have any suggestions on how to address this? We're still operating on the edge and have lost monitor quorum multiple times due to the high frequency of crashes, as I previously described.
Any feedback would be greatly appreciated. Thank you!
Michal
Updated by Laura Flores about 1 year ago
- Project changed from Ceph to RADOS
- Category changed from Monitor to Monitor
Updated by Yaarit Hatuka about 1 year ago
Crashes reported in telemetry with the same assert_func and assert_condition:
http://telemetry.front.sepia.ceph.com:4000/d/Nvj6XTaMk/spec-search?orgId=1&var-substr_1=&var-substr_2=&var-substr_3=&var-majors_affected=&var-minors_affected=&var-assert_function=void%20OSDMap::encode(ceph::buffer::v15_2_0::list%26,%20uint64_t)%20const&var-assert_condition=pg_upmap_primaries.empty&var-sig_v1=&var-sig_v2=&var-daemons=&var-only_new_fingerprints=false&var-status_description=All&var-only_open=false
This one seems related:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=ccf7b97344c6c787cba071e014d31eeae4ae7710a07be998d4976461dfd5735a&orgId=1
Updated by Radoslaw Zarzynski about 1 year ago · Edited
root@app001.clX ~ # ceph osd dump | grep 'pg_upmap_primary' | grep 24.ffc
pg_upmap_primary 24.ffc 232
root@app001.clX ~ # ceph osd rm-pg-upmap-primary 24.ffc
Error ENOENT: pgid '24.ffc' does not exist
root@app001.clX ~ # ceph pg dump | grep "^24\."
dumped all
Yeah, we should have --force for the rm-pg-upmap-primary to bypass the two pool-related checks.
Is there any way to remove these structures?
Without further patches, only very low-level, hackish editing of the maps comes to mind.
Finally, we should get support in osdmaptool for purging all the upmap-primary mappings.
"ceph_version": "18.2.1",
This is interesting. 18.2.1 takes the assert path only for very old, pre-nautilus peers.
Perhaps there is a really outdated client in the cluster. If so, cutting it off can help. The output of "ceph daemon mon.X sessions" or the messenger's logs (debug_ms set to at least 10) will be useful:
connection->set_features((uint64_t)connect_reply.features &
(uint64_t)connection->policy.features_supported);
ldout(cct, 10) << __func__ << " connect success " << connect_seq
<< ", lossy = " << connection->policy.lossy << ", features "
<< connection->get_features() << dendl;
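The C++ snippet above shows the key point: a connection's effective feature set is the bitwise AND of what the peer advertises and what the local messenger policy supports. A toy illustration in Python, using made-up sample values:

```python
# Toy illustration (sample values, not taken from any real cluster) of the
# negotiation in the C++ snippet above: the connection's feature set is the
# bitwise AND of the features the peer advertises and the features the
# local policy supports.
peer_features = 0x3f01cfbffffdffff      # advertised by the remote side
policy_supported = 0x2f018fb87aa4aafe   # supported by the local policy

con_features = peer_features & policy_supported
print(hex(con_features))
```

Any feature bit missing from either side is therefore absent from the negotiated connection features, which is what makes the per-session feature dump useful for spotting old peers.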
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Do you have logs by any chance?
Updated by Radoslaw Zarzynski about 1 year ago
Another question about the testing cluster: have all the monitors been upgraded to 18.2.4? I see a scenario (around encode_features of an OSDMap::Incremental) in which an 18.2.0 or 18.2.1 (but not 18.2.2) mon can cause newer mons to take the assert path.
Also, the snippet below might help with nailing down very-old-clients:
$ bin/unittest_features
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from features
[ RUN      ] features.release_features
1 argonaut features 0x40000 looks like argonaut
2 bobtail features 0x40000 looks like argonaut
3 cuttlefish features 0x40000 looks like argonaut
4 dumpling features 0x42040000 looks like dumpling
5 emperor features 0x42040000 looks like dumpling
6 firefly features 0x20842040000 looks like firefly
7 giant features 0x20842040000 looks like firefly
8 hammer features 0x1020842040000 looks like hammer
9 infernalis features 0x1020842040000 looks like hammer
10 jewel features 0x401020842040000 looks like jewel
11 kraken features 0xc01020842040000 looks like kraken
12 luminous features 0xe01020842240000 looks like luminous
13 mimic features 0xe01020842240000 looks like luminous
14 nautilus features 0xe01020842240000 looks like luminous
15 octopus features 0xe01020842240000 looks like luminous
16 pacific features 0xe01020842240000 looks like luminous
17 quincy features 0xe01020842240000 looks like luminous
18 reef features 0xe010208d2240000 looks like reef
19 squid features 0xe010248d2240000 looks like squid
20 tentacle features 0xe010248d2240000 looks like squid
Updated by Radoslaw Zarzynski about 1 year ago
Another very important question about the testing cluster: is the require-min-compat-client set to reef? "ceph osd get-require-min-compat-client" should tell. If not, then even non-compliant clients can connect and trigger the assertion.
The min-compat is enforced by the upmap-primary management commands, but clients have been validated only since 18.2.4.
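As a rough illustration of that client-side enforcement (hypothetical logic and names, not Ceph's actual code path), the guard amounts to: if any pg-upmap-primary mappings exist, a connecting client must carry the full feature mask required by the min-compat release.

```python
# Hypothetical sketch of the 18.2.4-style guard described above (not
# Ceph's actual code): while pg_upmap_primary mappings are in use, a
# client must advertise every bit of the required feature mask.

def client_allowed(client_features, required_mask, have_upmap_primaries):
    """Reject clients lacking required_mask while upmap-primary is in use."""
    if not have_upmap_primaries:
        return True
    return (client_features & required_mask) == required_mask
```

Under this sketch, a cluster with no mappings accepts everyone, while a cluster with mappings turns away any client missing even one required bit.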
Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Laura Flores
- Priority changed from Normal to High
Laura is looking into the purge facilities.
Updated by Michal Strnad about 1 year ago
- File sessions_anonymized added
Hi.
Thank you very much for your response. The output of the command "ceph osd get-require-min-compat-client" returns reef.
I am attaching the output of "ceph daemon mon.monX sessions", where I have obfuscated/replaced sensitive data such as IP addresses and client IDs... but I have left the features unchanged.
Yes, the monitors were also upgraded to 18.2.4 during the test. I can repeat the test again if needed.
Once again, thank you very much for dedicating your time to this.
Cheers,
Michal Strnad
Updated by Radoslaw Zarzynski about 1 year ago
$ grep -r con_features_hex /tmp/sessions_anonymized | sort | uniq
"con_features_hex": "2f018fb87aa4aafe",
"con_features_hex": "3f01cfbffffdffff",
while
#define CEPH_FEATURE_INCARNATION_1 (0ull)
#define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // SERVER_JEWEL
#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
#define DEFINE_CEPH_FEATURE(bit, incarnation, name) \
const static uint64_t CEPH_FEATURE_##name = (1ULL<<bit); \
const static uint64_t CEPH_FEATUREMASK_##name = \
(1ULL<<bit | CEPH_FEATURE_INCARNATION_##incarnation);
// ...
DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
so SERVER_NAUTILUS requires 1<<2 | ((1<<57)|(1<<28)) (0x200000010000004) to be in the feature field.
Let's evaluate:
>>> hex(0x2f018fb87aa4aafe & 0x200000010000004)
'0x200000010000004'
>>> hex(0x3f01cfbffffdffff & 0x200000010000004)
'0x200000010000004'
In short: at this particular moment there was no pre-nautilus client in the cluster, which actually might be the case, as we know the crashes happen frequently but not all the time.
It might be that the old clients connect from time to time.
You might try looping over the sessions asok, hoping the script will win the race with a crash, or resort to analysis of the messenger's logs.
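The mask check used above can be wrapped into a small helper for scanning session dumps (the helper name is my own, not a Ceph tool): SERVER_NAUTILUS is bit 2 with incarnation 3, so its mask is (1<<2) | (1<<57) | (1<<28) = 0x200000010000004.

```python
# Sketch of the mask check above (helper name hypothetical): a session
# whose con_features lack any bit of the SERVER_NAUTILUS mask belongs to
# a pre-nautilus peer.
SERVER_NAUTILUS_MASK = (1 << 2) | (1 << 57) | (1 << 28)  # 0x200000010000004

def is_pre_nautilus(con_features_hex):
    """True if the session's features do not include the nautilus mask."""
    features = int(con_features_hex, 16)
    return (features & SERVER_NAUTILUS_MASK) != SERVER_NAUTILUS_MASK
```

Feeding it the two con_features_hex values from the anonymized sessions dump returns False for both, matching the conclusion that no pre-nautilus client was connected at that moment.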
Updated by Radoslaw Zarzynski about 1 year ago
Let me please re-ask about the testing cluster: is the require-min-compat-client set to reef there?
Updated by Michal Strnad about 1 year ago
Hi.
Yes, on the testing cluster require-min-compat-client is set to reef.
Regarding the upgrade: my colleague tried upgrading from 18.2.1 to 18.2.4 again and didn't encounter any monitor crashes, so the question is whether something else was introduced in the previous test ...
Thank you
Michal
Updated by Laura Flores about 1 year ago
- Related to Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force added
Updated by Laura Flores about 1 year ago · Edited
Hi Michal,
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Which version were the failed mon and OSDs running on? From the information on this tracker issue, it looks like the crash only occurred on v18.2.1 daemons. In other words, did any daemons running v18.2.4 hit the crash? Or were you only prevented from upgrading due to the crash happening on v18.2.1 daemons?
Also, do you know what conditions were different that led to your eventual successful upgrade to v18.2.4? Did you happen to evict any older pre-nautilus clients? It is known that v18.2.1 did not have proper guards to disallow older clients from connecting, so this is a possible scenario if the crash only happened for you on v18.2.1.
Updated by Laura Flores about 1 year ago
- Status changed from New to In Progress
Meant to update this to "In Progress" a while ago.
Updated by Michal Strnad about 1 year ago
Hi.
From the latest test results, it really seems that the monitor crashes related to pg-upmap-primaries are only on version 18.2.1. We haven't been able to reproduce this on 18.2.4, so unfortunately, I cannot properly answer your question about what ultimately led to the successful upgrade. Given the number of successful versus one failed measurement, I would consider this a possible measurement error.
During further testing of 18.2.4, which might partially resolve the situation thanks to the client version check you mentioned, we encountered other issues. Specifically, after upgrading the test Ceph cluster to 18.2.4, we were unable to mount the RBD image on a different machine (outside the cluster), or even on a machine that is part of the cluster. We tried a range of kernel versions with the Reef Ceph version, but none of them passed. In the client logs from the attempts to mount the RBD image, we found:
feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
From this, it seems that upgrading from 18.2.1 to 18.2.4 would solve the monitor crash issue, but then no client would be able to mount the RBD image.
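The "missing 80000000" in that kernel message is simply the set of server-required feature bits absent from the client's features; a quick check in Python:

```python
# Quick check of the "feature set mismatch" message above: the "missing"
# value is the set of required bits absent from the client's features.
client_features = 0x2f018fb87aa4aafe   # "my" in the kernel log
server_features = 0x2f018fb8faa4aafe   # "server's" in the kernel log

missing = server_features & ~client_features
print(hex(missing))
```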
The test cluster is accessible from the internet, and if you send us your public SSH key, we will grant you access. You can then try what you need, as this is a set of virtual machines in OpenStack, and we have snapshots created for them, so we can revert the state if necessary.
Thank you very much
Michal
Updated by Laura Flores about 1 year ago · Edited
Michal Strnad wrote in #note-17:
Hi.
From the latest test results, it really seems that the monitor crashes related to pg-upmap-primaries are only on version 18.2.1. We haven't been able to reproduce this on 18.2.4, so unfortunately, I cannot properly answer your question about what ultimately led to the successful upgrade. Given the number of successful versus one failed measurement, I would consider this a possible measurement error.
I see, thanks for clarifying. In that case, that bug is known and is tracked for versions < v18.2.4 in https://tracker.ceph.com/issues/61948. It has been fixed in v18.2.4, in which we provide proper guards to disallow old clients from connecting if pg_upmap_primary is in use.
During further testing of 18.2.4, which might partially resolve the situation due to the client version check, as you mentioned, we encountered other issues. Specifically, after upgrading the test Ceph cluster to 18.2.4, we were unable to mount the RBD image on a different machine (outside the cluster) even on a machine, which is part of the cluster. We tried a range of kernel versions with the Reef Ceph version, but none of them passed. In the client logs attempting to mount the RBD image, we found:
feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
This error message means that the fix in v18.2.4 is working, although a little too well in your case. Now, the cluster detects that pg_upmap_primary is in use, and the krbd client is too old to understand the feature. With the fix in v18.2.4, pg-upmap-primary or any reef-specific features should not be allowed if an old client is connected. However, since the mappings were added in your cluster in v18.2.1, they were incorrectly "allowed".
The remedy for this would be to apply `ceph osd rm-pg-upmap-primary <pgid>` to all pg-upmap-primary mappings in the osdmap, although I know that in your case you can't since the pool was deleted, and the command doesn't recognize the pgids anymore.
In v18.2.5, which is pending release soon, we fixed the pool deletion issue (tracked in https://tracker.ceph.com/issues/66867), which will prevent this from happening to existing and future pools.
We are also adding a new command that will allow users to remove all pg-upmap-primary mappings at once, which should allow you to cleanly remove the invalid mappings via `ceph osd rm-pg-upmap-primary-all`. This fix is tracked here: https://tracker.ceph.com/issues/67179
The latter will especially help in your case. Removing the mappings should allow the older krbd client to connect.
Updated by Laura Flores about 1 year ago
- Related to Bug #66867: pg_upmap_primary items are retained in OSD map for a pool which is already deleted added
Updated by Laura Flores about 1 year ago
- Status changed from In Progress to Duplicate
Updated by Laura Flores about 1 year ago
- Related to deleted (Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force)
Updated by Laura Flores about 1 year ago
- Is duplicate of Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force added
Updated by Laura Flores about 1 year ago
Users affected by [1] (a bug in which some pg_upmap_primary mappings are unable to be removed after pool deletion) may use the new command, `ceph osd rm-pg-upmap-primary-all`, to remove all pg-upmap-primary mappings from the osdmap.
Note that both valid and invalid pg-upmap-primary mappings will be removed, which is acceptable since there should be no data movement involved, and it is better algorithmically to start with fresh mappings. After running the command, the user may then rerun the read balancer manually if on Reef or Squid [2], or let the balancing happen automatically via the mgr module if on Squid [3].
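For operators who want to see which mappings are stale before (or after) cleaning up, here is a rough sketch of how one could enumerate pg_upmap_primary entries referencing deleted pools from the plain-text `ceph osd dump` output. The helper is hypothetical, not a Ceph tool; in practice, existing_pool_ids would come from `ceph osd lspools`.

```python
# Hypothetical helper (not part of Ceph): scan the plain-text output of
# `ceph osd dump` for pg_upmap_primary entries whose pool no longer
# exists, i.e. the stale mappings that rm-pg-upmap-primary-all removes.

def find_stale_upmap_primaries(osd_dump_text, existing_pool_ids):
    """Return pgids of pg_upmap_primary entries referencing deleted pools."""
    stale = []
    for line in osd_dump_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "pg_upmap_primary":
            pgid = parts[1]                    # e.g. "24.ffc"
            pool_id = int(pgid.split(".")[0])  # pool part of the pgid
            if pool_id not in existing_pool_ids:
                stale.append(pgid)
    return stale
```

Applied to the dump excerpt from the description ("pg_upmap_primary 24.ffc 232") with a pool list that no longer contains pool 24, the sketch would flag 24.ffc as stale.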
[1] https://tracker.ceph.com/issues/66867
[2] https://docs.ceph.com/en/reef/rados/operations/read-balancer/#offline-optimization
[3] https://docs.ceph.com/en/squid/rados/operations/read-balancer/#online-optimization