Bug #69760
Status: Closed
Monitors crash largely due to the structure of pg-upmap-primary
Description
Hi.
We ran into an issue with "pg-upmap-primary", which resulted in our monitors crashing massively (around 1,000 crashes per day). According to [1], it should be possible to remove these pg-upmap-primaries. Unfortunately, we are unable to do so because these PGs, and therefore the pool, no longer exist.
root@app001.clX ~ # ceph osd dump | grep 'pg_upmap_primary' | grep 24.ffc
pg_upmap_primary 24.ffc 232
root@app001.clX ~ # ceph osd rm-pg-upmap-primary 24.ffc
Error ENOENT: pgid '24.ffc' does not exist
root@app001.clX ~ # ceph pg dump | grep "^24\."
dumped all
Is there any way to remove these structures?
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Does anyone have a solution or an idea? This is becoming quite a problem for us.
Below, I am attaching one of the many monitor crash logs.
Thank you very much for any advice!
Michal
[1] https://tracker.ceph.com/issues/61948#note-32
{
"assert_condition": "pg_upmap_primaries.empty()",
"assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc",
"assert_func": "void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const",
"assert_line": 3239,
"assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f94216a3640 time 2025-02-02T19:16:03.629964+0100\n/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: 3239: FAILED ceph_assert(pg_upmap_primaries.empty())\n",
"assert_thread_name": "ms_dispatch",
"backtrace": [
"/lib64/libc.so.6(+0x54db0) [0x7f9429054db0]",
"/lib64/libc.so.6(+0xa365c) [0x7f94290a365c]",
"raise()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f9429d630df]",
"/usr/lib64/ceph/libceph-common.so.2(+0x163243) [0x7f9429d63243]",
"/usr/lib64/ceph/libceph-common.so.2(+0x1a0f38) [0x7f9429da0f38]",
"(OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xe2) [0x55ca54957e22]",
"(OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1de) [0x55ca549596ae]",
"(OSDMonitor::build_latest_full(unsigned long)+0x2a3) [0x55ca549599a3]",
"(OSDMonitor::check_osdmap_sub(Subscription*)+0xc8) [0x55ca5495be98]",
"(Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xf04) [0x55ca54834dd4]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x6a6) [0x55ca548359c6]",
"(Monitor::_ms_dispatch(Message*)+0x779) [0x55ca54836d59]",
"/usr/bin/ceph-mon(+0x2f3dfe) [0x55ca547f1dfe]",
"(DispatchQueue::entry()+0x52a) [0x7f9429f5766a]",
"/usr/lib64/ceph/libceph-common.so.2(+0x3e7321) [0x7f9429fe7321]",
"/lib64/libc.so.6(+0xa1912) [0x7f94290a1912]",
"/lib64/libc.so.6(+0x3f450) [0x7f942903f450]"
],
"ceph_version": "18.2.1",
"crash_id": "2025-02-02T18:16:03.632571Z_f5516ed0-6df5-4267-bada-71f5d8d764ba",
"entity_name": "mon.mon001-clX",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "9",
"os_version_id": "9",
"process_name": "ceph-mon",
"stack_sig": "772ef523b041edc5147d1d9905926fb794d32b2635368a8199f6e2e4f2d688bf",
"timestamp": "2025-02-02T18:16:03.632571Z",
"utsname_hostname": "app001.clX",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-402.el9.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Dec 21 19:46:35 UTC 2023"
}
Updated by Michal Strnad about 1 year ago
Hi,
would anyone happen to have any suggestions on how to address this? We're still operating on the edge and have lost monitor quorum multiple times due to the high frequency of crashes, as I previously described.
Any feedback would be greatly appreciated. Thank you!
Michal
Updated by Laura Flores about 1 year ago
- Project changed from Ceph to RADOS
- Category changed from Monitor to Monitor
Updated by Yaarit Hatuka about 1 year ago
Crashes reported in telemetry with the same assert_func and assert_condition:
http://telemetry.front.sepia.ceph.com:4000/d/Nvj6XTaMk/spec-search?orgId=1&var-substr_1=&var-substr_2=&var-substr_3=&var-majors_affected=&var-minors_affected=&var-assert_function=void%20OSDMap::encode(ceph::buffer::v15_2_0::list%26,%20uint64_t)%20const&var-assert_condition=pg_upmap_primaries.empty&var-sig_v1=&var-sig_v2=&var-daemons=&var-only_new_fingerprints=false&var-status_description=All&var-only_open=false
This one seems related:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=ccf7b97344c6c787cba071e014d31eeae4ae7710a07be998d4976461dfd5735a&orgId=1
Updated by Radoslaw Zarzynski about 1 year ago · Edited
root@app001.clX ~ # ceph osd dump | grep 'pg_upmap_primary' | grep 24.ffc
pg_upmap_primary 24.ffc 232
root@app001.clX ~ # ceph osd rm-pg-upmap-primary 24.ffc
Error ENOENT: pgid '24.ffc' does not exist
root@app001.clX ~ # ceph pg dump | grep "^24\."
dumped all
Yeah, we should have --force for the rm-pg-upmap-primary to bypass the two pool-related checks.
Is there any way to remove these structures?
Without further patches, only very low-level, hackish editing of the maps comes to mind.
Finally, we should get support in osdmaptool for purging all the upmap-primary mappings.
"ceph_version": "18.2.1",
This is interesting. 18.2.1 takes the assert path only for very old, pre-nautilus peers.
Perhaps there is a really outdated client in the cluster. If so, cutting it off can help. The output of "ceph daemon mon.X sessions" or the messenger's logs (debug_ms set to at least 10) will be useful:
connection->set_features((uint64_t)connect_reply.features &
(uint64_t)connection->policy.features_supported);
ldout(cct, 10) << __func__ << " connect success " << connect_seq
<< ", lossy = " << connection->policy.lossy << ", features "
<< connection->get_features() << dendl;
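The C++ snippet above shows the key point: a connection's effective feature set is the bitwise AND of what the peer advertises and what the local messenger policy supports. A toy illustration in Python, using made-up sample values:

```python
# Toy illustration (sample values, not taken from any real cluster) of the
# negotiation in the C++ snippet above: the connection's feature set is the
# bitwise AND of the features the peer advertises and the features the
# local policy supports.
peer_features = 0x3f01cfbffffdffff      # advertised by the remote side
policy_supported = 0x2f018fb87aa4aafe   # supported by the local policy

con_features = peer_features & policy_supported
print(hex(con_features))
```

Any feature bit missing from either side is therefore absent from the negotiated connection features, which is what makes the per-session feature dump useful for spotting old peers.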
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Do you have logs by any chance?
Updated by Radoslaw Zarzynski about 1 year ago
Another question about the testing cluster: have all the monitors been upgraded to 18.2.4? I see a scenario (around encode_features of an OSDMap::Incremental) in which an 18.2.0 or 18.2.1 (but not 18.2.2) mon can cause newer mons to take the assert path.
Also, the snippet below might help with nailing down very-old-clients:
$ bin/unittest_features
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from features
[ RUN      ] features.release_features
1 argonaut features 0x40000 looks like argonaut
2 bobtail features 0x40000 looks like argonaut
3 cuttlefish features 0x40000 looks like argonaut
4 dumpling features 0x42040000 looks like dumpling
5 emperor features 0x42040000 looks like dumpling
6 firefly features 0x20842040000 looks like firefly
7 giant features 0x20842040000 looks like firefly
8 hammer features 0x1020842040000 looks like hammer
9 infernalis features 0x1020842040000 looks like hammer
10 jewel features 0x401020842040000 looks like jewel
11 kraken features 0xc01020842040000 looks like kraken
12 luminous features 0xe01020842240000 looks like luminous
13 mimic features 0xe01020842240000 looks like luminous
14 nautilus features 0xe01020842240000 looks like luminous
15 octopus features 0xe01020842240000 looks like luminous
16 pacific features 0xe01020842240000 looks like luminous
17 quincy features 0xe01020842240000 looks like luminous
18 reef features 0xe010208d2240000 looks like reef
19 squid features 0xe010248d2240000 looks like squid
20 tentacle features 0xe010248d2240000 looks like squid
Updated by Radoslaw Zarzynski about 1 year ago
Another very important question about the testing cluster: is the require-min-compat-client set to reef? "ceph osd get-require-min-compat-client" should tell. If not, then even non-compliant clients can connect and trigger the assertion.
The min-compat is enforced by the upmap-primary management commands, but clients have been validated only since 18.2.4.
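As a rough illustration of that client-side enforcement (hypothetical logic and names, not Ceph's actual code path), the guard amounts to: if any pg-upmap-primary mappings exist, a connecting client must carry the full feature mask required by the min-compat release.

```python
# Hypothetical sketch of the 18.2.4-style guard described above (not
# Ceph's actual code): while pg_upmap_primary mappings are in use, a
# client must advertise every bit of the required feature mask.

def client_allowed(client_features, required_mask, have_upmap_primaries):
    """Reject clients lacking required_mask while upmap-primary is in use."""
    if not have_upmap_primaries:
        return True
    return (client_features & required_mask) == required_mask
```

Under this sketch, a cluster with no mappings accepts everyone, while a cluster with mappings turns away any client missing even one required bit.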
Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Laura Flores
- Priority changed from Normal to High
Laura is looking into the purge facilities.
Updated by Michal Strnad about 1 year ago
- File sessions_anonymized added
Hi.
Thank you very much for your response. The output of the command "ceph osd get-require-min-compat-client" returns reef.
I am attaching the output of "ceph daemon mon.monX sessions", where I have obfuscated/replaced sensitive data such as IP addresses and client IDs... but I have left the features unchanged.
Yes, the monitors were also upgraded to 18.2.4 during the test. I can repeat the test again if needed.
Once again, thank you very much for dedicating your time to this.
Cheers,
Michal Strnad
Updated by Radoslaw Zarzynski about 1 year ago
$ grep -r con_features_hex /tmp/sessions_anonymized | sort | uniq
"con_features_hex": "2f018fb87aa4aafe",
"con_features_hex": "3f01cfbffffdffff",
while
#define CEPH_FEATURE_INCARNATION_1 (0ull)
#define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // SERVER_JEWEL
#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
#define DEFINE_CEPH_FEATURE(bit, incarnation, name) \
const static uint64_t CEPH_FEATURE_##name = (1ULL<<bit); \
const static uint64_t CEPH_FEATUREMASK_##name = \
(1ULL<<bit | CEPH_FEATURE_INCARNATION_##incarnation);
// ...
DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
so SERVER_NAUTILUS requires 1<<2 | ((1<<57)|(1<<28)) (0x200000010000004) to be in the feature field.
Let's evaluate:
>>> hex(0x2f018fb87aa4aafe & 0x200000010000004)
'0x200000010000004'
>>> hex(0x3f01cfbffffdffff & 0x200000010000004)
'0x200000010000004'
In short: at this particular moment there was no pre-nautilus client in the cluster, which actually might be the case, as we know the crashes happen frequently but not all the time.
It might be that the old clients connect from time to time.
You might try looping over the sessions asok, hoping the script will win the race with a crash, or resort to analysis of the messenger's logs.
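The mask check used above can be wrapped into a small helper for scanning session dumps (the helper name is my own, not a Ceph tool): SERVER_NAUTILUS is bit 2 with incarnation 3, so its mask is (1<<2) | (1<<57) | (1<<28) = 0x200000010000004.

```python
# Sketch of the mask check above (helper name hypothetical): a session
# whose con_features lack any bit of the SERVER_NAUTILUS mask belongs to
# a pre-nautilus peer.
SERVER_NAUTILUS_MASK = (1 << 2) | (1 << 57) | (1 << 28)  # 0x200000010000004

def is_pre_nautilus(con_features_hex):
    """True if the session's features do not include the nautilus mask."""
    features = int(con_features_hex, 16)
    return (features & SERVER_NAUTILUS_MASK) != SERVER_NAUTILUS_MASK
```

Feeding it the two con_features_hex values from the anonymized sessions dump returns False for both, matching the conclusion that no pre-nautilus client was connected at that moment.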
Updated by Radoslaw Zarzynski about 1 year ago
Let me please re-ask about the testing cluster: is the require-min-compat-client set to reef there?
Updated by Michal Strnad about 1 year ago
Hi.
Yes, on the testing cluster require-min-compat-client is set to reef.
Regarding the upgrade: my colleague tried upgrading from 18.2.1 to 18.2.4 again and didn't encounter any monitor crashes, so the question is whether something else was introduced in the previous test ...
Thank you
Michal
Updated by Laura Flores about 1 year ago
- Related to Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force added
Updated by Laura Flores about 1 year ago · Edited
Hi Michal,
We also tried upgrading from the current version 18.2.1 to 18.2.4, but this led to a state on our three-node test cluster where one of the three monitors failed to start, along with a third of the OSDs, due to issues with the mentioned structure. Restarting the daemon didn’t help.
Which version were the failed mon and OSDs running on? From the information on this tracker issue, it looks like the crash only occurred on v18.2.1 daemons. In other words, did any daemons running v18.2.4 hit the crash? Or were you only prevented from upgrading due to the crash happening on v18.2.1 daemons?
Also, do you know what conditions were different that led to your eventual successful upgrade to v18.2.4? Did you happen to evict any older pre-nautilus clients? It is known that v18.2.1 did not have proper guards to disallow older clients from connecting, so this is a possible scenario if the crash only happened for you on v18.2.1.
Updated by Laura Flores about 1 year ago
- Status changed from New to In Progress
Meant to update this to "In Progress" a while ago.
Updated by Michal Strnad about 1 year ago
Hi.
From the latest test results, it really seems that the monitor crashes related to pg-upmap-primaries are only on version 18.2.1. We haven't been able to reproduce this on 18.2.4, so unfortunately, I cannot properly answer your question about what ultimately led to the successful upgrade. Given the number of successful versus one failed measurement, I would consider this a possible measurement error.
During further testing of 18.2.4, which might partially resolve the situation thanks to the client version check you mentioned, we encountered other issues. Specifically, after upgrading the test Ceph cluster to 18.2.4, we were unable to mount the RBD image on a different machine (outside the cluster), or even on a machine that is part of the cluster. We tried a range of kernel versions with the Reef Ceph version, but none of them passed. In the client logs from the attempts to mount the RBD image, we found:
feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
From this, it seems that upgrading from 18.2.1 to 18.2.4 would solve the monitor crash issue, but then no client would be able to mount the RBD image.
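The "missing 80000000" in that kernel message is simply the set of server-required feature bits absent from the client's features; a quick check in Python:

```python
# Quick check of the "feature set mismatch" message above: the "missing"
# value is the set of required bits absent from the client's features.
client_features = 0x2f018fb87aa4aafe   # "my" in the kernel log
server_features = 0x2f018fb8faa4aafe   # "server's" in the kernel log

missing = server_features & ~client_features
print(hex(missing))
```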
The test cluster is accessible from the internet, and if you send us your public SSH key, we will grant you access. You can then try what you need, as this is a set of virtual machines in OpenStack, and we have snapshots created for them, so we can revert the state if necessary.
Thank you very much
Michal
Updated by Laura Flores about 1 year ago · Edited
Michal Strnad wrote in #note-17:
Hi.
From the latest test results, it really seems that the monitor crashes related to pg-upmap-primaries are only on version 18.2.1. We haven't been able to reproduce this on 18.2.4, so unfortunately, I cannot properly answer your question about what ultimately led to the successful upgrade. Given the number of successful versus one failed measurement, I would consider this a possible measurement error.
I see, thanks for clarifying. In that case, that bug is known and is tracked for versions < v18.2.4 in https://tracker.ceph.com/issues/61948. It has been fixed in v18.2.4, in which we provide proper guards to disallow old clients from connecting if pg_upmap_primary is in use.
During further testing of 18.2.4, which might partially resolve the situation due to the client version check, as you mentioned, we encountered other issues. Specifically, after upgrading the test Ceph cluster to 18.2.4, we were unable to mount the RBD image on a different machine (outside the cluster) even on a machine, which is part of the cluster. We tried a range of kernel versions with the Reef Ceph version, but none of them passed. In the client logs attempting to mount the RBD image, we found:
feature set mismatch, my 2f018fb87aa4aafe < server's 2f018fb8faa4aafe, missing 80000000
This error message means that the fix in v18.2.4 is working, although a little too well in your case. Now, the cluster detects that pg_upmap_primary is in use, and the krbd client is too old to understand the feature. With the fix in v18.2.4, pg-upmap-primary or any reef-specific features should not be allowed if an old client is connected. However, since the mappings were added in your cluster in v18.2.1, they were incorrectly "allowed".
The remedy for this would be to apply `ceph osd rm-pg-upmap-primary <pgid>` to all pg-upmap-primary mappings in the osdmap, although I know that in your case you can't since the pool was deleted, and the command doesn't recognize the pgids anymore.
In v18.2.5, which is pending release soon, we fixed the pool deletion issue (tracked in https://tracker.ceph.com/issues/66867), which will prevent this from happening to existing and future pools.
We are also adding a new command that will allow users to remove all pg-upmap-primary mappings at once, which should allow you to cleanly remove the invalid mappings via `ceph osd rm-pg-upmap-primary-all`. This fix is tracked here: https://tracker.ceph.com/issues/67179
The latter will especially help in your case. Removing the mappings should allow the older krbd client to connect.
Updated by Laura Flores about 1 year ago
- Related to Bug #66867: pg_upmap_primary items are retained in OSD map for a pool which is already deleted added
Updated by Laura Flores about 1 year ago
- Status changed from In Progress to Duplicate
Updated by Laura Flores about 1 year ago
- Related to deleted (Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force)
Updated by Laura Flores about 1 year ago
- Is duplicate of Bug #67179: Make rm-pg-upmap-primary able to remove mappings by force added
Updated by Laura Flores about 1 year ago
Users affected by [1] (a bug in which some pg_upmap_primary mappings are unable to be removed after pool deletion) may use the new command, `ceph osd rm-pg-upmap-primary-all`, to remove all pg-upmap-primary mappings from the osdmap.
Note that both valid and invalid pg-upmap-primary mappings will be removed, which is acceptable since there should be no data movement involved, and it is better algorithmically to start with fresh mappings. After running the command, the user may then rerun the read balancer manually if on Reef or Squid [2], or let the balancing happen automatically via the mgr module if on Squid [3].
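For operators who want to see which mappings are stale before (or after) cleaning up, here is a rough sketch of how one could enumerate pg_upmap_primary entries referencing deleted pools from the plain-text `ceph osd dump` output. The helper is hypothetical, not a Ceph tool; in practice, existing_pool_ids would come from `ceph osd lspools`.

```python
# Hypothetical helper (not part of Ceph): scan the plain-text output of
# `ceph osd dump` for pg_upmap_primary entries whose pool no longer
# exists, i.e. the stale mappings that rm-pg-upmap-primary-all removes.

def find_stale_upmap_primaries(osd_dump_text, existing_pool_ids):
    """Return pgids of pg_upmap_primary entries referencing deleted pools."""
    stale = []
    for line in osd_dump_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "pg_upmap_primary":
            pgid = parts[1]                    # e.g. "24.ffc"
            pool_id = int(pgid.split(".")[0])  # pool part of the pgid
            if pool_id not in existing_pool_ids:
                stale.append(pgid)
    return stale
```

Applied to the dump excerpt from the description ("pg_upmap_primary 24.ffc 232") with a pool list that no longer contains pool 24, the sketch would flag 24.ffc as stale.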
[1] https://tracker.ceph.com/issues/66867
[2] https://docs.ceph.com/en/reef/rados/operations/read-balancer/#offline-optimization
[3] https://docs.ceph.com/en/squid/rados/operations/read-balancer/#online-optimization