Bug #74379
ActivePyModules::get_store_prefix segfault during crash module load
Description
from tracker note: https://tracker.ceph.com/issues/68550#note-126
Updated by Geethu George 7 days ago
Hi Team,
We wanted to let you know that we currently cannot get any manager active until we disable the crash module. None of the managers will start when the crash module is enabled. This is a recent issue. Could you check the logs that were shared and share any suggestions?
#124
Updated by Geethu George 6 days ago
Hi Laura,
Thanks for the update. The logs shared earlier are the ones Junior requested after we deployed the custom packages he provided.
The issue with the managers not starting when the crash module is enabled appeared after that.
I will try to get the logs you requested.
#126
Updated by Geethu George 6 days ago
File clipboard-202601061212-ontlj.png added
File clipboard-202601061212-f9yom.png added
Hi team,
Below is the current cluster status.
The steps we took to stop the crash are below. We did not disable the crash module using the method you suggested.
mv /usr/share/ceph/mgr/crash /usr/share/ceph/mgr/crash.disabled
Restarted the manager.
I can no longer reproduce the crash on the other nodes where this mv was not done. Restarting seems to work now.
Below is the error we saw in the log when it was crashing the day before.
2026-01-01T01:09:50.292+0430 7fbaaf94b640 0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2026-01-01T01:09:50.292+0430 7fbaaf94b640 1 mgr load Constructed class from module: crash
2026-01-01T01:09:50.298+0430 7fbaa893d640 -1 *** Caught signal (Segmentation fault) **
in thread 7fbaa893d640 thread_name:crash
ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
1: /lib64/libc.so.6(+0x3ebf0) [0x7fbbd103ebf0]
2: /lib64/libpython3.9.so.1.0(+0x1209ec) [0x7fbbd27209ec]
3: (PyFormatter::dump_pyobject(std::basic_string_view<char, std::char_traits<char> >, _object*)+0x63) [0x55ddecbca5f3]
4: (ActivePyModules::get_store_prefix(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x340) [0x55ddecb358c0]
5: /usr/bin/ceph-mgr(+0x150273) [0x55ddecb4f273]
6: /lib64/libpython3.9.so.1.0(+0x15b427) [0x7fbbd275b427]
7: _PyEval_EvalFrameDefault()
8: /lib64/libpython3.9.so.1.0(+0x14173b) [0x7fbbd274173b]
9: _PyEval_EvalFrameDefault()
10: /lib64/libpython3.9.so.1.0(+0x14173b) [0x7fbbd274173b]
11: _PyEval_EvalFrameDefault()
12: /lib64/libpython3.9.so.1.0(+0x12ed59) [0x7fbbd272ed59]
13: _PyFunction_Vectorcall()
14: _PyEval_EvalFrameDefault()
15: /lib64/libpython3.9.so.1.0(+0x14173b) [0x7fbbd274173b]
16: /lib64/libpython3.9.so.1.0(+0x14fb85) [0x7fbbd274fb85]
17: /lib64/libpython3.9.so.1.0(+0x138825) [0x7fbbd2738825]
18: _PyObject_CallMethod_SizeT()
19: (PyModuleRunner::serve()+0x66) [0x55ddecbe0276]
20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x15d) [0x55ddecbe213d]
21: /lib64/libc.so.6(+0x8a19a) [0x7fbbd108a19a]
22: /lib64/libc.so.6(+0x10f210) [0x7fbbd110f210]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-1826> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command assert hook 0x55ddee72acb0
-1825> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command abort hook 0x55ddee72acb0
-1824> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command leak_some_memory hook 0x55ddee72acb0
-1823> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command perfcounters_dump hook 0x55ddee72acb0
-1822> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command 1 hook 0x55ddee72acb0
-1821> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command perf dump hook 0x55ddee72acb0
-1820> 2026-01-01T01:09:40.834+0430 7fbbd120e140 5 asok(0x55ddee7da000) register_command perfcounters_schema hook 0x55ddee72acb0
With this current status, do you still want me to follow the steps you suggested?
Updated by Nitzan Mordechai 2 months ago
The function:
PyObject *ActivePyModules::get_store_prefix(const std::string &module_name,
                                            const std::string &prefix) const
{
  without_gil_t no_gil;
  std::lock_guard l(lock);
  std::lock_guard lock(module_config.lock);
  no_gil.acquire_gil();
  const std::string base_prefix = PyModule::mgr_store_prefix
                                  + module_name + "/";
  const std::string global_prefix = base_prefix + prefix;
  dout(4) << __func__ << " prefix: " << global_prefix << dendl;
  PyFormatter f;
  for (auto p = store_cache.lower_bound(global_prefix);
       p != store_cache.end() && p->first.find(global_prefix) == 0; ++p) {
    f.dump_string(p->first.c_str() + base_prefix.size(), p->second);
  }
  return f.get();
}
We iterate over store_cache while calling into PyFormatter, which can potentially let another function modify store_cache. For example, update_kv_data may be called during a mgr restart. It can change store_cache, invalidate the std::map iterator in the middle of the loop, and cause the segfault.
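The hazard described above can be illustrated outside Ceph. The following is a minimal sketch of one common way to avoid it, and it is not the code from the pull request: the matching entries are copied while a lock is held, and all later work is done on the copy. StoreCache, set, erase and snapshot_prefix are hypothetical names invented for this example; only the lower_bound prefix scan mirrors the function quoted above.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the mgr key/value cache; this is not Ceph code.
class StoreCache {
public:
  void set(const std::string &key, const std::string &value) {
    std::lock_guard<std::mutex> l(lock_);
    cache_[key] = value;
  }

  void erase(const std::string &key) {
    std::lock_guard<std::mutex> l(lock_);
    cache_.erase(key);
  }

  // Copy every entry whose key starts with `prefix` while holding the lock.
  // The caller then works on the copy, so later changes to the map cannot
  // invalidate anything the caller is iterating over.
  std::vector<std::pair<std::string, std::string>>
  snapshot_prefix(const std::string &prefix) const {
    std::lock_guard<std::mutex> l(lock_);
    std::vector<std::pair<std::string, std::string>> out;
    // Keys are sorted, so all keys sharing the prefix are contiguous
    // starting at lower_bound(prefix).
    for (auto p = cache_.lower_bound(prefix);
         p != cache_.end() &&
         p->first.compare(0, prefix.size(), prefix) == 0;
         ++p) {
      out.emplace_back(p->first, p->second);
    }
    return out;
  }

private:
  mutable std::mutex lock_;
  std::map<std::string, std::string> cache_;
};
```

With this shape, a caller can take a snapshot for a prefix such as "mgr/crash/" and then format or otherwise process each pair without holding any reference into the map. The cost is one copy of the matching keys and values per call.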
Updated by Nitzan Mordechai 2 months ago
- Related to Bug #68550: Ceph MGR PG reporting the wrong state every other week added
Updated by Nitzan Mordechai 2 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 66891
Updated by Geethu George about 2 months ago
Hi Team
Is there any update on when the fix will be ready? We have managers crashing daily.
Updated by Geethu George about 1 month ago
Hi team,
Is there any ETA on the fix? It would help us with the managers crashing.
Updated by Geethu George 27 days ago
Hi Team,
Could you give an update on the status of the fix? It would help us with the managers that are currently crashing.
Thanks
Updated by Kamoltat (Junior) Sirivadhna 5 days ago
- Pull request ID changed from 66891 to 67022