mgr/crash: Protect crash dictionary from concurrent modification #65975
The crash module encountered a RuntimeError: dictionary changed size during iteration when an operation modified the self.crashes dictionary while the serve() thread was running its periodic cleanup in _prune().

This change avoids the concurrency issue by:
- Applying the @with_crashes decorator (which holds crashes_lock) to do_post.
- Ensuring the ls() helper function explicitly acquires crashes_lock before accessing the dictionary.
- Replacing the lazy filter() usage in _prune with a list comprehension.

Fixes: https://tracker.ceph.com/issues/73561
Signed-off-by: Prashant D <pdhange@redhat.com>
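
The failure mode and the locking pattern described above can be sketched as follows. This is a minimal illustration, not the crash module's code: CrashStore, post, prune, and unsafe_prune are hypothetical names, and the body of with_crashes is an assumption based only on the description that it holds crashes_lock.

```python
import threading
from functools import wraps


def with_crashes(func):
    """Assumed behaviour: hold self.crashes_lock while the method runs."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        with self.crashes_lock:
            return func(self, *args, **kwargs)
    return wrapper


class CrashStore:
    """Hypothetical stand-in for a module that keeps a shared crash dict."""

    def __init__(self):
        self.crashes = {}
        self.crashes_lock = threading.Lock()

    @with_crashes
    def post(self, crash_id, age):
        # Writers take the lock, so they cannot interleave with prune().
        self.crashes[crash_id] = {"age": age}

    @with_crashes
    def prune(self, max_age):
        # The list comprehension finishes iterating before any deletion,
        # so the dict is never resized while it is being walked.
        stale = [cid for cid, c in self.crashes.items() if c["age"] > max_age]
        for cid in stale:
            del self.crashes[cid]
        return stale


def unsafe_prune(crashes, max_age):
    # filter() is lazy: it is still iterating crashes.items() when the
    # loop body deletes an entry, which raises RuntimeError.
    stale = filter(lambda item: item[1]["age"] > max_age, crashes.items())
    for cid, _ in stale:
        del crashes[cid]
```

Note that the lock and the eager list address different things: the list comprehension prevents a function from invalidating its own iterator, while the lock prevents another thread from resizing the dictionary mid-iteration.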
ljflores
left a comment
@pdvian this change caused a failure in teuthology. It's not a syntax problem but a problem with module inter-dependency.
In this upgrade test, we run commands from the telemetry module before and after the upgrade to confirm they're working. On reef, the commands were running fine, but once the cluster was upgraded to main (including this patch), the ceph telemetry show command hangs:
/a/skanta-2025-11-18_15:07:03-rados-wip-bharath10-testing-2025-11-18-0557-distro-default-smithi/8609362
2025-11-18T17:11:11.867 INFO:tasks.workunit.client.0.smithi033.stderr:+ ceph telemetry show
...
2025-11-18T20:11:06.082 INFO:journalctl@ceph.mon.b.smithi070.stdout:Nov 18 20:11:05 smithi070 ceph-mon[93720]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.089 INFO:journalctl@ceph.mon.a.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[104464]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.090 INFO:journalctl@ceph.mon.c.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[106490]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:07.476 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-11-18T20:11:07.479 INFO:tasks.workunit:Stopping ['test_telemetry_reef_x.sh'] on client.0...
2025-11-18T20:11:07.479 DEBUG:teuthology.orchestra.run.smithi033:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-11-18T20:11:07.807 ERROR:teuthology.run_tasks:Saw exception from tasks.
I was able to reproduce this on a vstart cluster with the following steps:
$ git fetch ci
$ git checkout --track ci/wip-bharath10-testing-2025-11-18-0557
$ cd build
$ ninja vstart
$ OSD=4 ../src/vstart.sh --debug --new -x --localhost --bluestore
$ ./bin/ceph telemetry on
$ ./bin/ceph telemetry on --license sharing-1-0
$ ./bin/ceph telemetry show
I confirmed that the issue comes from this PR by reverting the commit and rerunning ceph telemetry show, after which the command worked.
Looking into the ceph telemetry show command, I can see that it depends on collecting the output of ceph crash ls to generate the crash report:
ceph/src/pybind/mgr/telemetry/module.py, line 779 in f4e964e
I think this is where the problem is. The crash command runs okay when invoked in the CLI, but when the telemetry module invokes it, the command gets stuck.
I would suggest modifying this so other mgr modules can still invoke the command.
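
One general way a lock-protected helper can hang is when it is reached from a code path that already holds the same non-reentrant lock. Whether that is what happens in the telemetry-to-crash call path has not been established in this thread; the sketch below only illustrates the mechanism, and nested_acquire is a hypothetical helper.

```python
import threading


def nested_acquire(lock, timeout=0.1):
    """Return True if the current thread can take `lock` a second time."""
    with lock:
        # A plain Lock blocks here until the timeout expires;
        # an RLock lets the owning thread re-enter immediately.
        got = lock.acquire(timeout=timeout)
        if got:
            lock.release()
        return got


plain_ok = nested_acquire(threading.Lock())
reentrant_ok = nested_acquire(threading.RLock())
```

Without the timeout, the nested acquire on the plain Lock would block forever, which would look like a command that never returns.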
ljflores
left a comment
Explanation in above comment.
@pdvian: ping.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.