
mgr/crash: Protect crash dictionary from concurrent modification#65975

Open
pdvian wants to merge 1 commit into ceph:main from pdvian:wip-crash-fix

Conversation


pdvian (Contributor) commented Oct 16, 2025

The crash module encountered a RuntimeError: dictionary changed size during iteration when an operation modified the self.crashes dictionary while the serve() thread was running its periodic cleanup in _prune().
This change avoids the concurrency issue by:

  • Applying the @with_crashes decorator (holds crashes_lock) to do_post.
  • Ensuring the ls() helper function explicitly acquires the crashes_lock before accessing the dictionary.
  • Replacing the lazy filter() usage in _prune() with a list comprehension.
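The locking scheme described in the bullets above can be sketched roughly as follows. This is a hedged, simplified sketch: the names crashes_lock, with_crashes, do_post, and _prune follow the PR text, but the Module class and method signatures here are illustrative assumptions, not the real src/pybind/mgr/crash code.

```python
import threading

crashes_lock = threading.Lock()

def with_crashes(func):
    """Serialize access to the shared crashes dict under crashes_lock."""
    def wrapper(self, *args, **kwargs):
        with crashes_lock:
            return func(self, *args, **kwargs)
    return wrapper

class Module:
    def __init__(self):
        self.crashes = {}

    @with_crashes
    def do_post(self, crashid, crashinfo):
        # Writers now hold crashes_lock, so _prune() can no longer observe
        # the dict changing size mid-iteration.
        self.crashes[crashid] = crashinfo

    @with_crashes
    def _prune(self, keep):
        # Snapshot the stale keys with a list comprehension before deleting,
        # instead of lazily filtering while mutating the dict.
        stale = [k for k in self.crashes if k not in keep]
        for k in stale:
            del self.crashes[k]
```

With both the writer and the pruner funneled through the same lock, the periodic cleanup in serve() and a concurrent do_post can no longer race on the dictionary.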

Fixes: https://tracker.ceph.com/issues/73561

Signed-off-by: Prashant D pdhange@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

The crash module encountered a RuntimeError: dictionary
changed size during iteration when an operation modified the
self.crashes dictionary while the serve() thread was running
its periodic cleanup in _prune().
This change avoids the concurrency issue by:
- Applying the @with_crashes decorator (holds crashes_lock) to do_post.
- Ensuring the ls() helper function explicitly acquires the crashes_lock
  before accessing the dictionary.
- Replacing the lazy filter() usage in _prune() with a list comprehension

Fixes: https://tracker.ceph.com/issues/73561

Signed-off-by: Prashant D <pdhange@redhat.com>
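The filter()-to-list-comprehension point from the commit message can be illustrated standalone. This is a minimal demonstration with assumed data shapes, not the real _prune() signature:

```python
# Unsafe: filter() is lazy, so the dict iteration interleaves with the
# deletes, and CPython raises RuntimeError on a size change mid-iteration.
crashes = {'a': 1, 'b': 2, 'c': 3}
try:
    for key in filter(lambda k: k != 'b', crashes):
        del crashes[key]
except RuntimeError as exc:
    print('lazy filter:', exc)

# Safe: the list comprehension materializes the keys before any deletion.
crashes = {'a': 1, 'b': 2, 'c': 3}
for key in [k for k in crashes if k != 'b']:
    del crashes[key]
print(sorted(crashes))  # ['b']
```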
@pdvian pdvian requested a review from rzarzynski October 16, 2025 06:27
@ljflores (Member) left a comment:

@pdvian this change caused a failure in teuthology. It's not a syntax problem; rather, it's a problem with module inter-dependency.

In this upgrade test, we run commands from the telemetry module before and after the upgrade to confirm they're working. On reef, the commands were running fine, but once the cluster was upgraded to main (including this patch), the ceph telemetry show command hangs:
/a/skanta-2025-11-18_15:07:03-rados-wip-bharath10-testing-2025-11-18-0557-distro-default-smithi/8609362

2025-11-18T17:11:11.867 INFO:tasks.workunit.client.0.smithi033.stderr:+ ceph telemetry show
...
2025-11-18T20:11:06.082 INFO:journalctl@ceph.mon.b.smithi070.stdout:Nov 18 20:11:05 smithi070 ceph-mon[93720]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.089 INFO:journalctl@ceph.mon.a.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[104464]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.090 INFO:journalctl@ceph.mon.c.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[106490]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:07.476 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-11-18T20:11:07.479 INFO:tasks.workunit:Stopping ['test_telemetry_reef_x.sh'] on client.0...
2025-11-18T20:11:07.479 DEBUG:teuthology.orchestra.run.smithi033:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-11-18T20:11:07.807 ERROR:teuthology.run_tasks:Saw exception from tasks.

I was able to reproduce this on a vstart cluster with the following steps:

$ git fetch ci
$ git checkout --track ci/wip-bharath10-testing-2025-11-18-0557
$ cd build
$ ninja vstart
$ OSD=4 ../src/vstart.sh --debug --new -x --localhost --bluestore
$ ./bin/ceph telemetry on
$ ./bin/ceph telemetry on --license sharing-1-0
$ ./bin/ceph telemetry show

I confirmed that the issue is from this PR by reverting the commit and rerunning ceph telemetry show, where it worked.

Looking into the ceph telemetry show command, I can see that it depends on collecting the output of ceph crash ls to generate the crash report:

errno, crashids, err = self.remote('crash', 'ls')

I think this is where the problem is. The crash command runs okay when invoked in the CLI, but when the telemetry module invokes it, the command gets stuck.

I would suggest modifying this so other mgr modules can still invoke the command.
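One plausible mechanism for this kind of hang, sketched under the assumption (not confirmed against the actual ceph code path) that the command handler and the ls() helper both take the same non-reentrant lock on the same thread:

```python
import threading

# A plain threading.Lock self-deadlocks on re-acquire from the same thread;
# here we show the blocked second acquire with a timeout instead of hanging.
plain = threading.Lock()
plain.acquire()
print(plain.acquire(timeout=0.1))  # False: second acquire never succeeds
plain.release()

# threading.RLock lets the same thread re-acquire, so an internal caller
# (e.g. another mgr module reaching the command via self.remote) can nest
# safely. Names below are illustrative, not the real crash-module code.
crashes_lock = threading.RLock()

def ls():
    with crashes_lock:          # re-acquired inside: fine with RLock
        return ['crash-1']

def do_ls():
    with crashes_lock:          # already held here...
        return ls()             # ...would deadlock with a plain Lock

print(do_ls())  # ['crash-1']
```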

@ljflores (Member) left a comment:

Explanation in above comment.

@rzarzynski rzarzynski added the DNM label Jan 21, 2026
@rzarzynski (Contributor) commented:

@pdvian: ping.


github-actions bot commented Feb 6, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved


4 participants