
mgr/crash: Protect crash dictionary from concurrent modification#65975

Open
pdvian wants to merge 1 commit into ceph:main from pdvian:wip-crash-fix

Conversation


pdvian (Contributor) commented Oct 16, 2025

The crash module encountered a RuntimeError: dictionary changed size during iteration when an operation modified the self.crashes dictionary while the serve() thread was running its periodic cleanup in _prune().
This change avoids the concurrency issue by:

  • Applying the @with_crashes decorator (holds crashes_lock) to do_post.
  • Ensuring the ls() helper function explicitly acquires the crashes_lock before accessing the dictionary.
  • Replacing the lazy filter() usage in _prune() with a list comprehension.
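The locking scheme described in the bullets above can be sketched roughly as follows. This is a hedged, simplified sketch: the names crashes_lock, with_crashes, do_post, and _prune follow the PR text, but the Module class and method signatures here are illustrative assumptions, not the real src/pybind/mgr/crash code.

```python
import threading

crashes_lock = threading.Lock()

def with_crashes(func):
    """Serialize access to the shared crashes dict under crashes_lock."""
    def wrapper(self, *args, **kwargs):
        with crashes_lock:
            return func(self, *args, **kwargs)
    return wrapper

class Module:
    def __init__(self):
        self.crashes = {}

    @with_crashes
    def do_post(self, crashid, crashinfo):
        # Writers now hold crashes_lock, so _prune() can no longer observe
        # the dict changing size mid-iteration.
        self.crashes[crashid] = crashinfo

    @with_crashes
    def _prune(self, keep):
        # Snapshot the stale keys with a list comprehension before deleting,
        # instead of lazily filtering while mutating the dict.
        stale = [k for k in self.crashes if k not in keep]
        for k in stale:
            del self.crashes[k]
```

With both the writer and the pruner funneled through the same lock, the periodic cleanup in serve() and a concurrent do_post can no longer race on the dictionary.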

Fixes: https://tracker.ceph.com/issues/73561

Signed-off-by: Prashant D pdhange@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

The crash module encountered a RuntimeError: dictionary
changed size during iteration when an operation modified the
self.crashes dictionary while the serve() thread was running
its periodic cleanup in _prune().
This change avoids the concurrency issue by:
- Applying the @with_crashes decorator (holds crashes_lock) to do_post.
- Ensuring the ls() helper function explicitly acquires the crashes_lock
  before accessing the dictionary.
- Replacing the lazy filter() usage in _prune() with a list comprehension

Fixes: https://tracker.ceph.com/issues/73561

Signed-off-by: Prashant D <pdhange@redhat.com>
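The filter()-to-list-comprehension point from the commit message can be illustrated standalone. This is a minimal demonstration with assumed data shapes, not the real _prune() signature:

```python
# Unsafe: filter() is lazy, so the dict iteration interleaves with the
# deletes, and CPython raises RuntimeError on a size change mid-iteration.
crashes = {'a': 1, 'b': 2, 'c': 3}
try:
    for key in filter(lambda k: k != 'b', crashes):
        del crashes[key]
except RuntimeError as exc:
    print('lazy filter:', exc)

# Safe: the list comprehension materializes the keys before any deletion.
crashes = {'a': 1, 'b': 2, 'c': 3}
for key in [k for k in crashes if k != 'b']:
    del crashes[key]
print(sorted(crashes))  # ['b']
```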
@pdvian pdvian requested a review from rzarzynski October 16, 2025 06:27
@ljflores (Member) left a comment:

@pdvian this change caused a failure in teuthology. It's not a syntax problem; rather, it's a problem with module inter-dependency.

In this upgrade test, we run commands from the telemetry module before and after the upgrade to confirm they're working. On reef, the commands were running fine, but once the cluster was upgraded to main (including this patch), the ceph telemetry show command hangs:
/a/skanta-2025-11-18_15:07:03-rados-wip-bharath10-testing-2025-11-18-0557-distro-default-smithi/8609362

2025-11-18T17:11:11.867 INFO:tasks.workunit.client.0.smithi033.stderr:+ ceph telemetry show
...
2025-11-18T20:11:06.082 INFO:journalctl@ceph.mon.b.smithi070.stdout:Nov 18 20:11:05 smithi070 ceph-mon[93720]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.089 INFO:journalctl@ceph.mon.a.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[104464]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:06.090 INFO:journalctl@ceph.mon.c.smithi033.stdout:Nov 18 20:11:05 smithi033 ceph-mon[106490]: pgmap v10988: 105 pgs: 105 active+clean; 4.6 MiB data, 4.2 GiB used, 711 GiB / 715 GiB avail
2025-11-18T20:11:07.476 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-11-18T20:11:07.479 INFO:tasks.workunit:Stopping ['test_telemetry_reef_x.sh'] on client.0...
2025-11-18T20:11:07.479 DEBUG:teuthology.orchestra.run.smithi033:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-11-18T20:11:07.807 ERROR:teuthology.run_tasks:Saw exception from tasks.

I was able to reproduce this on a vstart cluster with the following steps:

$ git fetch ci
$ git checkout --track ci/wip-bharath10-testing-2025-11-18-0557
$ cd build
$ ninja vstart
$ OSD=4 ../src/vstart.sh --debug --new -x --localhost --bluestore
$ ./bin/ceph telemetry on
$ ./bin/ceph telemetry on --license sharing-1-0
$ ./bin/ceph telemetry show

I confirmed that the issue is from this PR by reverting the commit and rerunning ceph telemetry show, where it worked.

Looking into the ceph telemetry show command, I can see that it depends on collecting the output of ceph crash ls to generate the crash report:

errno, crashids, err = self.remote('crash', 'ls')

I think this is where the problem is. The crash command runs okay when invoked in the CLI, but when the telemetry module invokes it, the command gets stuck.

I would suggest modifying this so other mgr modules can still invoke the command.
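One plausible mechanism for this kind of hang, sketched under the assumption (not confirmed against the actual ceph code path) that the command handler and the ls() helper both take the same non-reentrant lock on the same thread:

```python
import threading

# A plain threading.Lock self-deadlocks on re-acquire from the same thread;
# here we show the blocked second acquire with a timeout instead of hanging.
plain = threading.Lock()
plain.acquire()
print(plain.acquire(timeout=0.1))  # False: second acquire never succeeds
plain.release()

# threading.RLock lets the same thread re-acquire, so an internal caller
# (e.g. another mgr module reaching the command via self.remote) can nest
# safely. Names below are illustrative, not the real crash-module code.
crashes_lock = threading.RLock()

def ls():
    with crashes_lock:          # re-acquired inside: fine with RLock
        return ['crash-1']

def do_ls():
    with crashes_lock:          # already held here...
        return ls()             # ...would deadlock with a plain Lock

print(do_ls())  # ['crash-1']
```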

@ljflores (Member) left a comment:

Explanation in above comment.

@rzarzynski rzarzynski added the DNM label Jan 21, 2026
@rzarzynski (Contributor) commented:

@pdvian: ping.


github-actions bot commented Feb 6, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved


4 participants