Skip to content

osd: Access/Modify epoch maps under mutex in OSDSuperblock class#62916

Merged
NitzanMordhai merged 1 commit intoceph:mainfrom
mohit84:tick_without_osd_crash
Jun 23, 2025
Merged

osd: Access/Modify epoch maps under mutex in OSDSuperblock class#62916
NitzanMordhai merged 1 commit intoceph:mainfrom
mohit84:tick_without_osd_crash

Conversation

@mohit84
Copy link
Contributor

@mohit84 mohit84 commented Apr 22, 2025

The OSDSuperblock object access/modify epoch maps in multiple threads simultaneously due to that OSD is getting crashed. To avoid the crash access the maps under mutex.

Fixes: https://tracker.ceph.com/issues/66819

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

@mohit84 mohit84 requested a review from a team as a code owner April 22, 2025 15:16
@mohit84 mohit84 requested a review from rzarzynski April 22, 2025 15:16
@github-actions github-actions bot added the core label Apr 22, 2025
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 22, 2025

jenkins test make check

@mohit84 mohit84 force-pushed the tick_without_osd_crash branch from 25f8c8a to 289631e Compare April 23, 2025 10:14
@mohit84 mohit84 requested a review from a team as a code owner April 23, 2025 10:14
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 23, 2025

jenkins test make check

@mohit84
Copy link
Contributor Author

mohit84 commented Apr 25, 2025

jenkins test make check


class OSDSuperblock {
private:
mutable std::shared_ptr<std::mutex> map_lock = std::make_shared<std::mutex>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared_ptr makes me worried about copy-constructing. Have you tried unique_ptr?
Do we ever copy-construct OSDSuperblock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i have tried there are multiple places where we do copy the object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, this aggregates the critical sections between all these instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the maps to a new class MapContainer and that class is using mutex to get/set the maps.

@mohit84 mohit84 force-pushed the tick_without_osd_crash branch from 289631e to cd06003 Compare May 19, 2025 12:52
@mohit84
Copy link
Contributor Author

mohit84 commented May 19, 2025

jenkins test make check

Copy link
Contributor

@rzarzynski rzarzynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM! Just some nits.

interval_set<epoch_t> maps;

epoch_t get_oldest_map() const {
epoch_t _get_newest_map() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these two methods got flapped got flapped which makes reading the diff a bit harder. Not a blocker, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public:
MapContainer() = default;
MapContainer(const MapContainer& other)
: map_lock(ceph::make_mutex("map_lock"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is not giving the name for default-constructed instances intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it is always recommended to use default instead of declaring empty constructor so i use default, please let me know if you want to change it to empty constructor something like MapContainer () {}.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm taking my previous comment back. There is the initializer:

  mutable ceph::mutex map_lock = ceph::make_mutex("map_lock");

int32_t whoami = -1; // my role in this fs.
epoch_t current_epoch = 0; // most recent epoch
interval_set<epoch_t> maps; // oldest/newest maps we have.
class MapContainer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: MapContainer is a quite general name while the entity is very closely related to OSDSuperblock. How about renaming to OSDSuperblockGuardedMap? Even better clarity might achieved with a nested class (if possible): OSDSuperblock::GuardedMap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mohit84 mohit84 force-pushed the tick_without_osd_crash branch from cd06003 to 844e94b Compare June 2, 2025 07:40
@mohit84
Copy link
Contributor Author

mohit84 commented Jun 2, 2025

jenkins test make check

@rzarzynski
Copy link
Contributor

There is an FTBFS when building the file for crimson:

In file included from /home/jenkins-build/build/workspace/ceph-pull-requests/src/crush/CrushWrapper.cc:7:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h:5679:17: error: no type named 'mutex' in namespace 'ceph'
 5679 |   mutable ceph::mutex map_lock = ceph::make_mutex("map_lock");
      |           ~~~~~~^
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h:5698:22: error: no member named 'make_mutex' in namespace 'ceph'
 5698 |     : map_lock(ceph::make_mutex("map_lock"))
      |                ~~~~~~^
2 errors generated.
[763/2885] Building CXX object src/crimson/CMakeFiles/crimson-common.dir/__/common/config.cc.o
FAILED: src/crimson/CMakeFiles/crimson-common.dir/__/common/config.cc.o 

I think this is rather a trivial issue – most likely the header for make_mutex really hasn't been included; for the classical works because of being included indirectly, from another header.

@mohit84 mohit84 force-pushed the tick_without_osd_crash branch from 844e94b to 29fe9f4 Compare June 2, 2025 14:06
@mohit84
Copy link
Contributor Author

mohit84 commented Jun 2, 2025

There is an FTBFS when building the file for crimson:

In file included from /home/jenkins-build/build/workspace/ceph-pull-requests/src/crush/CrushWrapper.cc:7:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h:5679:17: error: no type named 'mutex' in namespace 'ceph'
 5679 |   mutable ceph::mutex map_lock = ceph::make_mutex("map_lock");
      |           ~~~~~~^
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/osd_types.h:5698:22: error: no member named 'make_mutex' in namespace 'ceph'
 5698 |     : map_lock(ceph::make_mutex("map_lock"))
      |                ~~~~~~^
2 errors generated.
[763/2885] Building CXX object src/crimson/CMakeFiles/crimson-common.dir/__/common/config.cc.o
FAILED: src/crimson/CMakeFiles/crimson-common.dir/__/common/config.cc.o 

I think this is rather a trivial issue – most likely the header for make_mutex really hasn't been included; for the classical works because of being included indirectly, from another header.

done

@mohit84
Copy link
Contributor Author

mohit84 commented Jun 2, 2025

jenkins test make check

#include "common/hobject.h"
#include "common/snap_types.h"
#include "common/strtol.h" // for ritoa()
#include "common/ceph_mutex.h" //To use make_mutex() for crimson
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think that the comment is confusing and should be removed. make_mutex is needed for both.

Classic was ok without it because of it being included indirectly (implicitly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

The OSDSuperblock object access/modify epoch maps in multiple
threads simultaneously due to that OSD is getting crashed.
To avoid the crash access the maps under mutex.

Fixes: https://tracker.ceph.com/issues/66819
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@mohit84 mohit84 force-pushed the tick_without_osd_crash branch from 29fe9f4 to 8cf896e Compare June 3, 2025 12:14
@mohit84
Copy link
Contributor Author

mohit84 commented Jun 3, 2025

jenkins test make check

public:
MapContainer() = default;
MapContainer(const MapContainer& other)
: map_lock(ceph::make_mutex("map_lock"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm taking my previous comment back. There is the initializer:

  mutable ceph::mutex map_lock = ceph::make_mutex("map_lock");

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai merged commit c056c9f into ceph:main Jun 23, 2025
12 checks passed
mohit84 added a commit to mohit84/ceph that referenced this pull request Jul 29, 2025
backport tracker: https://tracker.ceph.com/issues/72070

backport of ceph#62916
parent tracker: https://tracker.ceph.com/issues/66819

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 added a commit to mohit84/ceph that referenced this pull request Sep 3, 2025
backport tracker: https://tracker.ceph.com/issues/72070

backport of ceph#62916
parent tracker: https://tracker.ceph.com/issues/66819
(cherry picked from commit 8cf896e)

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Jan 13, 2026
backport tracker: https://tracker.ceph.com/issues/72070

backport of ceph#62916
parent tracker: https://tracker.ceph.com/issues/66819
(cherry picked from commit 8cf896e)

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
(cherry picked from commit 1fdf9cd)

Resolves: rhbz#2417015
NitzanMordhai pushed a commit to NitzanMordhai/ceph that referenced this pull request Mar 5, 2026
backport tracker: https://tracker.ceph.com/issues/72070

backport of ceph#62916
parent tracker: https://tracker.ceph.com/issues/66819
(cherry picked from commit 8cf896e)

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants