Skip to content

mgr/nvmeof: add nvmeof module which will create .nvmeof rbd pool#67167

Merged
Hezko merged 6 commits intoceph:mainfrom
Hezko:nvme-module-two
Feb 26, 2026
Merged

mgr/nvmeof: add nvmeof module which will create .nvmeof rbd pool#67167
Hezko merged 6 commits intoceph:mainfrom
Hezko:nvme-module-two

Conversation

@Hezko
Copy link
Contributor

@Hezko Hezko commented Feb 2, 2026

this PR introduce a new always on module. the module will ensure the creation of a pool called .nvmeof which will be used for saving the metadata pool. This will eliminate the current requirement of creating such pool manually by the user.

fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch tomer.haska@ibm.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@Hezko Hezko self-assigned this Feb 2, 2026
@Hezko Hezko requested a review from a team as a code owner February 2, 2026 19:25
Copilot AI review requested due to automatic review settings February 2, 2026 19:25
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Hezko Hezko requested a review from Copilot February 2, 2026 19:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Option(
name='metadata_pool_pg_num',
type='int',
default=128,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so many PGs? I'm assuming that you are only going to have a couple of objects/omaps per gateway and that the total amount of metadata is probably going to be less than 1MB in all but the very largest configurations. For that type of storage requirement 1 PG should be more than sufficient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right I will change that :)

#endif
"pg_autoscaler",
"telemetry",
"nvmeof",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the map always_on_modules_map below which determines which modules should always be on for each ceph release. Assuming you are back porting this to tentacle (not sure how that works if you have a release of tentacle that supports nvmeof and earlier tentacle releases that don't) I think you need to make a copy of the octopus_modules, call this tentacle_modules and add nvmeof to that and then adjust the map below so that the tentacle release uses tentacle_modules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the purpose of the release in always_on_modules_map is primarily to determine whether health check errors are generated if an always on module is missing. So if you add it to tentacle_modules this means if the Mons are upgraded and Mgr is running < tentacle no health check warnings are generated, if Mgr is running old tentacle a warning will be generated (nvmeof will be missing), if Mgr is running new tentacle the nvmeof module will be running. There isn't a way of avoiding the health check warning during upgrade if you add a module mid release but it will just be a transient warning until the upgrade completes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, that makes sense. I’ll create a tentacle_modules list, add nvmeof to it, and update the map accordingly. I understand there may be a transient health warning during upgrades, which is expected.
I’ll proceed with that approach, and if I run into any issues or misunderstand something, I’ll follow up. Thanks!

@Hezko Hezko force-pushed the nvme-module-two branch 2 times, most recently from 6220366 to 2d748a1 Compare February 10, 2026 19:32
@Hezko Hezko requested a review from a team as a code owner February 10, 2026 19:32
@Hezko Hezko force-pushed the nvme-module-two branch 2 times, most recently from 99ec5e9 to 45c5b9d Compare February 10, 2026 19:38
@Hezko Hezko requested a review from bill-scales February 10, 2026 19:38
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect a commit message that explains the purpose of the new module and the new pool.

Please update docs indicating what this module is. Add a release note.

Aviv mentioned that the current implementation asks users to create this pool, where is that code/doc?

logger.info(f"pool {pool_name} doesnt exist")
return pool_exists

def _create_pool(self, pool_name: str, pg_num: int):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are methods in mgr_module.py for these common operations. Use them please. (Add options if necessary.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to use other modules functions? I didn't see other usages for the create_pool function in mgr_module.py, and thought that if this is the case, it might be easily broken when someone rename or change it (as it will be called with self.remote right?). Are there examples for such usages I missed?

@Hezko
Copy link
Contributor Author

Hezko commented Feb 16, 2026

I expect a commit message that explains the purpose of the new module and the new pool.

Please update docs indicating what this module is. Add a release note.

Aviv mentioned that the current implementation asks users to create this pool, where is that code/doc?

@batrick

  1. After the meeting we had, many of your comments are no longer relevant, as the design and implementation changed a lot! yet they were awesome and I kept them in mind during the implementation.
    I hope my new commit message is better.
  2. Here is a link to the doc with the instructions for pool creation: link
  3. Regarding the release notes, I found this file: doc/releases/tentacle.rst , should I just add it to it? are there any guidelines for tampering with these files?

Thanks a lot! 🎊

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@caroav
Copy link
Contributor

caroav commented Feb 26, 2026

jenkins test make check arm64

Tomer Haskalovitch added 4 commits February 26, 2026 09:32
Introduce a new NVMe-oF mgr module and which create the pool
used for storing NVMe-related metadata ceph orch nvmeof apply command.
This removes the need for users to manually create and configure the
metadata pool before using the NVMe-oF functionality, simplifying
setup and reducing the chance of misconfiguration.

Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
Added a call to create_pool_if_not_exists during the execution of ceph orch apply nvmeof command.

Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
@Hezko
Copy link
Contributor Author

Hezko commented Feb 26, 2026

/config check ok

Tomer Haskalovitch added 2 commits February 26, 2026 10:35
Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
@Hezko
Copy link
Contributor Author

Hezko commented Feb 26, 2026

jenkins test make check arm64

1 similar comment
@Hezko
Copy link
Contributor Author

Hezko commented Feb 26, 2026

jenkins test make check arm64

@Hezko
Copy link
Contributor Author

Hezko commented Feb 26, 2026

jenkins test rook e2e

@Hezko Hezko merged commit 884bbcf into ceph:main Feb 26, 2026
13 of 14 checks passed
@batrick
Copy link
Member

batrick commented Feb 27, 2026

LGTM -- needs to be run through QA.

Who ran this through QA?

@avanthakkar
Copy link
Contributor

avanthakkar commented Mar 2, 2026

This PR seems to be causing crash of mgr :

Mar 02 12:19:14 ceph-vm-0 ceph-mgr[2366]: mgr[py] Exception calling _register_commands on nvmeof
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 2026-03-02T12:19:14.468+0000 7fe8b8dc0fc0 -1 mgr[py] Exception calling _register_commands on nvmeof
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 2026-03-02T12:19:14.468+0000 7fe8b8dc0fc0 -1 mgr[py] Traceback (most recent call last):
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1107, in _register_commands
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]:     cls.COMMANDS.extend(cls.CLICommand.dump_cmd_list())
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: AttributeError: type object 'NVMeoF' has no attribute 'CLICommand'
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 
Mar 02 12:19:14 ceph-vm-0 ceph-mgr[2366]: mgr[py] Traceback (most recent call last):
                                            File "/usr/share/ceph/mgr/mgr_module.py", line 1107, in _register_commands
                                              cls.COMMANDS.extend(cls.CLICommand.dump_cmd_list())
                                          AttributeError: type object 'NVMeoF' has no attribute 'CLICommand'

I've raised a fix for this: #67607. Please feel free to open tracker for the same if required , I'll attach in PR

batrick added a commit to batrick/ceph that referenced this pull request Mar 2, 2026
This PR broke the ceph-mgr [1] and was not QA'd more broadly.

[1] ceph#67167 (comment)

This reverts commit 884bbcf, reversing
changes made to 472e267.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
@batrick
Copy link
Member

batrick commented Mar 2, 2026

This PR seems to be causing crash of mgr :

Mar 02 12:19:14 ceph-vm-0 ceph-mgr[2366]: mgr[py] Exception calling _register_commands on nvmeof
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 2026-03-02T12:19:14.468+0000 7fe8b8dc0fc0 -1 mgr[py] Exception calling _register_commands on nvmeof
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 2026-03-02T12:19:14.468+0000 7fe8b8dc0fc0 -1 mgr[py] Traceback (most recent call last):
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1107, in _register_commands
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]:     cls.COMMANDS.extend(cls.CLICommand.dump_cmd_list())
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: AttributeError: type object 'NVMeoF' has no attribute 'CLICommand'
Mar 02 12:19:14 ceph-vm-0 ceph-b32516c2-1631-11f1-8bb0-52540004db9b-mgr-ceph-vm-0-vzmahz[2362]: 
Mar 02 12:19:14 ceph-vm-0 ceph-mgr[2366]: mgr[py] Traceback (most recent call last):
                                            File "/usr/share/ceph/mgr/mgr_module.py", line 1107, in _register_commands
                                              cls.COMMANDS.extend(cls.CLICommand.dump_cmd_list())
                                          AttributeError: type object 'NVMeoF' has no attribute 'CLICommand'

I've raised a fix for this: #67607. Please feel free to open tracker for the same if required , I'll attach in PR

@Hezko this broke the ceph-mgr. I am reverting in #67608. Please do not merge your PRs without QA and review from a component lead.

batrick added a commit that referenced this pull request Mar 2, 2026
* refs/pull/67608/head:
	Revert "Merge pull request #67167 from Hezko/nvme-module-two"

Reviewed-by: John Mulligan <jmulligan@redhat.com>
Hezko pushed a commit to Hezko/ceph that referenced this pull request Mar 3, 2026
@Hezko Hezko mentioned this pull request Mar 5, 2026
14 tasks
Hezko pushed a commit to Hezko/ceph that referenced this pull request Mar 5, 2026
This reverts commit df0028f.

Fixes: https://tracker.ceph.com/issues/74702

Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
kginonredhat pushed a commit to kginonredhat/ceph that referenced this pull request Mar 13, 2026
This PR broke the ceph-mgr [1] and was not QA'd more broadly.

[1] ceph#67167 (comment)

This reverts commit 884bbcf, reversing
changes made to 472e267.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
JonBailey1993 pushed a commit to JonBailey1993/ceph that referenced this pull request Mar 17, 2026
This PR broke the ceph-mgr [1] and was not QA'd more broadly.

[1] ceph#67167 (comment)

This reverts commit 884bbcf, reversing
changes made to 472e267.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants