
monitoring/ceph-mixin: Cleanup of variables, queries and tests (to fix showMultiCluster=True)#55495

Merged
nizamial09 merged 1 commit into ceph:main from frittentheke:issue_64321
May 2, 2024

Conversation

@frittentheke
Contributor

Rendering the dashboards with showMultiCluster=True allows them to work with multiple clusters storing their metrics in a single Prometheus instance. This works via the (configurable) cluster label, and that functionality already existed.

This commit simply fixes some inconsistencies in applying the label filters, which I found after rendering and using the dashboards with a Prometheus instance holding metrics for multiple Ceph clusters.

There are also issues with the tests. I started working on them as well, but would like some feedback on how best to test with showMultiCluster set to either True or False. This would then ensure that multi-cluster support doesn't break with future changes and additions to the dashboards.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann christian.rohmann@inovex.de
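For illustration, the way such a conditional cluster matcher is commonly templated in mixin Jsonnet looks roughly like this (a minimal sketch; the helper and field names here are hypothetical, not necessarily the ones ceph-mixin uses):

```jsonnet
{
  _config:: {
    showMultiCluster: true,
    clusterLabel: 'cluster',
  },

  // Build the label matcher every query should share.
  // With showMultiCluster=true the templated cluster filter is added;
  // otherwise queries match all series regardless of cluster.
  matchers()::
    if $._config.showMultiCluster
    then '%s=~"$cluster", ' % $._config.clusterLabel
    else '',

  // Example usage inside a panel target:
  osdUpQuery:: 'sum(ceph_osd_up{%(matchers)s})' % { matchers: $.matchers() },
}
```

The inconsistencies this PR fixes are of the kind where some queries include such a matcher and others forget it, so panels silently aggregate over all clusters.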

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@frittentheke frittentheke requested a review from a team as a code owner February 8, 2024 13:53
@frittentheke frittentheke requested review from Pegonzal and ivoalmeida and removed request for a team February 8, 2024 13:53
@frittentheke frittentheke force-pushed the issue_64321 branch 2 times, most recently from 711b4c0 to b127147 Compare February 9, 2024 15:11
@frittentheke
Contributor Author

frittentheke commented Feb 9, 2024

I admit the PR got a little bigger than just "fixing" the queries, but I believe I somewhat stayed in context.
See the commit msg for some of my reasoning.

Before any more cleanup, I suggest first converting to https://github.com/grafana/grafonnet to ensure compatibility with more recent Grafana releases.

@frittentheke frittentheke changed the title monitoring/ceph-mixin: fix multicluster support in dashboards and their queries monitoring/ceph-mixin: Cleanup of variables, queries and tests (to fix showMultiCluster=True) Feb 9, 2024
@frittentheke
Contributor Author

@Javlopez @nizamial09 PTAL.

@cloudbehl
Contributor

@frittentheke Thanks for the PR and for fixing the queries.

Just wanted to understand: rather than having a flag to enable the cluster variable, why don't we have the dashboards ship with the cluster variable enabled by default? Then users wouldn't need to rebuild them to support multi-cluster. Thoughts?

@cloudbehl
Contributor

Also, can you attach a small recording that shows the fixes you made as part of this PR?

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@frittentheke
Contributor Author

@frittentheke Thanks for the PR and for fixing the queries.

Just wanted to understand: rather than having a flag to enable the cluster variable, why don't we have the dashboards ship with the cluster variable enabled by default? Then users wouldn't need to rebuild them to support multi-cluster. Thoughts?

No objections on my part. We ourselves use a single Prometheus instance (it could also be Grafana Mimir / Thanos / Cortex) holding metrics for multiple Ceph clusters and therefore make use of the templating via the cluster label.

@frittentheke
Contributor Author

Also, can you attach a small recording that shows the fixes you made as part of this PR?

@cloudbehl you mean like a screen recording of me clicking through the various Grafana dashboards?

@cloudbehl
Contributor

@frittentheke Thanks for the PR and for fixing the queries.
Just wanted to understand: rather than having a flag to enable the cluster variable, why don't we have the dashboards ship with the cluster variable enabled by default? Then users wouldn't need to rebuild them to support multi-cluster. Thoughts?

No objections on my part. We ourselves use a single Prometheus instance (it could also be Grafana Mimir / Thanos / Cortex) holding metrics for multiple Ceph clusters and therefore make use of the templating via the cluster label.

That's my understanding as well: if a user has multiple Ceph clusters pointing to a single Prometheus, this would work out of the box, and they wouldn't need to rebuild the Grafana dashboards just for that. Even with a single Ceph cluster's data in Prometheus, showing a cluster ID at the top would do no harm.

So let's have this flag enabled by default for all dashboards. I can help test this.

Also, some new dashboards were recently added to Grafana; can you help make sure those also carry the cluster variable and that we use the same queries to fill the values in those variables?

@cloudbehl
Contributor

I have never tried it, so I just want to see how it looks on one or two dashboards, if you can show that. That would be great.

@frittentheke
Contributor Author

frittentheke commented Feb 16, 2024

So let's have this flag enabled by default for all dashboards. I can help test this.

@cloudbehl It might not be that easy. Even a single cluster would then need to have the cluster label on its metrics.
With showMultiCluster enabled, most queries will filter on this label to distinguish between clusters.

And most likely folks will not have Prometheus add this cluster label in their scraping config. The only thing the docs at
https://docs.ceph.com/en/latest/mgr/prometheus/#honor-labels mention is overriding the instance label with a fixed value because the responding mgr host changes. But there is no mention of a cluster label.
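To illustrate what that scrape-side labeling would look like: a minimal Prometheus scrape config sketch that attaches a cluster label to every scraped series (the job name, target address, and label value here are made up):

```yaml
scrape_configs:
  - job_name: ceph
    static_configs:
      - targets: ['ceph-mgr.example.com:9283']
        # Attach a cluster label to every series scraped from this
        # cluster's mgr endpoint; each cluster gets its own value.
        labels:
          cluster: prod-ceph-1
```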

I am wondering if there is any way to make this work opportunistically ... but I doubt it.
There might have been a reason to have this as a configurable option after all ...

Also, some new dashboards were recently added to Grafana; can you help make sure those also carry the cluster variable and that we use the same queries to fill the values in those variables?

I'll check again whether all boards and queries are instrumented.

@frittentheke frittentheke force-pushed the issue_64321 branch 2 times, most recently from 339d9b9 to 08ea2eb Compare February 16, 2024 13:44
@frittentheke
Contributor Author

@cloudbehl

I seem to have fallen into a rabbit hole, just trying to fix "a few" issues with multi-cluster ...

I pushed a new revision now which also updates the recently added "RGW S3 Analytics" boards that came in via https://tracker.ceph.com/issues/64359. Unfortunately #55314 was also merged, which breaks the queries for <=Reef I suppose, so there is no simple backport of this anymore (if accepted and merged).

In any case, this PR has gotten way bigger than I expected. I will gladly provide a little recording of me browsing through the dashboards, but honestly that is not enough as a review. Especially the labels instance and hostname are used oddly (I raised https://tracker.ceph.com/issues/64288 a while back). Then there is a lot of label_replace happening.
While it's nice to make the queries work with even the most cluttered instance labels (e.g. with port numbers), they could look much, much cleaner if the source were simply expected to provide them in a certain syntax (e.g. have Prometheus write clean instance labels). I tried to apply some cleaning and alignment, but this requires a good pair of eyes in reviewing the changes. I don't want to break things for some folks, but only help fix the multi-cluster dashboards.

All in all I believe the mixins deserve a full refactoring at some point to ensure they remain maintainable:

a) Upgrade / switch to https://github.com/grafana/grafonnet (from https://github.com/grafana/grafonnet-lib, which is deprecated)
b) Upgrade to the latest Grafana panels, align all the boards even more in their style and naming, and remove all of the boilerplate or explicit config with sane defaults where possible.
c) Review whether certain patterns could be moved into little helpers, just like the matchers are now. My first candidate would be the label_replace handling. Requiring so much post-processing of metrics and their labels, and so much code in the generator, makes any change too risky.
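As an example of the kind of label_replace post-processing being discussed — a hedged sketch, not a query taken verbatim from the mixin — stripping the port from a cluttered instance label looks like this in PromQL:

```promql
label_replace(
  ceph_osd_metadata,       # vector whose labels get rewritten
  "instance",              # destination label
  "$1",                    # replacement: first capture group
  "instance",              # source label
  "([^:]+):[0-9]+"         # matches "host:port"; labels without a port are left unchanged
)
```

Every query that needs a clean hostname repeats a variant of this, which is why pulling it into a shared helper (as suggested above) would reduce risk.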

@cloudbehl
Contributor

So let's have this flag enabled by default for all dashboards. I can help test this.

@cloudbehl It might not be that easy. Even a single cluster would then need to have the cluster label on its metrics. With showMultiCluster enabled, most queries will filter on this label to distinguish between clusters.

And most likely folks will not have Prometheus add this cluster label in their scraping config. The only thing the docs at https://docs.ceph.com/en/latest/mgr/prometheus/#honor-labels mention is overriding the instance label with a fixed value because the responding mgr host changes. But there is no mention of a cluster label.

I am wondering if there is any way to make this work opportunistically ... but I doubt it. There might have been a reason to have this as a configurable option after all ...

Also, some new dashboards were recently added to Grafana; can you help make sure those also carry the cluster variable and that we use the same queries to fill the values in those variables?

I'll check again whether all boards and queries are instrumented.
Thanks for looking into it.

This is something we will soon have in the main branch via this PR (#54964), so all new clusters will have the cluster label attached to their Prometheus metrics by default.

We have seen a lot of admins doing the same just to support multi-cluster, so it makes sense to add the cluster label to all the queries by default. Also, as we progress with multi-cluster management and monitoring for Ceph clusters, having the label should become standard so we don't need to rely on the instance label.
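For reference, the Grafana template variable that such dashboards use to fill $cluster is typically backed by a label_values query along these lines (a sketch; the exact metric ceph-mixin queries may differ):

```promql
label_values(ceph_health_status, cluster)
```

Keeping this one query consistent across all dashboards is what ensures the variable shows the same set of clusters everywhere.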

@cloudbehl
Contributor

@cloudbehl

I seem to have fallen into a rabbit hole, just trying to fix "a few" issues with multi-cluster ...

What's the major issue that you are seeing with it?

I pushed a new revision now which also updates the recently added "RGW S3 Analytics" boards that came in via https://tracker.ceph.com/issues/64359. Unfortunately #55314 was also merged, which breaks the queries for <=Reef I suppose, so there is no simple backport of this anymore (if accepted and merged).

Can we have a separate commit for the RGW-related dashboards in a different PR, so it could possibly be backported to fix the potential issue?

In any case, this PR has gotten way bigger than I expected. I will gladly provide a little recording of me browsing through the dashboards, but honestly that is not enough as a review. Especially the labels instance and hostname are used oddly (I raised https://tracker.ceph.com/issues/64288 a while back). Then there is a lot of label_replace happening. While it's nice to make the queries work with even the most cluttered instance labels (e.g. with port numbers), they could look much, much cleaner if the source were simply expected to provide them in a certain syntax (e.g. have Prometheus write clean instance labels). I tried to apply some cleaning and alignment, but this requires a good pair of eyes in reviewing the changes. I don't want to break things for some folks, but only help fix the multi-cluster dashboards.

All in all I believe the mixins deserve a full refactoring at some point to ensure they remain maintainable:

a) Upgrade / switch to https://github.com/grafana/grafonnet (from https://github.com/grafana/grafonnet-lib, which is deprecated)

I agree we should migrate.

b) Upgrade to the latest Grafana panels, align all the boards even more in their style and naming, and remove all of the boilerplate or explicit config with sane defaults where possible.

Agreed, this is much needed; it's something I have been discussing with the monitoring team for a while as well.

A few improvement areas that I see:

  1. All the dashboards need to be revisited to see how we can reduce their count. We have two cluster dashboards and four RGW dashboards; some could be merged into one to create less confusion for admins.
  2. Too many variables in some dashboards. Some are not even working, and some are working but don't have proper filtering.
  3. Adding proper text helpers for all the graphs.
  4. All the graphs/tables need to be migrated to the newer graph/table panels.

@frittentheke
Contributor Author

I seem to have fallen into a rabbit hole, just trying to fix "a few" issues with multi-cluster ...

What's the major issue that you are seeing with it?

If you look at the changes, I also cleaned up (and hopefully did not break) quite a few queries that were not suitable for a Prometheus instance holding data for hosts other than those running Ceph (e.g. listing all "instance" values for a Ceph dashboard template variable).

I still believe I did some good to all of these dashboards. Even with a rewrite coming, having a good base of working queries makes that process a lot easier, so I gladly invested the time.

@frittentheke
Contributor Author

I pushed a new revision now which also updates the recently added "RGW S3 Analytics" boards that came in via https://tracker.ceph.com/issues/64359. Unfortunately #55314 was also merged, which breaks the queries for <=Reef I suppose, so there is no simple backport of this anymore (if accepted and merged).

Can we have a separate commit for the RGW-related dashboards in a different PR, so it could possibly be backported to fix the potential issue?

You sure can, I'll look into it.
I also renamed / aligned the target filenames to universally use the radosgw- prefix. Do you like that part, or should I remove it altogether?

@cloudbehl
Contributor

cloudbehl commented Feb 22, 2024

You sure can, I'll look into it. I also renamed / aligned the target filenames to universally use the radosgw- prefix. Do you like that part, or should I remove it altogether?

I think we can do the renaming in a separate small PR after all this is done, just for the squid and main branches.

@aaSharma14
Contributor

@frittentheke , there are two related issues added to this tracker - https://tracker.ceph.com/issues/64321; one is for the squid branch and the second one is for the reef branch. The steps to open the backport PRs are:

  1. Check out the squid branch
  2. Run `cd src/script`
  3. Run `./ceph-backport.sh --setup`
  4. The script will ask you to verify or enter some details like your Redmine key, GitHub username, GitHub token, etc.
  5. If the setup is okay, it should return: `ceph-backport.sh: setup is OK`
  6. Now run `./ceph-backport.sh <squid_tracker_number>`, e.g. `./ceph-backport.sh 65838`
  7. This will open the backport PR for squid
  8. You can do the same for the reef backport as well.

@frittentheke
Contributor Author

@frittentheke , there are two related issues added to this tracker - https://tracker.ceph.com/issues/64321; one is for the squid branch and the second one is for the reef branch. The steps to open the backport PRs are:

1. Check out the squid branch
2. Run `cd src/script`
3. Run `./ceph-backport.sh --setup`
4. The script will ask you to verify or enter some details like your Redmine key, GitHub username, GitHub token, etc.
5. If the setup is okay, it should return: `ceph-backport.sh: setup is OK`

Done.

6. Now run `./ceph-backport.sh <squid_tracker_number>`, e.g. `./ceph-backport.sh 65838`
7. This will open the backport PR for squid

@aaSharma14
I suppose I should actually cherry-pick and adjust the commit to be backported, right?
Or is that done automagically?

Also, it refuses because the issues are assigned to you:

ceph-backport.sh: my Redmine username is crohmann (ID 12304)
ceph-backport.sh: ERROR: https://tracker.ceph.com/issues/65838 is assigned to someone else: Aashish Sharma (ID 11319)
ceph-backport.sh: (my ID is 12304)
ceph-backport.sh: Cowardly refusing to continue

@aaSharma14
Contributor

@frittentheke , there are two related issues added to this tracker - https://tracker.ceph.com/issues/64321; one is for the squid branch and the second one is for the reef branch. The steps to open the backport PRs are:

1. Check out the squid branch
2. Run `cd src/script`
3. Run `./ceph-backport.sh --setup`
4. The script will ask you to verify or enter some details like your Redmine key, GitHub username, GitHub token, etc.
5. If the setup is okay, it should return: `ceph-backport.sh: setup is OK`

Done.

6. Now run `./ceph-backport.sh <squid_tracker_number>`, e.g. `./ceph-backport.sh 65838`
7. This will open the backport PR for squid

@aaSharma14 I suppose I should actually cherry-pick and adjust the commit to be backported, right? Or is that done automagically?

Also, it refuses because the issues are assigned to you:

ceph-backport.sh: my Redmine username is crohmann (ID 12304)
ceph-backport.sh: ERROR: https://tracker.ceph.com/issues/65838 is assigned to someone else: Aashish Sharma (ID 11319)
ceph-backport.sh: (my ID is 12304)
ceph-backport.sh: Cowardly refusing to continue

@frittentheke , the cherry-pick is done automatically by this script. However, if there are any conflicts, you can just resolve them, do a `git add`, and then re-run the script, and that should be it. Also, I have changed the assignee to you, so you can try again. Thanks

@aaSharma14
Contributor

Thank you for the backport @frittentheke , Can you please open the reef backport for this as well?

@frittentheke
Contributor Author

Thank you for the backport @frittentheke , can you please open the reef backport for this as well?

Yes, but that one just needs some more attention, at least due to the renaming of the rgw counters, see
#55495 (comment)

frittentheke added a commit to frittentheke/ceph that referenced this pull request Jun 18, 2024
Following PR ceph#55495 fixing the
dashboard in regards to multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the issues
for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
frittentheke added a commit to frittentheke/ceph that referenced this pull request Jul 8, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Jul 21, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Aug 15, 2024
aaSharma14 pushed a commit to frittentheke/ceph that referenced this pull request Sep 24, 2024
aaSharma14 pushed a commit to frittentheke/ceph that referenced this pull request Oct 21, 2024
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Oct 31, 2024
piyushagarwal1411 pushed a commit to piyushagarwal1411/ceph that referenced this pull request Nov 26, 2024
JonBailey1993 pushed a commit to JonBailey1993/ceph that referenced this pull request Dec 2, 2024
kulahe3 pushed a commit to kulahe3/ceph that referenced this pull request Dec 10, 2024
CJary pushed a commit to CJary/ceph that referenced this pull request Jan 17, 2025
VallariAg pushed a commit to VallariAg/ceph that referenced this pull request Jan 23, 2025
(cherry picked from commit 810c706)
VallariAg pushed a commit to VallariAg/ceph that referenced this pull request Jan 23, 2025
baum pushed a commit to baum/ceph that referenced this pull request Feb 8, 2025
baum pushed a commit to baum/ceph that referenced this pull request Mar 9, 2025
- gateway submodule

Fixes: https://tracker.ceph.com/issues/64777

This PR adds high availability support for the nvmeof Ceph service. High availability means that even in the case that a certain GW is down, there will be another available path for the initiator to be able to continue the IO through another GW. High availability is achieved by running nvmeof service consisting of at least 2 nvmeof GWs in the Ceph cluster. Every GW will be seen by the host (initiator) as a separate path to the nvme namespaces (volumes).

The implementation consists of the following main modules:

- NVMeofGWMon - a PaxosService. It is a monitor that tracks the status of the running nvmeof services and takes action when services fail or are restored.
- NVMeofGwMonitorClient – It is an agent that is running as a part of each nvmeof GW. It is sending beacons to the monitor to signal that the GW is alive. As a part of the beacon, the client also sends information about the service. This information is used by the monitor to take decisions and perform some operations.
- MNVMeofGwBeacon – It is a structure used by the client and the monitor to send/recv the beacons.
- MNVMeofGwMap – The map is tracking the nvmeof GWs status. It also defines what should be the new role of every GW. So in the events of GWs go down or GWs restored, the map will reflect the new role of each GW resulted by these events. The map is distributed to the NVMeofGwMonitorClient on each GW, and it knows to update the GW with the required changes.

It is also adding 3 new mon commands:
- nvme-gw create
- nvme-gw delete
- nvme-gw show

The commands are used by the ceph adm to update the monitor that a new GW is deployed. The monitor will update the map accordingly and will start tracking this GW until it is deleted.

Signed-off-by: Leonid Chernin <lechernin@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 5843c6b)

mon: add NVMe-oF gateway monitor and HA doc

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit bb75dde)

mgr/cephadm: ceph nvmeof monitor support

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 2946b19)

mon/NVMeofGwMap.cc: tabbing, line length, formatting

- Retabs file to match emacs/vim modelines at top
- Fixes bracing
- Adjusts line length to 80 char

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 8bf309e)

mon/NVMeofGwMap.h: tabbing, line length, formatting

- Adjust method signatures to better match mon/
- Adjust line length to 80 characters

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 58d16c7)

mon/NVMeofGwMon.h: tabbing, line length, formatting

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 1f470f0)

mon/NVMeofGwMon.cc: tabbing, line length, formatting

- Retabs file to match emacs/vim modelines at top
- Fixes bracing
- Adjusts line length to 80 char

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit bff9dd4)

mon/NVMeofGwTypes.h: tabbing, bracing, line length fixes

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit e0f0469)

mon/NVMeofGwSerialize.h: tabbing, bracing, line length fixes

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit d5e013f)

mgr/orchestrator: require "group" field for nvmeof specs

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit f6d552d)

mgr/cephadm: migrate nvmeof specs without group field

As we have added the group field as a requirement for new
nvmeof specs and check for it in spec validation, we need
a migration to populate this field for specs we find that
don't have it.

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit d7b00ea)

mgr/cephadm: make nvme-gw adds be able to handle multiple services/groups

Before this was grabbing the service spec for the first daemon
description in the list. This meant every daemon would be added
with the pool/group of whatever that spec happened to specify.
This patch grabs the spec, and therefore also the pool/group
individually for each nvmeof daemon

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 2a6b105)

qa/cephadm: add group param when applying nvmeof

Since it will now be required

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 41c5dbe)

include/ceph_features: remove stray available marker

Should have been removed in caa9e7a.

Signed-off-by: Samuel Just <sjust@redhat.com>

include/ceph_features: add NVMEOFHA feature bit

Normally, we'd just use the SERVER_SQUID or SERVER_T flags instead of
using an extra feature bit.  However, the nvmeof ha monitor paxos
service has had a more complex development journey.  There are users
interested in using the nvmeof ha feature in squid, but it didn't make
the cutoff for backporting it.  There's an upstream nvmeof-squid branch
in the ceph.git repository with the patches backported for anyone
interested in building it.

However, that means that users of our normal stable releases will see
the feature added to the monitor one release after anyone who chooses to
use the nvmeof-squid branch.  We could disallow upgrades from
nvmeof-squid to T, but by adding a feature bit here we make such a
restriction unnecessary.

Signed-off-by: Samuel Just <sjust@redhat.com>

mon/NVMeofGw*: support upgrades from prior out-of-tree nvmeofha implementation (nvmeof-reef)

This commit adds upgrade support for users running an experimental
nvmeofha implementation which can be found in the nvmeof-reef branch in
ceph.git.

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>

 mon/NVMeofGw*: fixing bugs - handle gw fast-reboot, proper handle of gw delete scenarios

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>

nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons

Add beacon_lock to mitigate potential beacon delays caused by slow message
handling, particularly in handle_nvmeof_gw_map.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 0dc4185)

cephadm: mount nvmeof certs into container

ceph@2946b19
incorrectly removed this line and since then these certs are
not being properly mounted into the container. This commit
adds the line back

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 8cc3a35)

qa/suites/rbd/nvmeof: add multi-subsystem setup and thrash test

1. qa/tasks/nvmeof.py:
    1.1. create multiple rbd images for all subsystems
    1.2. add NvmeofThrasher and ThrashTest
2. qa/tasks/mon_thrash.py: add 'switch_thrashers' option
3. nvmeof_setup_subsystem.sh: create multiple subsystems and enable HA
4. Restructure qa/suites/rbd/nvmeof: Create two sub-suites
   - "basic" (nvmeof_initiator job)
   - "thrash" (new: nvmeof_mon_thrash and nvmeof_thrash jobs)

Resolves: rhbz#2302243

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit d0c4182)

Revert "mgr/orchestrator: require "group" field for nvmeof specs"

This reverts commit f6d552d.

It was decided by the nvmeof team to stick with defaulting to
an empty string rather than forcing the users onto other
non-empty names when they upgrade

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 3e5e85a)

Revert "mgr/cephadm: migrate nvmeof specs without group field"

This reverts commit d7b00ea.

It was decided by the nvmeof team to stick with defaulting to
an empty string rather than forcing the users onto other
non-empty names when they upgrade

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit e63d4b0)

mgr/orchestrator: allow passing group to apply/add nvmeof commands

We no longer require the group when applying an nvmeof spec
but we still want to allow the commands to take a group
parameter (and this will at least make a group name
required when creating a new service on the command line)

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit b377085)

mon/NVMeofGw*: fix issue where the ANA group of a deleted GW was not serviced;
introduce GW Deleting state
Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>

Resolves: rhbz#2310380
(cherry picked from commit d4f961a)

mon/NVMeofGw*:
 1. fix blocklist bug - blocklist was not called
 2. originally the monitor only blocklisted specific ANA groups, but since we
    allow changing the ns ANA group on the fly for the sake of ns load
    balancing, that is not good enough and we need to blocklist all the
    cluster contexts of the failing gateway
Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>

(cherry picked from commit 936d3af)

mon/NVMeofGw*: fix issue where the GW was down when the last subsystem was deleted

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>

Resolves: rhbz#2301460

(cherry picked from commit 698e4c5)

Merge pull request ceph#59999 from leonidc/tracking-gw-deleting

mon/nvmeofgw*: fix tracking gateways in DELETING state
Resolves: rhbz#2314625

(cherry picked from commit 381a408)
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

mgr/cephadm: change ceph-nvmeof gw image version to 1.3
Resolves: rhbz#2309667

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 783f868)

mgr/cephadm: Make the discovery and gateway IPs configurable in NVMEof configuration

Resolves: rhbz#2311459
(cherry picked from commit 9f6d1ec)
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>

pybind/mgr/cephadm/services/nvmeof.py: allow setting '0.0.0.0' as address in the spec file

- Partial revert of ceph@9eb3b99
- Part of ceph#59738

(cherry picked from commit 62a4247)

python-common/ceph/deployment/service_spec.py: Allow the cephadm deployment to determine the default addresses

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 0997e4c)

Resolves: rhbz#2311996
(cherry picked from commit 2db7559)

qa/tasks/nvmeof.py: add nvmeof gw-group to deployment

Group was made a required parameter of
`ceph orch apply nvmeof <pool> <group>` in
ceph#58860.
That broke the `nvmeof` suite, so this commit fixes that.

Right now, all gateways are deployed in a single group.
Later, this could be changed to multiple groups for better testing.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit c9a6fed)

qa: Expand nvmeof thrasher and add nvmeof_namespaces.yaml job

1. qa/tasks/nvmeof.py: add other methods to stop nvmeof daemons
2. add qa/workunits/rbd/nvmeof_namespace_test.sh, which adds and
   deletes new namespaces. It is run in the nvmeof_namespaces.yaml
   job, where fio runs on other namespaces in the background.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 58d8be9)

qa/suites/nvmeof/basic: add nvmeof_scalability test

Add test to upscale/downscale nvmeof
gateways.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit e5a9cda)

qa: move nvmeof shell scripts to qa/workunits/nvmeof

Move all scripts qa/workunits/rbd/nvmeof_*.sh
to qa/workunits/nvmeof/*.sh

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 2ed818e)

Conflicts:
	qa/workunits/nvmeof/setup_subsystem.sh

qa/suites/nvmeof: increase hosts in cluster setup

In the "nvmeof" task, change the "client" config to "installer",
which allows taking inputs like "host.a".

nvmeof/basic: change 2-gateway-2-initiator to
              4-gateway-2-initiator cluster
nvmeof/thrash: change 3-gateway-1-initiator to
               4-gateway-1-initiator cluster

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 4d97b1a)

qa/suites/nvmeof: add mtls test

Add qa/workunits/nvmeof/mtls_test.sh, which enables
the mTLS config and redeploys, then verifies and disables
the mTLS config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fdc93ad)

Conflicts:
	qa/tasks/nvmeof.py

qa/suite/nvmeof/thrash: increase number of thrashing

- Run fio for 15 mins (instead of 10min).
- nvmeof.py: change daemon_max_thrash_times default from 3 to 5
- nvmeof.py: run nvme list in do_checks()

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 51743e6)

qa/suites/nvmeof: add nvmeof warnings to log-ignorelist

Add NVMEOF_SINGLE_GATEWAY and NVMEOF_GATEWAY_DOWN
warnings to nvmeof:thrash job's log-ignorelist

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 73d5c01)

qa/suites/nvmeof/thrash: Add "is unavailable" to log-ignorelist

This commit also:
- Remove --rbd_iostat from thrasher fio
- Log iteration details before printing stats in nvmeof_thrasher

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c0ca0eb)

qa/tasks/nvmeof.py: Improve thrasher and rbd image creation

Create rbd images in one command using ";" to queue them,
instead of running "cephadm shell -- rbd create" again
and again for each image.

Improve the method to select to-be-thrashed daemons.
Use randint() and sample(), instead of weights/skip.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 82118e1)
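
The batching idea above can be sketched as a small helper that builds one `cephadm shell` invocation with the `rbd create` calls joined by ";" (the helper name and image sizing are illustrative, not the exact qa-task code):

```python
def build_rbd_create_cmd(pool, images):
    """Hypothetical helper mirroring the described change: queue all
    'rbd create' calls in a single shell invocation using ';'
    instead of spawning 'cephadm shell -- rbd create' per image."""
    creates = "; ".join(f"rbd create {pool}/{img} --size 1G" for img in images)
    return ["cephadm", "shell", "--", "bash", "-c", creates]

cmd = build_rbd_create_cmd("mypool", ["img0", "img1", "img2"])
```

One container invocation instead of N is what makes image creation noticeably faster for large namespace counts.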

qa/tasks/ceph: provide configuration for setting configs via mon

These configs may be set using:

ceph:
  cluster-config:
    entity:
      foo: bar

same as the current:

ceph:
  config:
    entity:
      foo: bar

The configs will be set in parallel using the `ceph config set` command.

The main benefit here is to avoid using ceph.conf to set configs, which
cannot be overridden by a subsequent `ceph config` command. The only ways to
override are to change the ceph.conf in the test (yuck) or use the admin socket
(which gets reset when the daemon restarts).

Finally, we can now exploit the `ceph config reset` command, which lets us
trivially roll back config changes after a test completes. That is exposed
as the `ctx.config_epoch` variable.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 9d485ae)
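
A minimal sketch of how such a `cluster-config` mapping could be flattened into `ceph config set` commands (the function name and exact command layout are assumptions, not the real teuthology code):

```python
def config_set_commands(cluster_config):
    """Flatten a {entity: {option: value}} mapping into a list of
    `ceph config set` command vectors, as described above."""
    cmds = []
    for entity, options in cluster_config.items():
        for key, value in options.items():
            cmds.append(["ceph", "config", "set", entity, key, str(value)])
    return cmds

cmds = config_set_commands({
    "osd": {"osd_op_queue": "wpq"},
    "mon": {"mon_max_pg_per_osd": 400},
})
```

Each resulting command vector can then be dispatched in parallel, since `ceph config set` calls for different keys are independent.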

python-common/ceph/deployment: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d3cc237)

mgr/cephadm: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 19399de)

python-common/ceph/deployment: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d18e6fb)

mgr/cephadm: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d208242)

python-common/ceph/deployment: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit cb28d39)

mgr/cephadm: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 11de53f)

python-common/ceph/deployment: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit dd4b357)

mgr/cephadm: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0dcc207)

python-common/ceph/deployment: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 4269d7c)

mgr/cephadm: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 1807a55)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add auto rebalance fields to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit bfc8fb6)

mgr/cephadm/nvmeof: Rewrite NVMEoF fields validation.
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 31283c0)

mgr/cephadm/nvmeof: Add key verification field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69413

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 26a0f9a)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id

- make service_id better aligned with default/empty group
  (ceph@f6d552d)
- fix service_id in nvmeof daemon add

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e1612d0)

cephadm/nvmeof: support no huge pages for nvmeof spdk

depends on: ceph/ceph-nvmeof#898

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 38513cb)

cephadm/nvmeof: support per-node gateway addresses

Added gateway and discovery address maps to the service specification.
These maps store per-node service addresses. The address is first searched
in the map, then in the spec address configuration. If neither is defined,
the host IP is used as a fallback.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 2f47f9d)

cephadm/nvmeof: fix ports when default values are overridden

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e717a92)

src/nvmeof/NVMeofGwMonitorClient: remove MDS client, not needed

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit f806872)

mon: add nvmeof healthchecks

Add NVMeofGwMap::get_health_checks which raises
NVMEOF_SINGLE_GATEWAY if any of the groups have
1 gateway.

In NVMeofGwMon, call `encode_health` and `load_health`
to register healthchecks. This will add nvmeof healthchecks
to "ceph health" output.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 1cad040)
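
The gist of the NVMEOF_SINGLE_GATEWAY check can be sketched like this (the `{group: [gateways]}` layout is illustrative; the real check lives in C++ in NVMeofGwMap::get_health_checks):

```python
def single_gateway_warnings(gw_map):
    """Flag every gateway group that has exactly one gateway,
    i.e. no HA redundancy."""
    return [
        f"NVMEOF_SINGLE_GATEWAY: group '{group}' has only 1 gateway"
        for group, gateways in gw_map.items()
        if len(gateways) == 1
    ]

warnings = single_gateway_warnings({"g0": ["gw.a"], "g1": ["gw.b", "gw.c"]})
```

A single-gateway group is worth warning about because losing that one gateway takes down the whole group's I/O path.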

mon: add warning NVMEOF_GATEWAY_DOWN

In src/mon/NVMeofGwMap.cc,
add warning NVMEOF_GATEWAY_DOWN when any
gateway is in GW_UNAVAILABLE state.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0006599)

monitoring: Add prometheus alert NVMeoFMultipleNamespacesOfRBDImage

NVMeoFMultipleNamespacesOfRBDImage alerts the user if an RBD image
is used for multiple namespaces. This is an important alert for cases
where namespaces are created on the same image in different gateway groups.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 61b3289)

monitoring: add 2 nvmeof alerts to prometheus_alerts.yaml

- `NVMeoFMissingListener`: triggers if a listener is not created
     for every gateway in a subsystem
- `NVMeoFZeroListenerSubsystem`: triggers if a subsystem has no listeners

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f02e312)

monitoring: add 2 new nvmeof alerts

Add NVMeoFMissingListener and NVMeoFZeroListenerSubsystem
alerts to prometheus_alerts.libsonnet.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7994fea)

monitoring: add tests for 2 new nvmeof alerts

Add test for alerts NVMeoFMissingListener and
NVMeoFZeroListenerSubsystem to test_alerts.yml.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit a878460)

monitoring: Add alert NVMeoFTooManyNamespaces

NVMeoFTooManyNamespaces alerts the user if the total
number of namespaces across subsystems exceeds
1024.

Change NVMeoFTooManySubsystems limit to 128 from 16.

Fixes: ceph/ceph-nvmeof#948

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 614e146)

mon/NVMeofGwMap: add healthcheck warning NVMEOF_GATEWAY_DELETING

Add a warning when NVMeoF gateways are in DELETING state.
This happens when there are namespaces under the deleted gateway's
ANA group ID.

The gateways are removed completely after users manually move these
namespaces to another load balancing group, or if a new gateway is
deployed on that host.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 571dd53)

src/common/options/mon.yaml.in: add mon_nvmeofgw_delete_grace

This config sets the delay for triggering the
NVMEOF_GATEWAY_DELETING healthcheck warning, which is
raised when NVMeoF gateways are in DELETING state
for too long (indicating a problem in namespace
load-balancing).
The default value for this config is 15 mins.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7b33f77)

mon/NVMeofGwMap: add delay to NVMEOF_GATEWAY_DELETING warning

Instead of triggering immediately, have this healthcheck trigger
after some time has elapsed. This delay can be configured via
mon_nvmeofgw_delete_grace.

Track the time when gateways go into DELETING state in a new
member var (of NVMeofGwMon) 'gws_deleting_time'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 56cf512)
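
A minimal sketch of the grace-period logic described above, assuming a `{gateway: deleting_since_timestamp}` map like `gws_deleting_time` (the Python shape is illustrative; the real state lives in NVMeofGwMon):

```python
import time

def deleting_warnings(gws_deleting_time, grace=15 * 60, now=None):
    """Return the gateways that have been in DELETING state longer
    than the grace period (default mirrors mon_nvmeofgw_delete_grace,
    15 minutes)."""
    now = time.time() if now is None else now
    return [gw for gw, since in gws_deleting_time.items() if now - since > grace]

now = 10_000.0
stuck = deleting_warnings({"gw.a": now - 1000, "gw.b": now - 100},
                          grace=900, now=now)
```

Only `gw.a` exceeds the 900-second grace here, so a brief, expected DELETING phase (`gw.b`) does not raise a spurious health warning.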

qa/workunits/nvmeof/basic_tests.sh: fix connect-all assert

There seems to be a change in the 'nvme list' JSON output
which caused assert failures after the 'nvme connect-all'
command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 22f91cd)

mon/nvmeofgw*: fix monitor database corruption upon adding a gw

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 417c544)

mon/nvmeofgw*: fix HA usecase when gateway has no listeners: behaves like no-subsystems

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 47e7a24)

mon/nvmeofgw*: monitors publish, in 'nvme-gw show', the ANA group responsible
for namespace rebalance

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit c358483)

nvmeofgw*: fix publishing rebalance index

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit ceb62c0)

mgr/cephadm: change ceph-nvmeof gw image version to 1.4
Fixes https://tracker.ceph.com/issues/69099

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mon/nvme: fix unused lambda capture warnings

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit edb0321)

Add multi-cluster support (showMultiCluster=True) to alerts

Following PR ceph#55495, which fixed the
dashboards with regard to multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the same issues
for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
(cherry picked from commit 810c706)

monitoring: Update nvmeof alert limits in config

Update these in config.libsonnet:
- NVMeoFMaxGatewaysPerGroup (4->8)
- NVMeoFMaxGatewaysPerCluster (4->32)
- NVMeoFMaxNamespaces (1024->2048)
- NVMeoFHighClientCount (32->128)

Also update prometheus_alerts.yml and test_alerts.yml
accordingly.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f3c1881)

mon: do not show nvmeof in 'ceph versions' output

The NVMeoF gateway version is independent of the Ceph version,
so 'ceph versions' shows the wrong nvmeof version in its output
(i.e. instead of the gateway version, it shows the Ceph version).
Hence, remove nvmeof from the 'ceph versions' output.

To check for gateway version, use 'gw info' command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 73c935d)

mgr/cephadm/nvmeof: Add verify_listener_ip field to NVMeOF configuration and remove obsolete enable_key_encryption
Fixes https://tracker.ceph.com/issues/69731

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 744b04a)

mgr/cephadm/nvmeof: Add max_hosts field to NVMeOF configuration and update default values
Fixes https://tracker.ceph.com/issues/69759

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0d8bd4d)

mgr/cephadm/nvmeof: Add SPDK iobuf options field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69554

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 42bac97)

monitoring: add NVMeoFMaxGatewayGroups

Add config NVMeoFMaxGatewayGroups to config.libsonnet
and set it to 4 (groups).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c5c4b10)

monitoring: add alert NVMeoFMaxGatewayGroups

Add alert NVMeoFMaxGatewayGroups to prometheus_alerts.yml
and prometheus_alerts.libsonnet.

This alert indicates that the max number of NVMeoF gateway
groups has been reached in a cluster.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit ab4a1dd)

monitoring: add tests for NVMeoFMaxGatewayGroups

Add unit tests for alert NVMeoFMaxGatewayGroups
in monitoring/ceph-mixin/tests_alerts/test_alerts.yml

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e5cb5db)

qa/tasks/nvmeof: Add --refresh flag in do_checks() cmds

This ensures the latest state of the services is displayed.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 023c209)

qa: Add qa/suites/nvmeof/thrash/gateway-initiator-setup/2-subsys-8-namespace.yaml

This allows running the nvmeof thrasher test on smaller
configurations, which finishes faster than the 120subsys-8ns
config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d7551f7)

qa/tasks/nvmeof.py: Add stop_and_join method to thrasher

Also add nvme-gw show command output in do_checks()
and revive daemons with 'ceph orch daemon start' in
revive_daemon() method.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 0b0f450)

qa/workunits/nvmeof/fio_test.sh: fix fio filenames

Filenames were provided to fio as nvme1n1:nvme1n2;
they should be full paths (/dev/nvme1n1:/dev/nvme1n2).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 06811a4)

qa/tasks/nvmeof.py: Do not use 'systemctl start' in thrasher

Instead, use 'daemon start' in revive_daemon() to bring
up gateways thrashed with 'systemctl stop'.
This is because the 'systemctl start' method seems to have
temporary issues.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit b5e6a0c)

qa/tasks/nvmeof.py: make separate calls in do_checks()

When running the 'nvme list-subsys <device>' command
in do_checks(), make separate calls instead of combining
the command for all devices with '&&'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 5a58114)

qa/tasks/nvmeof.py: Fix do_checks() method

All checks currently run on the initiator node; now
run all "ceph" commands on one of the gateway hosts
instead, and run the "nvme list" and "nvme list-subsys"
checks on the initiator node.

Add retry (5 times) to do_checks if any command fails.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7dfd3d3)
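
The retry behaviour described above can be sketched as a small helper (names and signature are illustrative, not the actual do_checks() code):

```python
import time

def with_retries(fn, attempts=5, delay=1.0):
    """Re-run a check up to 'attempts' times before giving up,
    sleeping 'delay' seconds between tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted all attempts; surface the failure
            time.sleep(delay)

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky, attempts=5, delay=0)
```

Retrying transient command failures makes the thrash-test checks far less flaky without hiding a check that fails persistently.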

qa/tasks/nvmeof.py: Ignore systemctl_stop thrashing method

Do not use systemctl_stop method to thrash daemons,
just use 'ceph orch daemon stop' and 'ceph orch daemon rm'
methods to thrash nvmeof gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d4aec58)

qa/tasks/nvmeof.py: Add teardown() method

Add a teardown method to remove the nvmeof service
before the rest of the cluster tears down.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e8201d3)

qa/suites/nvmeof: Remove watchdog from thrasher

This commit does the following:
1. remove watchdog from thrasher
2. remove wait from fio_test
3. change thrasher switcher wait-time to 10 mins

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 76b4028)

qa/suites/nvmeof: use SCALING_DELAYS: '120'

Increase delays for qa/workunits/nvmeof/scalability_test.sh
as namespace rebalancing takes more time. After upscaling,
a gateway initially could be in 'CREATED', a valid state during
gateway initialization, but then the state should progress
to 'AVAILABLE' within a couple of seconds.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3b9b290)

nvmeofgw*: change log level of critical nvmeof monitor events to 1

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 57c4e16)

nvmeofgw*: 2 fixes - for duplicated optimized paths and for GW startup
 1. fix duplicated optimized host paths - trigger process_gw_down upon
    fast-gw reboot, remove old fast-reboot handlers
 2. fix GW startup - trigger process_gw_down when the WAIT_BLOCKLIST timer expires

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 4397c02)

qa/workunits/nvmeof/fio_test: Log cluster status if fio fails

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e450406)

qa/suites/nvmeof: add more asserts to scalability_test

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 877c726)

qa/suites/nvmeof: Run fio with scalability test

Run fio in parallel with scalability test.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e2f3bed)

qa/workunits/nvmeof/fio_test.sh: add more debug commands

Add more commands to debug when fio fails:
- nvme list-subsys /dev/nvme1n2
- nvme list from the initiator
- nvme list | wc -l
- nvme id-ns /dev/nvme1n2

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fd8fbea)

monitoring: fix NVMeoFSubsystemNamespaceLimit

The alert is not triggered as expected; change the query
to fix that.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2282348

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 4a7866a)

mgr/cephadm/nvmeof: Add QOS timeslice field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69952

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 7b4af1f)

Merge pull request ceph#60871 from leonidc/leonidc-epoch-filter

Epoch filtering

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Aviv Caro <Aviv.Caro@ibm.com>
Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 3cdf529)

mon/nvmeofgw*: fix no-listeners FSM, fix detection of no-listeners
condition

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 66ca80e)

restore proper no-listeners logic

Signed-off-by: leonidc <leonidc@il.ibm.com>
VallariAg pushed a commit to VallariAg/ceph that referenced this pull request Mar 10, 2025
Following PR ceph#55495, which fixed the
dashboards with regard to multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the same issues
for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
(cherry picked from commit 810c706)
baum pushed a commit to ceph/ceph-ci that referenced this pull request Mar 13, 2025
========================================

Resolves: rhbz#2350962

qa/suites/nvmeof: wait for service "nvmeof.mypool.mygroup0"

This is because nvmeof gateway group names are now
part of service id.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit da8e95c)

labeler: add nvmeof labelers

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit d513cc5)

qa/suites/nvmeof: use "latest" image of gateway and cli

Change nvmeof gateway and cli image from 1.2 to "latest".

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0bab553)

qa/workunits/nvmeof/setup_subsystem.sh: use --no-group-append

In newer versions of the nvmeof cli, "subsystem add" needs
this flag to ensure the subsystem name is the value of --subsystem.
Otherwise, the gateway group is appended
to the end of the subsystem name.

This fixes the teuthology nvmeof suite (currently all jobs fail
because of this).

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 303f18b)

qa/suites/nvmeof: fix nvmeof_namespaces.yaml

When basic_tests.sh is executed in parallel
with namespace_test.sh, sometimes namespace_test.sh
starts before fio_test.sh, which would break the test.

So this change ensures "fio_test.sh" is started before,
and executed in parallel with, "namespace_test.sh".

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 6e15b5e)

qa/suite/nvmeof: add asserts to scalability_test.sh

Add assertions to 'status_checks()' function.
Use "apply" and "redeploy", instead of "orch rm" and
"apply" to upscale/downscale gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 9393509)

qa/suites/nvmeof/basic: use default image in nvmeof_initiator.yaml

Instead of using quay.io/ceph/nvmeof:latest, use default
image in ceph build.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f670916)

qa/suites/nvmeof/thrasher: use 120 subsystems and 8 ns each

For the thrasher test:
1. Run it on 120 subsystems with 8 namespaces each
2. Run FIO for 20 mins (instead of 15 mins)
3. Run FIO for a few randomly picked devices
    (using `--random_devices 200`)

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e1983c5)

qa/workunits/nvmeof/setup_subsystem.sh: add list_namespaces() func

Add a list_namespaces function which could be useful for debugging later.
Remove the extra call of list_subsystems so it's only logged once after
subsystems are completely set up.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 2030411)

qa/workunits/nvmeof/basic_tests.sh: Assert number of devices

Check the number of devices connected after connect-all.
It should equal the number of namespaces created.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7ee4677)

qa/suites/nvmeof/thrash: add 10-subsys-90-namespace-no_huge_pages.yaml

Add a test for no huge pages by using the config
"spdk_mem_size: 4096" in the 10-subsystems,
90-namespaces-each setup.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 09ade3d)

monitoring: Add prometheus alert NVMeoFMultipleNamespacesOfRBDImage

NVMeoFMultipleNamespacesOfRBDImage alerts the user if an RBD image
is used for multiple namespaces. This alert is important for cases
where namespaces are created on the same image in different gateway groups.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 61b3289)
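
The condition the alert checks can be modeled in Python (a sketch over hypothetical (subsystem, image) pairs; the real alert is a Prometheus query over gateway metrics):

```python
from collections import Counter

def images_with_multiple_namespaces(namespaces):
    # namespaces: iterable of (subsystem_nqn, rbd_image) pairs.
    # Flag any RBD image that backs more than one namespace.
    counts = Counter(image for _, image in namespaces)
    return sorted(image for image, n in counts.items() if n > 1)
```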

mon/NVMeofGwMap: add healthcheck warning NVMEOF_GATEWAY_DELETING

Add a warning when NVMeoF gateways are in DELETING state.
This happens when there are namespaces under the deleted gateway's
ANA group ID.

The gateways are removed completely after users manually move these
namespaces to another load balancing group, or if a new gateway is
deployed on that host.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 571dd53)

src/common/options/mon.yaml.in: add mon_nvmeofgw_delete_grace

This config sets the delay before the NVMEOF_GATEWAY_DELETING
healthcheck warning is triggered, which happens when NVMeoF
gateways are in DELETING state for too long (indicating a
problem in namespace load-balancing).
The default value for this config is 15 mins.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7b33f77)

mon/NVMeofGwMap: add delay to NVMEOF_GATEWAY_DELETING warning

Instead of triggering immediately, have this healthcheck trigger
after some time has elapsed. This delay can be configured by
mon_nvmeofgw_delete_grace.

Track the time when gateways go into DELETING state in a new
member var (of NVMeofGwMon) 'gws_deleting_time'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 56cf512)
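
The grace-period logic of the two commits above can be sketched as a simplified Python model with an injected clock (the real implementation is C++ in NVMeofGwMon; names here only mirror the commit message):

```python
class DeletingWarningTracker:
    """Track when gateways enter DELETING and warn only after a grace period."""

    def __init__(self, grace_secs=15 * 60):  # models the 15 min mon_nvmeofgw_delete_grace default
        self.grace_secs = grace_secs
        self.gws_deleting_time = {}  # gw_id -> time DELETING was first observed

    def observe(self, gw_id, state, now):
        if state == "DELETING":
            # Remember only the first time we saw the gateway in DELETING.
            self.gws_deleting_time.setdefault(gw_id, now)
        else:
            # Gateway left DELETING; clear its timer.
            self.gws_deleting_time.pop(gw_id, None)

    def gateways_to_warn(self, now):
        return sorted(gw for gw, since in self.gws_deleting_time.items()
                      if now - since >= self.grace_secs)
```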

qa/workunits/nvmeof/basic_tests.sh: fix connect-all assert

There seems to have been a change in the 'nvme list' JSON output
which caused assert failures after the 'nvme connect-all'
command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 22f91cd)

qa/tasks/nvmeof: Add --refresh flag in do_checks() cmds

This ensures the latest state of the services is displayed.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 023c209)

qa: Add qa/suites/nvmeof/thrash/gateway-initiator-setup/2-subsys-8-namespace.yaml

This allows running the nvmeof thrasher test on a smaller
configuration which finishes faster than the 120subsys-8ns
config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d7551f7)

qa/tasks/nvmeof.py: Add stop_and_join method to thrasher

Also add nvme-gw show command output in do_checks()
and revive daemons with 'ceph orch daemon start' in
revive_daemon() method.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 0b0f450)

qa/workunits/nvmeof/fio_test.sh: fix fio filenames

Filenames were provided to fio as nvme1n1:nvme1n2;
they should be full paths (/dev/nvme1n1:/dev/nvme1n2).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 06811a4)
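
The fix amounts to prefixing each device name before joining with ':' for fio's --filename argument; a minimal sketch (hypothetical helper, not the actual script):

```python
def fio_filename_arg(device_names):
    # fio's --filename takes ':'-separated full device paths,
    # e.g. /dev/nvme1n1:/dev/nvme1n2, not bare names like nvme1n1.
    return ":".join(d if d.startswith("/dev/") else f"/dev/{d}"
                    for d in device_names)
```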

qa/tasks/nvmeof.py: Do not use 'systemctl start' in thrasher

Instead, use 'daemon start' in revive_daemon() to bring
up gateways thrashed with 'systemctl stop'.
This is because the 'systemctl start' method seems to have
temporary issues.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit b5e6a0c)

qa/tasks/nvmeof.py: make separate calls in do_checks()

When running the 'nvme list-subsys <device>' command
in do_checks(), make separate calls instead of combining
the commands for all devices with '&&'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 5a58114)

qa/tasks/nvmeof.py: Fix do_checks() method

All checks currently run on the initiator node. Now run
all "ceph" commands on one of the gateway hosts instead,
and keep the "nvme list" and "nvme list-subsys" checks
on the initiator node.

Add retry (5 times) to do_checks if any command fails.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7dfd3d3)
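
The retry behaviour can be sketched as a generic helper (an illustrative sketch, not the actual teuthology code):

```python
import time

def with_retries(fn, attempts=5, delay=0):
    # Re-run fn up to `attempts` times; re-raise only the last failure.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```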

qa/tasks/nvmeof.py: Ignore systemctl_stop thrashing method

Do not use systemctl_stop method to thrash daemons,
just use 'ceph orch daemon stop' and 'ceph orch daemon rm'
methods to thrash nvmeof gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d4aec58)

qa/tasks/nvmeof.py: Add teardown() method

Add a teardown method to remove the nvmeof service
before the rest of the cluster tears down.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e8201d3)

qa/suites/nvmeof: Remove watchdog from thrasher

This commit does the following:
1. remove the watchdog from the thrasher
2. remove the wait from fio_test
3. change the thrasher switcher wait-time to 10 mins

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 76b4028)

monitoring: add NVMeoFMaxGatewayGroups

Add config NVMeoFMaxGatewayGroups to config.libsonnet
and set it to 4 (groups).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c5c4b10)

monitoring: add alert NVMeoFMaxGatewayGroups

Add alert NVMeoFMaxGatewayGroups to prometheus_alerts.yml
and prometheus_alerts.libsonnet.

This alert indicates that the maximum number of NVMeoF gateway
groups has been reached in a cluster.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit ab4a1dd)

monitoring: add tests for NVMeoFMaxGatewayGroups

Add unit tests for alert NVMeoFMaxGatewayGroups
in monitoring/ceph-mixin/tests_alerts/test_alerts.yml

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e5cb5db)

qa/suites/nvmeof: use SCALING_DELAYS: '120'

Increase delays for qa/workunits/nvmeof/scalability_test.sh
as namespace rebalancing takes more time. After upscaling, a
gateway could initially be in 'CREATED', which is a valid state
during gateway initialization, but the state should then progress
to 'AVAILABLE' within a couple of seconds.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3b9b290)
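
The polling behaviour described above can be sketched as follows (a hypothetical helper tolerating the transient 'CREATED' state; state names are taken from the commit message):

```python
def wait_for_available(poll_state, max_polls=24):
    """Poll the gateway state: tolerate the transient 'CREATED' state seen
    during initialization, succeed on 'AVAILABLE', fail fast otherwise."""
    for _ in range(max_polls):
        state = poll_state()
        if state == "AVAILABLE":
            return True
        if state != "CREATED":
            raise RuntimeError(f"unexpected gateway state: {state}")
    return False
```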

qa/workunits/nvmeof/fio_test: Log cluster status if fio fails

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e450406)

qa/suites/nvmeof: add more asserts to scalability_test

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 877c726)

qa/suites/nvmeof: Run fio with scalability test

Run fio in parallel with scalability test.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e2f3bed)

qa/workunits/nvmeof/fio_test.sh: add more debug commands

Add more commands to debug when fio fails:
- nvme list-subsys /dev/nvme1n2
- nvme list from the initiator
- nvme list | wc -l
- nvme id-ns /dev/nvme1n2

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fd8fbea)

mon: Add nvmeof group/gateway name in "ceph -s"

In "ceph status" command output, show gateway
group names and gateway names.

Before:
```
  services:
    mon:    4 daemons, quorum ceph-nvme-vm8,ceph-nvme-vm1,ceph-nvme-vm7,ceph-nvme-vm6 (age 71m)
    mgr:    ceph-nvme-vm8.tgytdq(active, since 73m), standbys: ceph-nvme-vm6.tequqo, ceph-nvme-vm1.pxrofr, ceph-nvme-vm7.lbxrea
    osd:    4 osds: 4 up (since 70m), 4 in (since 70m)
    nvmeof: 4 gateways active (4 hosts)
```

After:
```
  services:
    mon:               4 daemons, quorum ceph-nvme-vm14,ceph-nvme-vm11,ceph-nvme-vm13,ceph-nvme-vm12 (age 17m)
    mgr:               ceph-nvme-vm14.gjjgvq(active, since 19m), standbys: ceph-nvme-vm12.shbvpw, ceph-nvme-vm11.gucgiu, ceph-nvme-vm13.inzizw
    osd:               4 osds: 4 up (since 15m), 4 in (since 16m)
    nvmeof (mygroup1) : 2 gateways active (ceph-nvme-vm13.azfdpk, ceph-nvme-vm14.hdsoxl)
    nvmeof (mygroup2) : 2 gateways active (ceph-nvme-vm11.hnooxs, ceph-nvme-vm12.wcjcjs)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e3fab2a)
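
The per-group line in the "After" output can be reproduced with a small formatter (an illustrative Python sketch; the real formatting is done by the monitor in C++):

```python
def nvmeof_status_line(group, gateways):
    # e.g. "nvmeof (mygroup1) : 2 gateways active (gw1, gw2)"
    return (f"nvmeof ({group}) : {len(gateways)} gateways active "
            f"({', '.join(gateways)})")
```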

mon: show count of active/total nvmeof gws in "ceph -s"

Improve "ceph status" output for nvmeof service:

1. Group by service_id (<pool>.<group>) instead of
  just by gateway groups.
2. Show total gateway count from NVMeofGwMap, and count
  of active gateways.

New output:
```
  services:
    mon:                     4 daemons, quorum ceph-nvme-vm31,ceph-nvme-vm28,ceph-nvme-vm30,ceph-nvme-vm29 (age 16m)
    mgr:                     ceph-nvme-vm31.wnfclf(active, since 18m), standbys: ceph-nvme-vm29.iuwqin, ceph-nvme-vm28.lnnyui, ceph-nvme-vm30.fitwnw
    osd:                     4 osds: 4 up (since 14m), 4 in (since 15m)
    nvmeof (mypool.mygroup1): 2 gateways: 1 active (ceph-nvme-vm30.kkcfux)
    nvmeof (mypool.mygroup2): 2 gateways: 2 active (ceph-nvme-vm28.mfqucr, ceph-nvme-vm29.hrizzl)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3065ffe)

monitoring: fix NVMeoFSubsystemNamespaceLimit

The alert is not triggered as expected; change the query
to fix that.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2282348

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 4a7866a)

mgr/cephadm: set service name for DaemonDescription object used during daemon removal

What this is specifically fixing is that the nvmeof post_remove function
needs the service spec of the daemon's service to get the pool and group
tied to the nvmeof daemon. We have been using the DaemonDescription
"service_name" property to get the service name in order to get the spec.
This works in a regular deployment. However, it is possible to make a placement
like

placement:
  hosts:
  - vm-00=nvmeof.a
  - vm-01=nvmeof.b

and one of the nvmeof CI tests was doing so, which is why we saw this.
That will cause the nvmeof daemon names to be nvmeof.nvmeof.a and
nvmeof.nvmeof.b and not include the service name at all. In this
case, the service_name property on the DaemonDescription class
will end up getting service names nvmeof.nvmeof.a and nvmeof.nvmeof.b
respectively from the nvmeof daemons, which will cause us to fail
to find the spec in post_remove. This change makes it so we manually set
the service name for the DaemonDescription object that gets passed
to post_remove based on the service name of the daemon object we
get from the host cache, which will still have the correct service
name even if the daemon has a custom name. Then the nvmeof post_remove
function will get the correct service name and be able to find the spec.

Additionally, we are now technically taking the daemon type
and id from the DaemonDescription in our HostCache as well, but
this is mostly just for consistency and should have no real impact.

Fixes: https://tracker.ceph.com/issues/68962

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit d8dae24)
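
A simplified Python model of the fix (the real classes live in mgr/cephadm; this only illustrates copying the service name from the cached daemon instead of deriving it from a possibly custom daemon name like "nvmeof.a"):

```python
from dataclasses import dataclass

@dataclass
class DaemonDescription:
    daemon_type: str
    daemon_id: str
    service_name: str = ""

def desc_for_removal(cached: DaemonDescription) -> DaemonDescription:
    dd = DaemonDescription(cached.daemon_type, cached.daemon_id)
    # Set the service name explicitly from the host-cache daemon object,
    # which is correct even when the daemon has a custom name.
    dd.service_name = cached.service_name
    return dd
```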

Add multi-cluster support (showMultiCluster=True) to alerts

Following PR ceph/ceph#55495 fixing the
dashboard in regards to multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the issues
for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
(cherry picked from commit 810c706)

mon/nvme: fix unused lambda capture warnings

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit edb0321)

src/nvmeof/NVMeofGwMonitorClient: remove MDS client, not needed

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit f806872)

cephadm/nvmeof: fix ports when default values are overridden

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e717a92)

cephadm/nvmeof: support per-node gateway addresses

Added gateway and discovery address maps to the service specification.
These maps store per-node service addresses. The address is first searched
in the map, then in the spec address configuration. If neither is defined,
the host IP is used as a fallback.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 2f47f9d)

cephadm/nvmeof: support no huge pages for nvmeof spdk

depends on: ceph/ceph-nvmeof#898

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 38513cb)

pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id

- make service_id better aligned with default/empty group
  (ceph/ceph@f6d552d)
- fix service_id in nvmeof daemon add

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e1612d0)

python-common/ceph/deployment: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d3cc237)

mgr/cephadm: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 19399de)

python-common/ceph/deployment: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d18e6fb)

mgr/cephadm: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d208242)

python-common/ceph/deployment: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit cb28d39)

mgr/cephadm: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 11de53f)

python-common/ceph/deployment: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit dd4b357)

mgr/cephadm: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0dcc207)

python-common/ceph/deployment: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 4269d7c)

mgr/cephadm: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 1807a55)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add auto rebalance fields to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit bfc8fb6)

mgr/cephadm/nvmeof: Rewrite NVMEoF fields validation.
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 31283c0)

mgr/cephadm/nvmeof: Add key verification field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69413

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 26a0f9a)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm: change ceph-nvmeof gw image version to 1.4
Fixes https://tracker.ceph.com/issues/69099

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add verify_listener_ip field to NVMeOF configuration and remove obsolete enable_key_encryption
Fixes https://tracker.ceph.com/issues/69731

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 744b04a)

mgr/cephadm/nvmeof: Add max_hosts field to NVMeOF configuration and update default values
Fixes https://tracker.ceph.com/issues/69759

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0d8bd4d)

mgr/cephadm/nvmeof: Add SPDK iobuf options field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69554

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 42bac97)

mgr/cephadm/nvmeof: Add QOS timeslice field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69952

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 7b4af1f)

mon/nvmeofgw*: fix HA usecase when gateway has no listeners: behaves like no-subsystems

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 47e7a24)

mon/nvmeofgw*: monitors publish in 'nvme-gw show' the ANA group
responsible for namespace rebalance

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit c358483)

nvmeofgw* : fix publishing rebalance index

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit ceb62c0)

nvmeofgw*: 2 fixes - for duplicated optimized paths and a fix for GW startup
 1. fix duplicated optimized host paths - trigger process_gw_down upon
    fast-gw reboot; removed the old fast-reboot handlers
 2. fix GW startup - trigger process_gw_down when the WAIT_BLOCKLIST timer expires

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 4397c02)

Merge pull request #60871 from leonidc/leonidc-epoch-filter

Epoch filtering

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Aviv Caro <Aviv.Caro@ibm.com>
Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 3cdf529)

mon/nvmeofgw*: fix no-listeners FSM, fix detection of no-listeners
condition

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 66ca80e)

restore proper no-listeners logic

Signed-off-by: leonidc <leonidc@il.ibm.com>
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Mar 17, 2025
========================================

Resolves: rhbz#2350962

qa/tasks/nvmeof.py: add nvmeof gw-group to deployment

Group was made a required parameter
(`ceph orch apply nvmeof <pool> <group>`) in
ceph#58860.
That broke the `nvmeof` suite, so this PR fixes that.

Right now, all gateways are deployed in a single group.
Later, this would be changed to multiple groups for a better test.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit c9a6fed)

qa: Expand nvmeof thrasher and add nvmeof_namespaces.yaml job

1. qa/tasks/nvmeof.py: add other methods to stop nvmeof daemons
2. add qa/workunits/rbd/nvmeof_namespace_test.sh, which adds and
   deletes new namespaces. It is run in the nvmeof_namespaces.yaml
   job, where fio runs against the other namespaces in the background.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 58d8be9)

qa/suites/nvmeof/basic: add nvmeof_scalability test

Add test to upscale/downscale nvmeof
gateways.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit e5a9cda)

qa: move nvmeof shell scripts to qa/workunits/nvmeof

Move all scripts qa/workunits/rbd/nvmeof_*.sh
to qa/workunits/nvmeof/*.sh

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 2ed818e)

qa/suites/nvmeof: increase hosts in cluster setup

In the "nvmeof" task, change the "client" config to "installer",
which allows it to take inputs like "host.a".

nvmeof/basic: change 2-gateway-2-initiator to
	       4-gateway-2-initiator cluster
nvmeof/thrash: change 3-gateway-1-initiator to
	        4-gateway-1-initiator cluster

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 4d97b1a)

qa/suites/nvmeof: wait for service "nvmeof.mypool.mygroup0"

This is because nvmeof gateway group names are now
part of service id.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit da8e95c)

labeler: add nvmeof labelers

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit d513cc5)

qa/suites/nvmeof: use "latest" image of gateway and cli

Change nvmeof gateway and cli image from 1.2 to "latest".

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0bab553)

qa/workunits/nvmeof/setup_subsystem.sh: use --no-group-append

In newer versions of the nvmeof cli, "subsystem add" needs
this flag to ensure the subsystem name is exactly the value of
--subsystem; otherwise the gateway group is appended
at the end of the subsystem name.

This fixes the teuthology nvmeof suite (currently all jobs fail
because of this).

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 303f18b)

mon: add nvmeof healthchecks

Add NVMeofGwMap::get_health_checks, which raises
NVMEOF_SINGLE_GATEWAY if any of the groups has
only 1 gateway.

In NVMeofGwMon, call `encode_health` and `load_health`
to register healthchecks. This will add nvmeof healthchecks
to "ceph health" output.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 1cad040)

mon: add warning NVMEOF_GATEWAY_DOWN

In src/mon/NVMeofGwMap.cc,
add warning NVMEOF_GATEWAY_DOWN when any
gateway is in GW_UNAVAILABLE state.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0006599)
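
The two healthchecks above can be modeled together in a short Python sketch (the real logic is C++ in src/mon/NVMeofGwMap.cc; the group/state data shape here is an assumption):

```python
def nvmeof_health_checks(groups):
    """groups: {group_name: [gateway_state, ...]} with states such as
    'GW_AVAILABLE' / 'GW_UNAVAILABLE'. Returns the raised check names."""
    checks = set()
    for gws in groups.values():
        if len(gws) == 1:
            checks.add("NVMEOF_SINGLE_GATEWAY")   # no HA with one gateway
        if any(state == "GW_UNAVAILABLE" for state in gws):
            checks.add("NVMEOF_GATEWAY_DOWN")     # a gateway is unavailable
    return checks
```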

qa/suites/nvmeof: add mtls test

Add qa/workunits/nvmeof/mtls_test.sh which enables
mtls config and redeploy, then verify and disables
mtls config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fdc93ad)

monitoring: add 2 nvmeof alerts to prometheus_alerts.yaml

- `NVMeoFMissingListener`: trigger if all listeners
     are not created for each gateway in a subsystem
- `NVMeoFZeroListenerSubsystem`: trigger if a subsystem has no listeners

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f02e312)

monitoring: add 2 new nvmeof alerts

Add NVMeoFMissingListener and NVMeoFZeroListenerSubsystem
alerts to prometheus_alerts.libsonnet.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7994fea)

monitoring: add tests for 2 new nvmeof alerts

Add test for alerts NVMeoFMissingListener and
NVMeoFZeroListenerSubsystem to test_alerts.yml.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit a878460)

qa/suites/nvmeof: add nvmeof warnings to log-ignorelist

Add NVMEOF_SINGLE_GATEWAY and NVMEOF_GATEWAY_DOWN
warnings to nvmeof:thrash job's log-ignorelist

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 73d5c01)

qa/suites/nvmeof: fix nvmeof_namespaces.yaml

When basic_tests.sh is executed in parallel
with namespace_test.sh, sometimes namespace_test.sh
starts before fio_test.sh, which would break the test.

So this change ensures "fio_test.sh" is started before,
and executed in parallel with, "namespace_test.sh".

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 6e15b5e)

qa/suite/nvmeof: add asserts to scalability_test.sh

Add assertions to the 'status_checks()' function.
Use "apply" and "redeploy" instead of "orch rm" and
"apply" to upscale/downscale gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 9393509)

qa/suite/nvmeof/thrash: increase number of thrashing

- Run fio for 15 mins (instead of 10min).
- nvmeof.py: change daemon_max_thrash_times default from 3 to 5
- nvmeof.py: run nvme list in do_checks()

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 51743e6)

qa/suites/nvmeof/basic: use default image in nvmeof_initiator.yaml

Instead of using quay.io/ceph/nvmeof:latest, use default
image in ceph build.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f670916)

qa/suites/nvmeof/thrash: Add "is unavailable" to log-ignorelist

This commit also:
- Remove --rbd_iostat from thrasher fio
- Log iteration details before printing stats in nvmeof_tharsher

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c0ca0eb)

qa/suites/nvmeof/thrasher: use 120 subsystems and 8 ns each

For tharsher test:
1. Run it on 120 subsystems with 8 namespaces each
2. Run FIO for 20 mins (instead of 15mins)
2. Run FIO for few randomly picked devices
    (using `--random_devices 200`)

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e1983c5)

qa/tasks/nvmeof.py: Improve thrasher and rbd image creation

Create rbd images in one command using ";" to queue them,
instead of running "cephadm shell -- rbd create" again
and again for each image.

Improve the method to select to-be-thrashed daemons.
Use randint() and sample(), instead of weights/skip.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 82118e1)

qa/workunits/nvmeof/setup_subsystem.sh: add list_namespaces() func

Add list_namespaces function which could be useful for debugging later.
Remove extra call of list_subsystems so it's only logged once after
subsystems are completely setup.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 2030411)

qa/workunits/nvmeof/basic_tests.sh: Assert number of devices

Check number of devices connected after connect-all.
It should be equal to number of namespaces created.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7ee4677)

qa/suites/nvmeof/thrash: add 10-subsys-90-namespace-no_huge_pages.yaml

Add test for no-huge-pages by using config
"spdk_mem_size: 4096" in 10 subsystems
and 90 namespaces each setup.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 09ade3d)

monitoring: Add prometheus alert NVMeoFMultipleNamespacesOfRBDImage

NVMeoFMultipleNamespacesOfRBDImage alerts the user if a RBD image
is used for multiple namespaces. This is important alerts for cases
where namespaces are created on same image for different gateway group.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 61b3289)

mon/NVMeofGwMap: add healthcheck warning NVMEOF_GATEWAY_DELETING

Add a warning when NVMeoF gateways are in DELETING state.
This happens when there are namespaces under the deleted gateway's
ANA group ID.

The gateways are removed completely after users manually move these
namespaces to another load balancing group. Or if a new gateway is
deployed on that host.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 571dd53)

src/common/options/mon.yaml.in: add mon_nvmeofgw_delete_grace

This config allows to configure the delay in triggering
NVMEOF_GATEWAY_DELETING healthcheck warning, which is
triggered when NVMeoF gateways are in DELETEING state
for too long (indicating a problem in namespace
load-balacing).
The default value for this config is 15 mins.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7b33f77)

mon/NVMeofGwMap: add delay to NVMEOF_GATEWAY_DELETING warning

Instead of immediately triggering, have this healthcheck trigger
after some time has elasped. This delay can be configured by
mon_nvmeofgw_delete_grace.

Track the time when gateways go into DELETING state in a new
member var (of NVMeofGwMon) 'gws_deleting_time'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 56cf512)

qa/workunits/nvmeof/basic_tests.sh: fix connect-all assert

There seems to be change in 'nvme list' json output
which caused failures in asserts after 'nvme connect-all'
command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 22f91cd)

qa/tasks/nvmeof: Add --refresh flag in do_checks() cmds

This is to ensure latest state of the services are displayed.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 023c209)

qa: Add qa/suites/nvmeof/thrash/gateway-initiator-setup/2-subsys-8-namespace.yaml

This allows to run nvmeof thrasher test on smaller
confgurations which finshes faster than 120subsys-8ns
config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d7551f7)

qa/tasks/nvmeof.py: Add stop_and_join method to thrasher

Also add nvme-gw show command output in do_checks()
and revive daemons with 'ceph orch daemon start' in
revive_daemon() method.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 0b0f450)

qa/workunits/nvmeof/fio_test.sh: fix fio filenames

Filenames were provided to fio as nvme1n1:nvme1n2,
it should be pull path (/dev/nvme1n1:/dev/nvme1n2).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 06811a4)

qa/tasks/nvmeof.py: Do not use 'systemctl start' in thrasher

Instead use 'daemon start' in revive_daemon() to bring
up gateways thrashed with 'systemctl stop'.
This is because 'systemctl start' method seems to temporary
issues.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit b5e6a0c)

qa/tasks/nvmeof.py: make seperate calls in do_checks()

When running 'nvme list-subsys <device>' command
in do_checks(), instead of combining command for
all devices with '&&', make seperate calls.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 5a58114)

qa/tasks/nvmeof.py: Fix do_checks() method

All checks currently run on initator node, now
run all "ceph" commands on one of gateway hosts
instead of initator nodes. And run "nvme list"
and "nvme list-subsys" checks on initator node.

Add retry (5 times) to do_checks if any command fails.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7dfd3d3)

qa/tasks/nvmeof.py: Ignore systemctl_stop thrashing method

Do not use systemctl_stop method to thrash daemons,
just use 'ceph orch daemon stop' and 'ceph orch daemon rm'
methods to thrash nvmeof gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d4aec58)

qa/tasks/nvmeof.py: Add teardown() method

Add teardown method to remove nvmeof service
before rest of the cluster tearsdown.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e8201d3)

qa/suites/nvmeof: Remove watchdog from thrasher

This commit does the following:
1. remove watchdog from thrasher
1. remove wait from fio_test
3. change thrasher switcher wait-time to 10 mins

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 76b4028)

monitoring: add NVMeoFMaxGatewayGroups

Add config NVMeoFMaxGatewayGroups to config.libsonnet
and set it to 4 (groups).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c5c4b10)

monitoring: add alert NVMeoFMaxGatewayGroups

Add alert NVMeoFMaxGatewayGroups to prometheus_alerts.yml
and prometheus_alerts.libsonnet.

This alerts is to indicate if max number of NVMeoF gateway
groups have been reached in a cluster.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit ab4a1dd)

monitoring: add tests for NVMeoFMaxGatewayGroups

Add unit tests for alert NVMeoFMaxGatewayGroups
in monitoring/ceph-mixin/tests_alerts/test_alerts.yml

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e5cb5db)

qa/suites/nvmeof: use SCALING_DELAYS: '120'

Increase delays for qa/workunits/nvmeof/scalability_test.sh
as namespace rebalancing takes more time. After upscaling,
gateway initially could be 'CREATED', it is a valid state during
gateway initialization, but then the state should progress
to 'AVAILABLE' within couple of seconds.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3b9b290)

qa/workunits/nvmeof/fio_test: Log cluster status if fio fails

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e450406)

qa/suites/nvmeof: add more asserts to scalability_test

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 877c726)

qa/suites/nvmeof: Run fio with scalability test

Run fio in parallel with scalability test.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e2f3bed)

qa/workunits/nvmeof/fio_test.sh: add more debug commands

Add more commands to debug when fio fails:
- nvme list-subsys /dev/nvme1n2
- nvme list from the initiator
- nvme list | wc -l
- nvme id-ns /dev/nvme1n2

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fd8fbea)

mon: Add nvmeof group/gateway name in "ceph -s"

In "ceph status" command output, show gateway
group names and gateway names.

Before:
```
  services:
    mon:    4 daemons, quorum ceph-nvme-vm8,ceph-nvme-vm1,ceph-nvme-vm7,ceph-nvme-vm6 (age 71m)
    mgr:    ceph-nvme-vm8.tgytdq(active, since 73m), standbys: ceph-nvme-vm6.tequqo, ceph-nvme-vm1.pxrofr, ceph-nvme-vm7.lbxrea
    osd:    4 osds: 4 up (since 70m), 4 in (since 70m)
    nvmeof: 4 gateways active (4 hosts)
```

After:
```
  services:
    mon:               4 daemons, quorum ceph-nvme-vm14,ceph-nvme-vm11,ceph-nvme-vm13,ceph-nvme-vm12 (age 17m)
    mgr:               ceph-nvme-vm14.gjjgvq(active, since 19m), standbys: ceph-nvme-vm12.shbvpw, ceph-nvme-vm11.gucgiu, ceph-nvme-vm13.inzizw
    osd:               4 osds: 4 up (since 15m), 4 in (since 16m)
    nvmeof (mygroup1) : 2 gateways active (ceph-nvme-vm13.azfdpk, ceph-nvme-vm14.hdsoxl)
    nvmeof (mygroup2) : 2 gateways active (ceph-nvme-vm11.hnooxs, ceph-nvme-vm12.wcjcjs)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e3fab2a)

mon: show count of active/total nvmeof gws in "ceph -s"

Improve "ceph status" output for nvmeof service:

1. Group by service_id (<pool>.<group>) instead of
  just by gateway groups.
2. Show total gateway count from NVMeofGwMap, and count
  of active gateways.

New output:
```
  services:
    mon:                     4 daemons, quorum ceph-nvme-vm31,ceph-nvme-vm28,ceph-nvme-vm30,ceph-nvme-vm29 (age 16m)
    mgr:                     ceph-nvme-vm31.wnfclf(active, since 18m), standbys: ceph-nvme-vm29.iuwqin, ceph-nvme-vm28.lnnyui, ceph-nvme-vm30.fitwnw
    osd:                     4 osds: 4 up (since 14m), 4 in (since 15m)
    nvmeof (mypool.mygroup1): 2 gateways: 1 active (ceph-nvme-vm30.kkcfux)
    nvmeof (mypool.mygroup2): 2 gateways: 2 active (ceph-nvme-vm28.mfqucr, ceph-nvme-vm29.hrizzl)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3065ffe)

monitoring: fix NVMeoFSubsystemNamespaceLimit

The alert is not triggered as expected; change the query
to fix that.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2282348

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 4a7866a)

mgr/cephadm: set service name for DaemonDescription object used during daemon removal

What this is specifically fixing is that the nvmeof post_remove function
needs the service spec of the daemon's service to get the pool and group
tied to the nvmeof daemon. We have been using the DaemonDescription
"service_name" property to get the service name in order to get the spec.
This works in a regular deployment. However, it is possible to make a placement
like

```
placement:
  hosts:
  - vm-00=nvmeof.a
  - vm-01=nvmeof.b
```

and one of the nvmeof CI tests was doing so, which is why we saw this.
That will cause the nvmeof daemon names to be nvmeof.nvmeof.a and
nvmeof.nvmeof.b and not include the service name at all. In this
case, the service_name property on the DaemonDescription class
will end up getting service names nvmeof.nvmeof.a and nvmeof.nvmeof.b
respectively from the nvmeof daemons, which will cause us to fail
to find the spec in post_remove. This change makes it so we manually set
the service name for the DaemonDescription object that gets passed
to post_remove based on the service name of the daemon object we
get from the host cache, which will still have the correct service
name even if the daemon has a custom name. Then the nvmeof post_remove
function will get the correct service name and be able to find the spec.

Additionally, we are now technically taking the daemon type
and id from the DaemonDescription in our HostCache as well, but
this is mostly just for consistency and should have no real impact.

Fixes: https://tracker.ceph.com/issues/68962

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit d8dae24)

Add multi-cluster support (showMultiCluster=True) to alerts

Following PR ceph/ceph#55495, which fixed the
dashboards for multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the same
issues for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
(cherry picked from commit 810c706)

mon/nvme: fix unused lambda capture warnings

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit edb0321)

src/nvmeof/NVMeofGwMonitorClient: remove MDS client, not needed

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit f806872)

cephadm/nvmeof: fix ports when default values are overridden

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e717a92)

cephadm/nvmeof: support per-node gateway addresses

Added gateway and discovery address maps to the service specification.
These maps store per-node service addresses. The address is first searched
in the map, then in the spec address configuration. If neither is defined,
the host IP is used as a fallback.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 2f47f9d)
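A service spec using these maps might look roughly like this; the field names `addresses` and `discovery_addresses` and all host/IP values are illustrative guesses, not confirmed spec syntax:

```yaml
service_type: nvmeof
service_id: mypool.mygroup0
placement:
  hosts:
    - host-a
    - host-b
spec:
  pool: mypool
  group: mygroup0
  # hypothetical per-node maps; lookup order per the commit:
  # this map first, then the spec-wide address, then the host IP
  addresses:
    host-a: 192.168.0.10
    host-b: 192.168.0.11
  discovery_addresses:
    host-a: 192.168.0.10
    host-b: 192.168.0.11
```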

cephadm/nvmeof: support no huge pages for nvmeof spdk

depends on: ceph/ceph-nvmeof#898

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit 38513cb)

pybind/mgr/orchestrator/module.py: NvmeofServiceSpec service_id

- make service_id better aligned with default/empty group
  (ceph/ceph@f6d552d)
- fix service_id in nvmeof daemon add

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
(cherry picked from commit e1612d0)

python-common/ceph/deployment: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d3cc237)

mgr/cephadm: add SPDK log level to nvmeof configuration
Fixes https://tracker.ceph.com/issues/67258

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 19399de)

python-common/ceph/deployment: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d18e6fb)

mgr/cephadm: change SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67629

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit d208242)

python-common/ceph/deployment: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit cb28d39)

mgr/cephadm: revert SPDK RPC fields in nvmeof configuration
Fixes https://tracker.ceph.com/issues/67844

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 11de53f)

python-common/ceph/deployment: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit dd4b357)

mgr/cephadm: Add namespace netmask parameters to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68542

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0dcc207)

python-common/ceph/deployment: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 4269d7c)

mgr/cephadm: Add resource limits to nvmeof configuration
Fixes https://tracker.ceph.com/issues/68967

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 1807a55)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add auto rebalance fields to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit bfc8fb6)

mgr/cephadm/nvmeof: Rewrite NVMEoF fields validation.
Fixes https://tracker.ceph.com/issues/69176

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 31283c0)

mgr/cephadm/nvmeof: Add key verification field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69413

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 26a0f9a)
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm: change ceph-nvmeof gw image version to 1.4
Fixes https://tracker.ceph.com/issues/69099

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add verify_listener_ip field to NVMeOF configuration and remove obsolete enable_key_encryption
Fixes https://tracker.ceph.com/issues/69731

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 744b04a)

mgr/cephadm/nvmeof: Add max_hosts field to NVMeOF configuration and update default values
Fixes https://tracker.ceph.com/issues/69759

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0d8bd4d)

mgr/cephadm/nvmeof: Add SPDK iobuf options field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69554

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 42bac97)

mgr/cephadm/nvmeof: Add QOS timeslice field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69952

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 7b4af1f)

mon/nvmeofgw*: fix HA use case when gateway has no listeners: behaves like no-subsystems

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 47e7a24)

mon/nvmeofgw*: monitors publish in 'nvme-gw show' the ana group responsible
for namespace rebalance

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit c358483)

nvmeofgw* : fix publishing rebalance index

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit ceb62c0)

nvmeofgw*: 2 fixes - for duplicated optimized paths and for GW startup
 1. fix duplicated optimized host paths - trigger process_gw_down upon
   fast-gw reboot; removed old fast-reboot handlers
 2. fix GW startup - trigger process_gw_down when the WAIT_BLOCKLIST
   timer expires

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 4397c02)

Merge pull request ceph#60871 from leonidc/leonidc-epoch-filter

Epoch filtering

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Aviv Caro <Aviv.Caro@ibm.com>
Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 3cdf529)

mon/nvmeofgw*: fix no-listeners FSM, fix detection of no-listeners
condition

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 66ca80e)

restore proper no-listeners logic

Signed-off-by: leonidc <leonidc@il.ibm.com>
baum pushed a commit to ceph/ceph-ci that referenced this pull request Nov 11, 2025
========================================

Resolves: rhbz#2350962

qa/tasks/nvmeof.py: add nvmeof gw-group to deployment

Group was made a required parameter to
`ceph orch apply nvmeof <pool> <group>` in
ceph/ceph#58860.
That broke the `nvmeof` suite, so this PR fixes that.

Right now, all gateways are deployed in a single group.
Later, this will be changed to use multiple groups for a better test.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit c9a6fed)

qa: Expand nvmeof thrasher and add nvmeof_namespaces.yaml job

1. qa/tasks/nvmeof.py: add other methods to stop nvmeof daemons
2. add qa/workunits/rbd/nvmeof_namespace_test.sh which adds and
   deletes new namespaces. It is run in nvmeof_namespaces.yaml
   job where fio happens to other namespaces in background.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 58d8be9)

qa/suites/nvmeof/basic: add nvmeof_scalability test

Add test to upscale/downscale nvmeof
gateways.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit e5a9cda)

qa: move nvmeof shell scripts to qa/workunits/nvmeof

Move all scripts qa/workunits/rbd/nvmeof_*.sh
to qa/workunits/nvmeof/*.sh

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 2ed818e)

qa/suites/nvmeof: increase hosts in cluster setup

In the "nvmeof" task, change the "client" config to "installer"
which allows it to take inputs like "host.a".

nvmeof/basic: change 2-gateway-2-initiator to
	       4-gateway-2-initiator cluster
nvmeof/thrash: change 3-gateway-1-initiator to
	        4-gateway-1-initiator cluster

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 4d97b1a)

qa/suites/nvmeof: wait for service "nvmeof.mypool.mygroup0"

This is because nvmeof gateway group names are now
part of service id.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit da8e95c)

labeler: add nvmeof labelers

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit d513cc5)

qa/suites/nvmeof: use "latest" image of gateway and cli

Change nvmeof gateway and cli image from 1.2 to "latest".

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0bab553)

qa/workunits/nvmeof/setup_subsystem.sh: use --no-group-append

In newer versions of the nvmeof cli, "subsystem add" needs
this flag to ensure the subsystem name is the value of --subsystem.
Otherwise, in the newer cli version, the gateway group is appended
to the end of the subsystem name.

This fixes the teuthology nvmeof suite (currently all jobs fail
because of this).

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 303f18b)

mon: add nvmeof healthchecks

Add NVMeofGwMap::get_health_checks which raises
NVMEOF_SINGLE_GATEWAY if any of the groups has
only 1 gateway.

In NVMeofGwMon, call `encode_health` and `load_health`
to register healthchecks. This will add nvmeof healthchecks
to "ceph health" output.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 1cad040)

mon: add warning NVMEOF_GATEWAY_DOWN

In src/mon/NVMeofGwMap.cc,
add warning NVMEOF_GATEWAY_DOWN when any
gateway is in GW_UNAVAILABLE state.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0006599)

qa/suites/nvmeof: add mtls test

Add qa/workunits/nvmeof/mtls_test.sh which enables
mtls config and redeploy, then verify and disables
mtls config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fdc93ad)

monitoring: add 2 nvmeof alerts to prometheus_alerts.yaml

- `NVMeoFMissingListener`: triggers if a subsystem does not have
     a listener created on each gateway
- `NVMeoFZeroListenerSubsystem`: triggers if a subsystem has no listeners

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f02e312)

monitoring: add 2 new nvmeof alerts

Add NVMeoFMissingListener and NVMeoFZeroListenerSubsystem
alerts to prometheus_alerts.libsonnet.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7994fea)

monitoring: add tests for 2 new nvmeof alerts

Add test for alerts NVMeoFMissingListener and
NVMeoFZeroListenerSubsystem to test_alerts.yml.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit a878460)

qa/suites/nvmeof: add nvmeof warnings to log-ignorelist

Add NVMEOF_SINGLE_GATEWAY and NVMEOF_GATEWAY_DOWN
warnings to nvmeof:thrash job's log-ignorelist

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 73d5c01)

qa/suites/nvmeof: fix nvmeof_namespaces.yaml

When basic_tests.sh is executed in parallel
with namespace_test.sh, sometimes namespace_test.sh
starts before fio_test.sh, which would break the test.

So this change ensures "fio_test.sh" is started before
and executed in parallel with "namespace_test.sh".

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 6e15b5e)

qa/suite/nvmeof: add asserts to scalability_test.sh

Add assertions to 'status_checks()' function.
Use "apply" and "redeploy", instead of "orch rm" and
"apply" to upscale/downscale gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 9393509)

qa/suite/nvmeof/thrash: increase number of thrashing

- Run fio for 15 mins (instead of 10min).
- nvmeof.py: change daemon_max_thrash_times default from 3 to 5
- nvmeof.py: run nvme list in do_checks()

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 51743e6)

qa/suites/nvmeof/basic: use default image in nvmeof_initiator.yaml

Instead of using quay.io/ceph/nvmeof:latest, use default
image in ceph build.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f670916)

qa/suites/nvmeof/thrash: Add "is unavailable" to log-ignorelist

This commit also:
- Remove --rbd_iostat from thrasher fio
- Log iteration details before printing stats in nvmeof_thrasher

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c0ca0eb)

qa/suites/nvmeof/thrasher: use 120 subsystems and 8 ns each

For thrasher test:
1. Run it on 120 subsystems with 8 namespaces each
2. Run FIO for 20 mins (instead of 15 mins)
3. Run FIO for a few randomly picked devices
    (using `--random_devices 200`)

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e1983c5)

qa/tasks/nvmeof.py: Improve thrasher and rbd image creation

Create rbd images in one command using ";" to queue them,
instead of running "cephadm shell -- rbd create" again
and again for each image.

Improve the method to select to-be-thrashed daemons.
Use randint() and sample(), instead of weights/skip.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 82118e1)
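The batching described above can be sketched as follows; the pool and image names are made up, and the final `cephadm shell` call is shown as a comment since it needs a live cluster:

```shell
# queue all 'rbd create' calls into one command string with ';'
# so that 'cephadm shell' is entered once, not once per image
cmd=""
for i in 1 2 3; do
  cmd="${cmd}rbd create mypool/image${i} --size 1G; "
done
echo "$cmd"
# on a real cluster this single invocation would run the whole batch:
#   cephadm shell -- bash -c "$cmd"
```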

qa/workunits/nvmeof/setup_subsystem.sh: add list_namespaces() func

Add list_namespaces function which could be useful for debugging later.
Remove extra call of list_subsystems so it's only logged once after
subsystems are completely set up.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 2030411)

qa/workunits/nvmeof/basic_tests.sh: Assert number of devices

Check the number of devices connected after connect-all.
It should be equal to the number of namespaces created.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7ee4677)

qa/suites/nvmeof/thrash: add 10-subsys-90-namespace-no_huge_pages.yaml

Add a test for no-huge-pages by using the config
"spdk_mem_size: 4096" in a setup of 10 subsystems
with 90 namespaces each.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 09ade3d)
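In a service spec, the setting from this commit would appear along these lines; everything except the `spdk_mem_size: 4096` value is illustrative:

```yaml
service_type: nvmeof
service_id: mypool.mygroup0
spec:
  pool: mypool
  group: mygroup0
  spdk_mem_size: 4096  # value from the commit; lets SPDK run without huge pages
```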

monitoring: Add prometheus alert NVMeoFMultipleNamespacesOfRBDImage

NVMeoFMultipleNamespacesOfRBDImage alerts the user if an RBD image
is used for multiple namespaces. This is an important alert for cases
where namespaces are created on the same image in different gateway groups.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 61b3289)

mon/NVMeofGwMap: add healthcheck warning NVMEOF_GATEWAY_DELETING

Add a warning when NVMeoF gateways are in DELETING state.
This happens when there are namespaces under the deleted gateway's
ANA group ID.

The gateways are removed completely after users manually move these
namespaces to another load balancing group, or if a new gateway is
deployed on that host.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 571dd53)

src/common/options/mon.yaml.in: add mon_nvmeofgw_delete_grace

This config allows configuring the delay in triggering the
NVMEOF_GATEWAY_DELETING healthcheck warning, which is
triggered when NVMeoF gateways are in DELETING state
for too long (indicating a problem in namespace
load-balancing).
The default value for this config is 15 mins.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7b33f77)
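The option would be declared in src/common/options/mon.yaml.in roughly as below; the exact type, level, and default encoding are assumptions beyond the 15-minute default stated above:

```yaml
- name: mon_nvmeofgw_delete_grace
  type: secs
  level: advanced
  desc: How long NVMeoF gateways may stay in DELETING state before the
    NVMEOF_GATEWAY_DELETING healthcheck warning is raised
  default: 900  # 15 mins
  services:
  - mon
```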

mon/NVMeofGwMap: add delay to NVMEOF_GATEWAY_DELETING warning

Instead of immediately triggering, have this healthcheck trigger
after some time has elapsed. This delay can be configured by
mon_nvmeofgw_delete_grace.

Track the time when gateways go into DELETING state in a new
member var (of NVMeofGwMon) 'gws_deleting_time'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 56cf512)

qa/workunits/nvmeof/basic_tests.sh: fix connect-all assert

There seems to be a change in the 'nvme list' JSON output
which caused failures in asserts after the 'nvme connect-all'
command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 22f91cd)

qa/tasks/nvmeof: Add --refresh flag in do_checks() cmds

This is to ensure the latest state of the services is displayed.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 023c209)

qa: Add qa/suites/nvmeof/thrash/gateway-initiator-setup/2-subsys-8-namespace.yaml

This allows running the nvmeof thrasher test on smaller
configurations which finish faster than the 120subsys-8ns
config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d7551f7)

qa/tasks/nvmeof.py: Add stop_and_join method to thrasher

Also add nvme-gw show command output in do_checks()
and revive daemons with 'ceph orch daemon start' in
revive_daemon() method.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 0b0f450)

qa/workunits/nvmeof/fio_test.sh: fix fio filenames

Filenames were provided to fio as nvme1n1:nvme1n2;
they should be full paths (/dev/nvme1n1:/dev/nvme1n2).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 06811a4)

qa/tasks/nvmeof.py: Do not use 'systemctl start' in thrasher

Instead use 'daemon start' in revive_daemon() to bring
up gateways thrashed with 'systemctl stop'.
This is because the 'systemctl start' method seems to have
temporary issues.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit b5e6a0c)

qa/tasks/nvmeof.py: make separate calls in do_checks()

When running the 'nvme list-subsys <device>' command
in do_checks(), instead of combining the command for
all devices with '&&', make separate calls.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 5a58114)

qa/tasks/nvmeof.py: Fix do_checks() method

All checks currently run on the initiator node; now
run all "ceph" commands on one of the gateway hosts
instead of the initiator node, and run "nvme list"
and "nvme list-subsys" checks on the initiator node.

Add retry (5 times) to do_checks if any command fails.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7dfd3d3)

qa/tasks/nvmeof.py: Ignore systemctl_stop thrashing method

Do not use systemctl_stop method to thrash daemons,
just use 'ceph orch daemon stop' and 'ceph orch daemon rm'
methods to thrash nvmeof gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d4aec58)

qa/tasks/nvmeof.py: Add teardown() method

Add teardown method to remove the nvmeof service
before the rest of the cluster tears down.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e8201d3)

Fixes https://tracker.ceph.com/issues/69099

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>

mgr/cephadm/nvmeof: Add verify_listener_ip field to NVMeOF configuration and remove obsolete enable_key_encryption
Fixes https://tracker.ceph.com/issues/69731

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 744b04a)

mgr/cephadm/nvmeof: Add max_hosts field to NVMeOF configuration and update default values
Fixes https://tracker.ceph.com/issues/69759

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 0d8bd4d)

mgr/cephadm/nvmeof: Add SPDK iobuf options field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69554

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 42bac97)

mgr/cephadm/nvmeof: Add QOS timeslice field to NVMeOF configuration
Fixes https://tracker.ceph.com/issues/69952

Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
(cherry picked from commit 7b4af1f)

mon/nvmeofgw*: fix HA usecase when gateway has no listeners: behaves like no-subsystems

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 47e7a24)

mon/nvmeofgw*: monitors publish in 'nvme-gw show' the ana group
responsible for namespace rebalance

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit c358483)

nvmeofgw*: fix publishing rebalance index

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit ceb62c0)

nvmeofgw*: 2 fixes - for duplicated optimized paths and for GW startup
 1. fix duplicated optimized host paths - trigger process_gw_down upon
   fast-gw reboot, removed old fast-reboot handlers
 2. fix GW startup - trigger process_gw_down when the WAIT_BLOCKLIST timer expires

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 4397c02)

Merge pull request #60871 from leonidc/leonidc-epoch-filter

Epoch filtering

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Aviv Caro <Aviv.Caro@ibm.com>
Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit 3cdf529)

mon/nvmeofgw*: fix no-listeners FSM, fix detection of no-listeners
condition

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 66ca80e)

restore proper no-listeners logic

Signed-off-by: leonidc <leonidc@il.ibm.com>
========================================

Resolves: rhbz#2350962

qa/tasks/nvmeof.py: add nvmeof gw-group to deployment

Group was made a required parameter
(`ceph orch apply nvmeof <pool> <group>`) in
ceph/ceph#58860.
That broke the `nvmeof` suite, so this PR fixes that.

Right now, all gateways are deployed in a single group.
Later, this could be changed to use multiple groups for better test coverage.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit c9a6fed)

qa: Expand nvmeof thrasher and add nvmeof_namespaces.yaml job

1. qa/tasks/nvmeof.py: add other methods to stop nvmeof daemons
2. add qa/workunits/rbd/nvmeof_namespace_test.sh which adds and
   deletes new namespaces. It is run in nvmeof_namespaces.yaml
   job where fio happens to other namespaces in background.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 58d8be9)

qa/suites/nvmeof/basic: add nvmeof_scalability test

Add test to upscale/downscale nvmeof
gateways.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit e5a9cda)

qa: move nvmeof shell scripts to qa/workunits/nvmeof

Move all scripts qa/workunits/rbd/nvmeof_*.sh
to qa/workunits/nvmeof/*.sh

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 2ed818e)

qa/suites/nvmeof: increase hosts in cluster setup

In the "nvmeof" task, change the "client" config to "installer",
which allows it to take inputs like "host.a".

nvmeof/basic: change 2-gateway-2-initiator to
	       4-gateway-2-initiator cluster
nvmeof/thrash: change 3-gateway-1-initiator to
	        4-gateway-1-initiator cluster

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 4d97b1a)

qa/suites/nvmeof: wait for service "nvmeof.mypool.mygroup0"

This is because nvmeof gateway group names are now
part of service id.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit da8e95c)

labeler: add nvmeof labelers

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit d513cc5)

qa/suites/nvmeof: use "latest" image of gateway and cli

Change nvmeof gateway and cli image from 1.2 to "latest".

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0bab553)

qa/workunits/nvmeof/setup_subsystem.sh: use --no-group-append

In newer versions of the nvmeof cli, "subsystem add" needs
this flag to ensure the subsystem name is the value of --subsystem.
Otherwise, the gateway group is appended
at the end of the subsystem name.

This fixes the teuthology nvmeof suite (currently all jobs fail
because of this).

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 303f18b)

mon: add nvmeof healthchecks

Add NVMeofGwMap::get_health_checks which raises
NVMEOF_SINGLE_GATEWAY if any of the groups has
only 1 gateway.

In NVMeofGwMon, call `encode_health` and `load_health`
to register healthchecks. This will add nvmeof healthchecks
to "ceph health" output.
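The check described above reduces to a scan over the gateway map. A minimal Python sketch of the idea (the real logic lives in NVMeofGwMap::get_health_checks in C++; the group/gateway data shapes here are assumptions):

```python
def single_gateway_warnings(gw_map):
    """Return an NVMEOF_SINGLE_GATEWAY warning for each group that
    has exactly one gateway (so no HA failover is possible)."""
    warnings = []
    for group, gateways in gw_map.items():
        if len(gateways) == 1:
            warnings.append(
                f"NVMEOF_SINGLE_GATEWAY: group '{group}' has only 1 gateway")
    return warnings
```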

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 1cad040)

mon: add warning NVMEOF_GATEWAY_DOWN

In src/mon/NVMeofGwMap.cc,
add warning NVMEOF_GATEWAY_DOWN when any
gateway is in GW_UNAVAILABLE state.

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 0006599)

qa/suites/nvmeof: add mtls test

Add qa/workunits/nvmeof/mtls_test.sh which enables
the mtls config and redeploys, then verifies and disables
the mtls config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fdc93ad)

monitoring: add 2 nvmeof alerts to prometheus_alerts.yaml

- `NVMeoFMissingListener`: trigger if a listener is
     missing for any gateway in a subsystem
- `NVMeoFZeroListenerSubsystem`: trigger if a subsystem has no listeners

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f02e312)

monitoring: add 2 new nvmeof alerts

Add NVMeoFMissingListener and NVMeoFZeroListenerSubsystem
alerts to prometheus_alerts.libsonnet.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7994fea)

monitoring: add tests for 2 new nvmeof alerts

Add test for alerts NVMeoFMissingListener and
NVMeoFZeroListenerSubsystem to test_alerts.yml.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit a878460)

qa/suites/nvmeof: add nvmeof warnings to log-ignorelist

Add NVMEOF_SINGLE_GATEWAY and NVMEOF_GATEWAY_DOWN
warnings to nvmeof:thrash job's log-ignorelist

Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
(cherry picked from commit 73d5c01)

qa/suites/nvmeof: fix nvmeof_namespaces.yaml

When basic_tests.sh is executed in parallel
with namespace_test.sh, sometimes namespace_test.sh
starts before fio_test.sh, which would break the test.

So this change ensures "fio_test.sh" is started before,
and executed in parallel with, "namespace_test.sh".

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 6e15b5e)

qa/suite/nvmeof: add asserts to scalability_test.sh

Add assertions to 'status_checks()' function.
Use "apply" and "redeploy", instead of "orch rm" and
"apply" to upscale/downscale gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 9393509)

qa/suite/nvmeof/thrash: increase number of thrashing

- Run fio for 15 mins (instead of 10min).
- nvmeof.py: change daemon_max_thrash_times default from 3 to 5
- nvmeof.py: run nvme list in do_checks()

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 51743e6)

qa/suites/nvmeof/basic: use default image in nvmeof_initiator.yaml

Instead of using quay.io/ceph/nvmeof:latest, use default
image in ceph build.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit f670916)

qa/suites/nvmeof/thrash: Add "is unavailable" to log-ignorelist

This commit also:
- Remove --rbd_iostat from thrasher fio
- Log iteration details before printing stats in nvmeof_thrasher

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c0ca0eb)

qa/suites/nvmeof/thrasher: use 120 subsystems and 8 ns each

For the thrasher test:
1. Run it on 120 subsystems with 8 namespaces each
2. Run FIO for 20 mins (instead of 15 mins)
3. Run FIO for a few randomly picked devices
    (using `--random_devices 200`)

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e1983c5)

qa/tasks/nvmeof.py: Improve thrasher and rbd image creation

Create rbd images in one command using ";" to queue them,
instead of running "cephadm shell -- rbd create" again
and again for each image.

Improve the method to select to-be-thrashed daemons.
Use randint() and sample(), instead of weights/skip.
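The randint()/sample() approach can be sketched as follows. This is a simplified stand-in for the thrasher's selection code, with made-up daemon names, not the actual nvmeof.py implementation:

```python
import random

def pick_daemons_to_thrash(daemons, max_thrash):
    """Pick how many daemons to thrash with randint(), then which
    ones with sample(), instead of per-daemon weight/skip rolls."""
    count = random.randint(1, min(max_thrash, len(daemons)))
    return random.sample(daemons, count)

victims = pick_daemons_to_thrash(
    ["nvmeof.a", "nvmeof.b", "nvmeof.c", "nvmeof.d"], max_thrash=2)
```

sample() guarantees the picked daemons are distinct, which the older weight/skip scheme had to arrange manually.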

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 82118e1)

qa/workunits/nvmeof/setup_subsystem.sh: add list_namespaces() func

Add a list_namespaces function which could be useful for debugging later.
Remove the extra call of list_subsystems so it's only logged once after
subsystems are completely set up.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 2030411)

qa/workunits/nvmeof/basic_tests.sh: Assert number of devices

Check the number of devices connected after connect-all.
It should be equal to the number of namespaces created.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7ee4677)

qa/suites/nvmeof/thrash: add 10-subsys-90-namespace-no_huge_pages.yaml

Add a test for no-huge-pages by using the config
"spdk_mem_size: 4096" in a setup with 10 subsystems
and 90 namespaces each.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 09ade3d)

monitoring: Add prometheus alert NVMeoFMultipleNamespacesOfRBDImage

NVMeoFMultipleNamespacesOfRBDImage alerts the user if an RBD image
is used for multiple namespaces. This is an important alert for cases
where namespaces are created on the same image in different gateway groups.
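The condition behind this alert reduces to counting namespaces per backing RBD image and flagging images referenced more than once. A hedged Python sketch of that logic (the namespace records here are illustrative; the real alert is a Prometheus query over exporter metrics):

```python
from collections import Counter

def images_with_multiple_namespaces(namespaces):
    """Each namespace record names its backing RBD image; flag any
    image that is referenced by more than one namespace."""
    counts = Counter(ns["image"] for ns in namespaces)
    return sorted(img for img, n in counts.items() if n > 1)

flagged = images_with_multiple_namespaces([
    {"nsid": 1, "image": "rbd/img1"},
    {"nsid": 2, "image": "rbd/img2"},
    {"nsid": 3, "image": "rbd/img1"},  # img1 reused, e.g. by another group
])
```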

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 61b3289)

mon/NVMeofGwMap: add healthcheck warning NVMEOF_GATEWAY_DELETING

Add a warning when NVMeoF gateways are in DELETING state.
This happens when there are namespaces under the deleted gateway's
ANA group ID.

The gateways are removed completely after users manually move these
namespaces to another load balancing group, or when a new gateway is
deployed on that host.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 571dd53)

src/common/options/mon.yaml.in: add mon_nvmeofgw_delete_grace

This config sets the delay before triggering the
NVMEOF_GATEWAY_DELETING healthcheck warning, which is
raised when NVMeoF gateways are in DELETING state
for too long (indicating a problem in namespace
load-balancing).
The default value for this config is 15 mins.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7b33f77)

mon/NVMeofGwMap: add delay to NVMEOF_GATEWAY_DELETING warning

Instead of triggering immediately, have this healthcheck trigger
only after some time has elapsed. This delay can be configured with
mon_nvmeofgw_delete_grace.

Track the time when gateways go into DELETING state in a new
member var (of NVMeofGwMon) 'gws_deleting_time'.
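A sketch of the delayed trigger, assuming a monotonic clock and a grace period in seconds. This mirrors the idea of tracking 'gws_deleting_time', but the class and its API are made up, not the actual C++ monitor state:

```python
import time

class DeletingTracker:
    """Warn about a gateway only after it has been in DELETING
    state for longer than the configured grace period."""

    def __init__(self, grace_sec):
        self.grace_sec = grace_sec
        self.deleting_since = {}  # gw_id -> time first seen DELETING

    def observe(self, gw_id, state, now=None):
        """Record the gateway's state; return True once the warning
        should fire."""
        now = time.monotonic() if now is None else now
        if state != "DELETING":
            # Gateway left DELETING: forget its timestamp.
            self.deleting_since.pop(gw_id, None)
            return False
        since = self.deleting_since.setdefault(gw_id, now)
        return (now - since) > self.grace_sec
```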

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 56cf512)

qa/workunits/nvmeof/basic_tests.sh: fix connect-all assert

There seems to be a change in the 'nvme list' json output
which caused failures in asserts after the 'nvme connect-all'
command.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 22f91cd)

qa/tasks/nvmeof: Add --refresh flag in do_checks() cmds

This is to ensure the latest state of the services is displayed.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 023c209)

qa: Add qa/suites/nvmeof/thrash/gateway-initiator-setup/2-subsys-8-namespace.yaml

This allows running the nvmeof thrasher test on smaller
configurations, which finish faster than the 120subsys-8ns
config.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d7551f7)

qa/tasks/nvmeof.py: Add stop_and_join method to thrasher

Also add nvme-gw show command output in do_checks()
and revive daemons with 'ceph orch daemon start' in
revive_daemon() method.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 0b0f450)

qa/workunits/nvmeof/fio_test.sh: fix fio filenames

Filenames were provided to fio as nvme1n1:nvme1n2;
they should be full paths (/dev/nvme1n1:/dev/nvme1n2).
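The fix amounts to prefixing each device with /dev/ before colon-joining them for fio's --filename argument. A minimal sketch (helper name is illustrative):

```python
def fio_filename_arg(devices):
    """Build fio's --filename value: full /dev paths joined by ':'."""
    return ":".join(f"/dev/{dev}" for dev in devices)

arg = fio_filename_arg(["nvme1n1", "nvme1n2"])
```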

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 06811a4)

qa/tasks/nvmeof.py: Do not use 'systemctl start' in thrasher

Instead use 'daemon start' in revive_daemon() to bring
up gateways thrashed with 'systemctl stop'.
This is because the 'systemctl start' method seems to have
temporary issues.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit b5e6a0c)

qa/tasks/nvmeof.py: make separate calls in do_checks()

When running the 'nvme list-subsys <device>' command
in do_checks(), make a separate call for each device
instead of combining them into one command with '&&'.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 5a58114)

qa/tasks/nvmeof.py: Fix do_checks() method

All checks currently run on the initiator node; now
run all "ceph" commands on one of the gateway hosts
instead, and run the "nvme list" and "nvme list-subsys"
checks on the initiator node.

Add retry (5 times) to do_checks if any command fails.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 7dfd3d3)

qa/tasks/nvmeof.py: Ignore systemctl_stop thrashing method

Do not use systemctl_stop method to thrash daemons,
just use 'ceph orch daemon stop' and 'ceph orch daemon rm'
methods to thrash nvmeof gateways.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit d4aec58)

qa/tasks/nvmeof.py: Add teardown() method

Add a teardown method to remove the nvmeof service
before the rest of the cluster tears down.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e8201d3)

qa/suites/nvmeof: Remove watchdog from thrasher

This commit does the following:
1. remove watchdog from thrasher
2. remove wait from fio_test
3. change thrasher switcher wait-time to 10 mins

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 76b4028)

monitoring: add NVMeoFMaxGatewayGroups

Add config NVMeoFMaxGatewayGroups to config.libsonnet
and set it to 4 (groups).

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit c5c4b10)

monitoring: add alert NVMeoFMaxGatewayGroups

Add alert NVMeoFMaxGatewayGroups to prometheus_alerts.yml
and prometheus_alerts.libsonnet.

This alert indicates that the max number of NVMeoF gateway
groups has been reached in a cluster.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit ab4a1dd)

monitoring: add tests for NVMeoFMaxGatewayGroups

Add unit tests for alert NVMeoFMaxGatewayGroups
in monitoring/ceph-mixin/tests_alerts/test_alerts.yml

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e5cb5db)

qa/suites/nvmeof: use SCALING_DELAYS: '120'

Increase delays for qa/workunits/nvmeof/scalability_test.sh
as namespace rebalancing takes more time. After upscaling, a
gateway may initially be in 'CREATED', which is a valid state during
gateway initialization, but the state should then progress
to 'AVAILABLE' within a couple of seconds.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3b9b290)

qa/workunits/nvmeof/fio_test: Log cluster status if fio fails

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e450406)

qa/suites/nvmeof: add more asserts to scalability_test

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 877c726)

qa/suites/nvmeof: Run fio with scalability test

Run fio in parallel with scalability test.

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e2f3bed)

qa/workunits/nvmeof/fio_test.sh: add more debug commands

Add more commands to debug when fio fails:
- nvme list-subsys /dev/nvme1n2
- nvme list from the initiator
- nvme list | wc -l
- nvme id-ns /dev/nvme1n2

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit fd8fbea)

mon: Add nvmeof group/gateway name in  "ceph -s"

In "ceph status" command output, show gateway
group names and gateway names.

Before:
```
  services:
    mon:    4 daemons, quorum ceph-nvme-vm8,ceph-nvme-vm1,ceph-nvme-vm7,ceph-nvme-vm6 (age 71m)
    mgr:    ceph-nvme-vm8.tgytdq(active, since 73m), standbys: ceph-nvme-vm6.tequqo, ceph-nvme-vm1.pxrofr, ceph-nvme-vm7.lbxrea
    osd:    4 osds: 4 up (since 70m), 4 in (since 70m)
    nvmeof: 4 gateways active (4 hosts)
```

After:
```
  services:
    mon:               4 daemons, quorum ceph-nvme-vm14,ceph-nvme-vm11,ceph-nvme-vm13,ceph-nvme-vm12 (age 17m)
    mgr:               ceph-nvme-vm14.gjjgvq(active, since 19m), standbys: ceph-nvme-vm12.shbvpw, ceph-nvme-vm11.gucgiu, ceph-nvme-vm13.inzizw
    osd:               4 osds: 4 up (since 15m), 4 in (since 16m)
    nvmeof (mygroup1) : 2 gateways active (ceph-nvme-vm13.azfdpk, ceph-nvme-vm14.hdsoxl)
    nvmeof (mygroup2) : 2 gateways active (ceph-nvme-vm11.hnooxs, ceph-nvme-vm12.wcjcjs)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit e3fab2a)

mon: show count of active/total nvmeof gws in "ceph -s"

Improve "ceph status" output for nvmeof service:

1. Group by service_id (<pool>.<group>) instead of
  just by gateway groups.
2. Show total gateway count from NVMeofGwMap, and count
  of active gateways.

New output:
```
  services:
    mon:                     4 daemons, quorum ceph-nvme-vm31,ceph-nvme-vm28,ceph-nvme-vm30,ceph-nvme-vm29 (age 16m)
    mgr:                     ceph-nvme-vm31.wnfclf(active, since 18m), standbys: ceph-nvme-vm29.iuwqin, ceph-nvme-vm28.lnnyui, ceph-nvme-vm30.fitwnw
    osd:                     4 osds: 4 up (since 14m), 4 in (since 15m)
    nvmeof (mypool.mygroup1): 2 gateways: 1 active (ceph-nvme-vm30.kkcfux)
    nvmeof (mypool.mygroup2): 2 gateways: 2 active (ceph-nvme-vm28.mfqucr, ceph-nvme-vm29.hrizzl)
```

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 3065ffe)

monitoring: fix NVMeoFSubsystemNamespaceLimit

Alert is not triggered as expected, change the query
to fix that.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2282348

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
(cherry picked from commit 4a7866a)

mgr/cephadm: set service name for DaemonDescription object used during daemon removal

What this is specifically fixing is that the nvmeof post_remove function
needs the service spec of the daemon's service to get the pool and group
tied to the nvmeof daemon. We have been using the DaemonDescription
"service_name" property to get the service name in order to get the spec.
This works in a regular deployment. However, it is possible to make a placement
like

placement:
  hosts:
  - vm-00=nvmeof.a
  - vm-01=nvmeof.b

and one of the nvmeof CI tests was doing so, which is why we saw this.
That will cause the nvmeof daemon names to be nvmeof.nvmeof.a and
nvmeof.nvmeof.b and not include the service name at all. In this
case, the service_name property on the DaemonDescription class
will end up getting service names nvmeof.nvmeof.a and nvmeof.nvmeof.b
respectively from the nvmeof daemons, which will cause us to fail
to find the spec in post_remove. This change makes it so we manually set
the service name for the DaemonDescription object that gets passed
to post_remove based on the service name of the daemon object we
get from the host cache, which will still have the correct service
name even if the daemon has a custom name. Then the nvmeof post_remove
function will get the correct service name and be able to find the spec.

Additionally, we are now technically taking the daemon type
and id from the DaemonDescription in our HostCache as well, but
this is mostly just for consistency and should have no real impact.
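The naming collision described above can be illustrated with a toy sketch. This only models the one fact stated in the text (cephadm prefixes the daemon type, so a custom name 'nvmeof.a' yields daemon 'nvmeof.nvmeof.a'); the real naming and service_name logic is more involved:

```python
def daemon_full_name(daemon_type, custom_name):
    """cephadm prefixes the daemon type onto the (custom) daemon
    name, so placement entry 'vm-00=nvmeof.a' produces the daemon
    name 'nvmeof.nvmeof.a'."""
    return f"{daemon_type}.{custom_name}"

name = daemon_full_name("nvmeof", "nvmeof.a")
service = "nvmeof.mypool.mygroup"
# The actual service name appears nowhere in the daemon name, so it
# cannot be re-derived from the name alone -- hence carrying it over
# from the host cache instead.
assert service not in name
```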

Fixes: https://tracker.ceph.com/issues/68962

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit d8dae24)

Add multi-cluster support (showMultiCluster=True) to alerts

Following PR ceph/ceph#55495 fixing the
dashboard in regards to multiple clusters storing their metrics
in a single Prometheus instance, this PR addresses the issues
for alerts.

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
(cherry picked from commit 810c706)
