Skip to content

mgr: store names of modules that register RADOS clients in the MgrMap#50065

Merged
ljflores merged 2 commits intoceph:mainfrom
ajarr:fix-58691
Mar 9, 2023
Merged

mgr: store names of modules that register RADOS clients in the MgrMap#50065
ljflores merged 2 commits intoceph:mainfrom
ajarr:fix-58691

Conversation

@ajarr
Copy link
Contributor

@ajarr ajarr commented Feb 10, 2023

The MgrMap stores a list of RADOS clients' addresses registered by the
mgr modules. During failover of ceph-mgr, the list is used to blocklist
clients belonging to the failed ceph-mgr.

Store the names of the mgr modules that registered the RADOS clients
along with the clients' addresses in the MgrMap. During debugging, this
allows easy identification of the mgr module that registered a
particular RADOS client by just dumping the MgrMap (ceph mgr dump).

Following is the MgrMap output with a module's client name displayed
along with its client addrvec,

$ ceph mgr dump | jq '.active_clients[0]'
{
  "name": "devicehealth",
  "addrvec": [
    {
      "type": "v2",
      "addr": "10.0.0.148:0",
      "nonce": 612376578
    }
  ]
}

Fixes: https://tracker.ceph.com/issues/58691
Signed-off-by: Ramana Raja rraja@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@ajarr ajarr requested a review from a team as a code owner February 10, 2023 02:33
@ajarr ajarr marked this pull request as draft February 10, 2023 02:35
@ajarr
Copy link
Contributor Author

ajarr commented Feb 10, 2023

This draft depends on #50006

@ajarr
Copy link
Contributor Author

ajarr commented Feb 10, 2023

Need to add a simple test and add a release note.

@ajarr
Copy link
Contributor Author

ajarr commented Feb 10, 2023

Need to add a simple test and add a release note.

Done.

@ajarr ajarr marked this pull request as ready for review February 10, 2023 16:25
@ajarr ajarr requested a review from idryomov February 10, 2023 18:47
@ajarr ajarr force-pushed the fix-58691 branch 2 times, most recently from 094d7f8 to c24cd86 Compare February 14, 2023 04:08
@ajarr ajarr requested a review from idryomov February 14, 2023 04:10
@ajarr ajarr requested a review from batrick February 14, 2023 15:37
@batrick
Copy link
Member

batrick commented Feb 15, 2023

@ajarr I'm on PTO. I'll look at this late next week.

@ajarr
Copy link
Contributor Author

ajarr commented Feb 15, 2023

@ajarr I'm on PTO. I'll look at this late next week.

Thanks for the update, @batrick .

@ajarr ajarr force-pushed the fix-58691 branch 2 times, most recently from 82e4b37 to f8b9299 Compare February 15, 2023 16:11
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise lgtm

@batrick
Copy link
Member

batrick commented Feb 27, 2023

@ajarr while I'm thinking of this now, I want to let you know that I'll be making some changes to this register client code in a fix for

https://tracker.ceph.com/issues/55606

I don't think there will be conflicts but just wanted to leave a note here so I don't forget.

@ajarr ajarr requested a review from idryomov February 28, 2023 03:03
@ajarr
Copy link
Contributor Author

ajarr commented Feb 28, 2023

@ajarr while I'm thinking of this now, I want to let you know that I'll be making some changes to this register client code in a fix for

https://tracker.ceph.com/issues/55606

I don't think there will be conflicts but just wanted to leave a note here so I don't forget.

@batrick thanks! I took a quick look at PyModuleRegistry changes in your draft.

@batrick
Copy link
Member

batrick commented Feb 28, 2023

#50291

for posterity

The MgrMap stores a list of RADOS clients' addresses registered by the
mgr modules. During failover of ceph-mgr, the list is used to blocklist
clients belonging to the failed ceph-mgr.

Store the names of the mgr modules that registered the RADOS clients
along with the clients' addresses in the MgrMap. During debugging, this
allows easy identification of the mgr module that registered a
particular RADOS client by just dumping the MgrMap (`ceph mgr dump`).

Following is the MgrMap output with a module's client name displayed
along with its client addrvec,
$ ceph mgr dump | jq '.active_clients[0]'
{
  "name": "devicehealth",
  "addrvec": [
    {
      "type": "v2",
      "addr": "10.0.0.148:0",
      "nonce": 612376578
    }
  ]
}

Fixes: https://tracker.ceph.com/issues/58691
Signed-off-by: Ramana Raja <rraja@redhat.com>
... that modify and access the data member 'clients' respectively.

Signed-off-by: Ramana Raja <rraja@redhat.com>
@idryomov
Copy link
Contributor

#50006 merged, please rebase

@ajarr
Copy link
Contributor Author

ajarr commented Feb 28, 2023

#50006 merged, please rebase

Done

@batrick
Copy link
Member

batrick commented Feb 28, 2023

@ajarr your PR is actually helpful for the test case I wrote for #50291. Thanks :)

Edit: I'll wait for this PR to be merged before QA'ing #50291.

@idryomov
Copy link
Contributor

idryomov commented Mar 1, 2023

@ajarr Did you test the old monitors/new manager case? The encoding changes are straightforward but I think it's worth doing that anyway.

@idryomov
Copy link
Contributor

idryomov commented Mar 1, 2023

jenkins test windows

@idryomov
Copy link
Contributor

idryomov commented Mar 1, 2023

@ajarr Did you test the old monitors/new manager case? The encoding changes are straightforward but I think it's worth doing that anyway.

I ran some basic tests. With a new monitor and an old manager, an empty name is reported (as expected):

$ ceph mgr dump | jq '.active_clients'
[
  {
    "name": "",
    "addrvec": [
      {
        "type": "v2",
        "addr": "127.0.0.1:0",
        "nonce": 165306198
      }
    ]
  }
]

After upgrading a manager:

$ ceph mgr dump | jq '.active_clients'
[
  {
    "name": "libcephsqlite",
    "addrvec": [
      {
        "type": "v2",
        "addr": "127.0.0.1:0",
        "nonce": 3026129408
      }
    ]
  }
]

An old monitor paired with a new manager appears to run fine too:

$ ceph mgr dump | jq '.active_clients'
[
  {
    "addrvec": [
      {
        "type": "v2",
        "addr": "127.0.0.1:0",
        "nonce": 3993337573
      }
    ]
  }
]

@ajarr
Copy link
Contributor Author

ajarr commented Mar 1, 2023

Thanks a lot, @idryomov !

@ajarr
Copy link
Contributor Author

ajarr commented Mar 6, 2023

@ljflores @rzarzynski can you please include this PR in the next batch of core qa test run?

@ljflores
Copy link
Member

ljflores commented Mar 7, 2023

@ajarr putting it in the wip-yuri4-testing batch

@ajarr
Copy link
Contributor Author

ajarr commented Mar 9, 2023

@ajarr putting it in the wip-yuri4-testing batch

@ljflores thanks!

@ljflores
Copy link
Member

ljflores commented Mar 9, 2023

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-03-08-1234

Failures, unrelated:
1. https://tracker.ceph.com/issues/58946
2. https://tracker.ceph.com/issues/58560
3. https://tracker.ceph.com/issues/58585
4. https://tracker.ceph.com/issues/55347
5. https://tracker.ceph.com/issues/49287

Details:
1. cephadm: KeyError: 'osdspec_affinity' - Ceph - Orchestrator
2. test_envlibrados_for_rocksdb.sh failed to subscribe to repo - Infrastructure
3. rook: failed to pull kubelet image - Ceph - Orchestrator
4. SELinux Denials during cephadm/workunits/test_cephadm - Ceph - Orchestrator
5. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants