Skip to content

client: check mds down status before getting mds_gid_t from mdsmap#52258

Merged
vshankar merged 2 commits intoceph:mainfrom
YiteGu:check-mds-down-while-get-from-up
Feb 28, 2024
Merged

client: check mds down status before getting mds_gid_t from mdsmap#52258
vshankar merged 2 commits intoceph:mainfrom
YiteGu:check-mds-down-while-get-from-up

Conversation

@YiteGu
Copy link
Member

@YiteGu YiteGu commented Jun 29, 2023

Get mds_gid_t from up of MDSMap, will cause to core dump if target mds_rank_t does not exist up map:

from: src/mds/MDSMap.h
const auto& get_info(mds_rank_t m) const {
return mds_info.at(up.at(m));
}

reproduct:

  1. ceph fs fail <fs_name>
  2. curl -X GET "example.com:8080/api/cephfs/1/clients" ...
    up.at(m) will cause to core dump.

Fixes: https://tracker.ceph.com/issues/61844
Singed-off-by: Yite Gu yitegu0@gmail.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the cephfs Ceph File System label Jun 29, 2023
@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 2df373b to 812f910 Compare June 29, 2023 13:57
@YiteGu
Copy link
Member Author

YiteGu commented Jun 29, 2023

Hi @batrick have a code review

if (fsmap->get_filesystem(role.fscid)->mds_map.is_down(role.rank)) {
lderr(cct) << __func__ << ": targets rank: " << role.rank
<< " is down" << dendl;
return -CEPHFS_ENOENT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better return value would be -CEPHFS_EPERM as we want to say "resolving mds is not allowed when the mds is down".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else other changes looks good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@dparmar18
Copy link
Contributor

Since you have steps to reproduce this; can we have a test case?

@dparmar18 dparmar18 requested a review from a team June 30, 2023 07:18
@dparmar18 dparmar18 changed the title Client: Check mds down status bofore get mds_git_t from up Client: Check mds down status before getting mds_git_t from mds up map Jun 30, 2023
@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 812f910 to 17f037e Compare July 1, 2023 17:46
@YiteGu
Copy link
Member Author

YiteGu commented Jul 1, 2023

@dparmar18 Okay, but I lack experience of write test cases, could you provide me an example?

@dparmar18
Copy link
Contributor

@dparmar18 Okay, but I lack experience of write test cases, could you provide me an example?

I'll check and see if i can find one because (at least) I haven't came across any test cases where a restful API is used to check MDS.

@YiteGu
Copy link
Member Author

YiteGu commented Jul 3, 2023

@dparmar18 Okay, but I lack experience of write test cases, could you provide me an example?

I'll check and see if i can find one because (at least) I haven't came across any test cases where a restful API is used to check MDS.

You could directly write a test case to associate this fix, I am looking forward to it very much :)

@dparmar18
Copy link
Contributor

can someone from @ceph/dashboard help identify how and where to add a test case for this since it involves querying dashboard

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention is to use lower case in the commit message component/title. Please change to: client: check mds down status bofore getting mds_gid_t from mdsmap

@batrick
Copy link
Member

batrick commented Jul 4, 2023

You could write a test:

ceph fs set cephfs max_mds 2
ceph fs set cephfs joinable false
ceph mds fail cephfs:1
ceph tell mds.cephfs:1 session ls # would abort

@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 17f037e to e017e62 Compare July 5, 2023 01:40
@YiteGu YiteGu changed the title Client: Check mds down status before getting mds_git_t from mds up map client: check mds down status before getting mds_git_t from mdsmap Jul 5, 2023
@dparmar18
Copy link
Contributor

You could write a test:

ceph fs set cephfs max_mds 2
ceph fs set cephfs joinable false
ceph mds fail cephfs:1
ceph tell mds.cephfs:1 session ls # would abort

I think @YiteGu can follow the steps he mentioned in tracker and add a api test in https://github.com/ceph/ceph/blob/main/qa/tasks/mgr/dashboard/test_cephfs.py, it would be as simple as

self.fs.fail()
request mds status using GET - self._get()

thanks to @nizamial09 for helping find the right place.

@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from e017e62 to e62e690 Compare July 5, 2023 14:00
@YiteGu YiteGu requested a review from a team as a code owner July 5, 2023 14:00
@YiteGu YiteGu requested review from avanthakkar and cloudbehl and removed request for a team July 5, 2023 14:00
@YiteGu YiteGu requested review from batrick and dparmar18 July 5, 2023 14:10
@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch 2 times, most recently from 2ab9f59 to d3de04c Compare July 5, 2023 16:53
@github-actions
Copy link

github-actions bot commented Feb 2, 2024

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

…fails

Fixes: https://tracker.ceph.com/issues/64089
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit 564dba3)
@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 746917a to 67378bd Compare February 2, 2024 10:48
@YiteGu
Copy link
Member Author

YiteGu commented Feb 2, 2024

too-many-lines

for now you can put # pylint: disable=too-many-lines on top of the cephfs.py to disable it

# -*- coding: utf-8 -*-
# pylint: disable=too-many-lines

make check done

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/vshankar-2024-02-17_05:57:56-fs-wip-vshankar-testing-20240217.015652-testing-default-smithi/

This run still has quay fetch failures

Command failed on smithi155 with status 1: 'sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1c920afa8316ea21877f3186850254947531ea22 pull'

I'll file a tracker for that and merge this batch.

vshankar added a commit to vshankar/ceph that referenced this pull request Feb 21, 2024
* refs/pull/52258/head:
	client: check mds down status bofore getting mds_git_t from mdsmap
	mgr/dashboard: allow sending back error status code fetching clients fails

Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Copy link
Contributor

@rishabh-d-dave rishabh-d-dave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YiteGu Since the PR has been tested and approved already, feel free to skip ignore thi nits.

@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 67378bd to 177491e Compare February 22, 2024 17:33
@YiteGu
Copy link
Member Author

YiteGu commented Feb 22, 2024

@YiteGu Since the PR has been tested and approved already, feel free to skip ignore thi nits.

done, thanks. :)

@vshankar
Copy link
Contributor

https://pulpito.ceph.com/vshankar-2024-02-17_05:57:56-fs-wip-vshankar-testing-20240217.015652-testing-default-smithi/

This run still has quay fetch failures

Command failed on smithi155 with status 1: 'sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:1c920afa8316ea21877f3186850254947531ea22 pull'

I'll file a tracker for that and merge this batch.

Unfortunately it was risky to merge the change without a proper run. The fix for this is under test.

@vshankar
Copy link
Contributor

@mchangir
Copy link
Contributor

@YiteGu nit: I think the typedef is named mds_gid_t and not mds_git_t
please update the:

  • PR title
  • PR Description
  • Commit title
  • Commit message
    appropriately

Get mds_gid_t from up of MDSMap, will cause to core dump if target
mds_rank_t does not exist up map:

from: src/mds/MDSMap.h
  const auto& get_info(mds_rank_t m) const {
    return mds_info.at(up.at(m));
  }

reproduct:
1. ceph fs fail <fs_name>
2. curl -X GET "example.com:8080/api/cephfs/1/clients" ...
up.at(m) will cause to core dump.

Fixes: https://tracker.ceph.com/issues/61844
Signed-off-by: Yite Gu <yitegu0@gmail.com>
@YiteGu YiteGu force-pushed the check-mds-down-while-get-from-up branch from 177491e to 878463e Compare February 26, 2024 11:55
@YiteGu YiteGu changed the title client: check mds down status before getting mds_git_t from mdsmap client: check mds down status before getting mds_gid_t from mdsmap Feb 26, 2024
@YiteGu
Copy link
Member Author

YiteGu commented Feb 26, 2024

@YiteGu nit: I think the typedef is named mds_gid_t and not mds_git_t please update the:

  • PR title
  • PR Description
  • Commit title
  • Commit message
    appropriately

done

Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar
Copy link
Contributor

jenkins test dashboard

@vshankar
Copy link
Contributor

@YiteGu Could you please get the failing dashboard jenkins test resolved. Its not a mandatory check to merge a change, however I'd like to be certain that this change does not break dashboard tests.

@YiteGu
Copy link
Member Author

YiteGu commented Feb 27, 2024

@YiteGu Could you please get the failing dashboard jenkins test resolved. Its not a mandatory check to merge a change, however I'd like to be certain that this change does not break dashboard tests.

I remember @nizamial09 said we could ignore the ceph dashboard tests and ceph dashboard cephadm e2e tests, pls help our check this problem again.

@nizamial09
Copy link
Member

nizamial09 commented Feb 27, 2024

we could ignore the ceph dashboard tests and ceph dashboard cephadm e2e tests,

I raised a fix for the dashboard e2e: #55771

and as for the cephadm e2e, its mostly grafana related failures which will be gone when we have a new grafana image or it will go away when this PR gets merged (which removes the need of building grafana image whenever new grafana dashboards are delivered,.

This PR is not affected by the dashboard or dashboard cephadm e2es

@YiteGu
Copy link
Member Author

YiteGu commented Feb 27, 2024

@vshankar we could continue now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

8 participants