Skip to content

mds: make num_fwd and num_retry to __u32#45669

Merged
vshankar merged 4 commits intomainfrom
header
Feb 22, 2023
Merged

mds: make num_fwd and num_retry to __u32#45669
vshankar merged 4 commits intomainfrom
header

Conversation

@lxbsz
Copy link
Member

@lxbsz lxbsz commented Mar 28, 2022

The num_fwd in MClientRequestForward is int32_t, while the num_fwd
in ceph_mds_request_head is __u8. This is buggy when the num_fwd
is larger than 256 it will always be truncate to 0 again. But the
client couldn't recoginize this.

Fixes: https://tracker.ceph.com/issues/57854
Signed-off-by: Xiubo Li xiubli@redhat.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@lxbsz lxbsz requested a review from a team March 28, 2022 12:22
@github-actions github-actions bot added the cephfs Ceph File System label Mar 28, 2022
@lxbsz
Copy link
Member Author

lxbsz commented Mar 28, 2022

IMO we need to limit the total number of the num_fwd instead of infinitely looping like in #45073.

@lxbsz lxbsz requested a review from gregsfortytwo March 28, 2022 14:17
@lxbsz lxbsz marked this pull request as draft March 30, 2022 06:58
@lxbsz
Copy link
Member Author

lxbsz commented Apr 6, 2022

Have fixed this in #45728 and #45688 to hard limit the retry times to 256 as a workaround currently.

While we should fix this anyway, because that workaround is ugly.

@djgalloway djgalloway changed the base branch from master to main May 25, 2022 20:01
@lxbsz lxbsz marked this pull request as ready for review July 5, 2022 08:34
@lxbsz lxbsz force-pushed the header branch 3 times, most recently from 6a0669b to f465af3 Compare July 6, 2022 02:33
@lxbsz lxbsz requested a review from a team July 6, 2022 02:34
@kotreshhr
Copy link
Contributor

Just curious to understand, you mentioned that this issue being fixed is easily seen with the PR #45073 (*#45073 (comment))**. The num_fwd is per request right? Does a single create request caused this to overflow ?

@lxbsz
Copy link
Member Author

lxbsz commented Jul 6, 2022

Just curious to understand, you mentioned that this issue being fixed is easily seen with the PR #45073 (*#45073 (comment))**. The num_fwd is per request right?

Correct.

Does a single create request caused this to overflow ?

Yeah, it is. A single create request was bouncing between mds.0 and mds.1 infinitely and then overflowed.

@lxbsz
Copy link
Member Author

lxbsz commented Jul 6, 2022

jenkins test make check

@kotreshhr
Copy link
Contributor

Just curious to understand, you mentioned that this issue being fixed is easily seen with the PR #45073 (*#45073 (comment))**. The num_fwd is per request right?

Correct.

Does a single create request caused this to overflow ?

Yeah, it is. A single create request was bouncing between mds.0 and mds.1 infinitely and then overflowed.

If the CInode and CDir of the directory are on different mdses, I am failing to understand why it should take more than 256 hops? Is this some corner case that is being hit here ?

@kotreshhr
Copy link
Contributor

The PR and the commits miss the tracker id. Doesn't this need backport ?

@lxbsz
Copy link
Member Author

lxbsz commented Jul 7, 2022

Just curious to understand, you mentioned that this issue being fixed is easily seen with the PR #45073 (*#45073 (comment))**. The num_fwd is per request right?

Correct.

Does a single create request caused this to overflow ?

Yeah, it is. A single create request was bouncing between mds.0 and mds.1 infinitely and then overflowed.

If the CInode and CDir of the directory are on different mdses, I am failing to understand why it should take more than 256 hops? Is this some corner case that is being hit here ?

As mentioned in PR #45073, this bug was found from @luis-henrix previous patches, which has been fixed after that. Except that as I know there is another case that when the pool is full, it will forward the create requests to the CInode auth MDS which could cause the requests hopping, maybe there are some others cases we haven't figured out yet which also could cause this.

For the retry attempt one, it should be triggered by the clients instead, such as when reconnecting.

Anyway the type mismatch issue should be fixed, which introduced confusion in the mds logs.

@lxbsz
Copy link
Member Author

lxbsz commented Jul 7, 2022

The PR and the commits miss the tracker id. Doesn't this need backport ?

IMO it should be okay, since the client side fixes have been backported.

Copy link
Contributor

@kotreshhr kotreshhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me otherwise

@gregsfortytwo
Copy link
Member

@lxbsz Can you talk about how you made sure these encoding changes maintain compatibility with old clients? I see you're changing the ceph_mds_request_head from a WRITE_RAW_ENCODER to hand-rolled encode/decode functions, and it's possible to do that correctly but makes me nervous — we need to be very careful with it.

I also note that it looks like the changes mean the MDS simply ignores the "classic" u8 num_fwd, which seems like it will break old clients if the MDS forwards requests around?

@lxbsz
Copy link
Member Author

lxbsz commented Jul 8, 2022

@lxbsz Can you talk about how you made sure these encoding changes maintain compatibility with old clients? I see you're changing the ceph_mds_request_head from a WRITE_RAW_ENCODER to hand-rolled encode/decode functions, and it's possible to do that correctly but makes me nervous — we need to be very careful with it.

In the MDS side it will only maintain ext_num_retry and increase the new ext_num_fwd counters. And at the same time will assign them to the old num_retry/num_fwd counters, which maybe overflowed.

The old clients are still using the old num_retry/num_fwd counters. For the very old clients who haven't contained #45728 and #45688 fixes yet, the bugs still exists and we cannot avoid it, otherwise with #45728 and #45688 the clients could stop retrying/forwarding the requests if exceeding 256 times.

I also note that it looks like the changes mean the MDS simply ignores the "classic" u8 num_fwd, which seems like it will break old clients if the MDS forwards requests around?

In the MDS side it will only maintain ext_num_retry and increase the new ext_num_fwd counters, and will pass them back to clients via the old counters at the same time.

NOTE: the retry counter will be increased by the clients.

As above mentioned, for the old clients they will continue parsing and using the old counters and in the MDS side the new counters are just two relay stations, receiving the values from clients and saving them and then passing them back via the old counters.

====

Okay, I found bug:

In old clients when parsing the new struct ceph_mds_request_head they will just copy it. And will left 8 bytes memories to the next decoder. For this IMO we should also test the CEPHFS_FEATURE_32BITS_RETRY_FWD feature bit to distinguish the old clients.

@vshankar
Copy link
Contributor

This is good to merge. @lxbsz please rebase.

Prepare switching num_retry and num_fwd to __le32.

Fixes: https://tracker.ceph.com/issues/57854
Signed-off-by: Xiubo Li <xiubli@redhat.com>
To make sure both the num_retry and num_fwd in all the structs have
the same type, which is 32 bits.

And also for old cephs and clients they still could use the 8 bits
members.

Fixes: https://tracker.ceph.com/issues/57854
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Check the CEPHFS_FEATURE_32BITS_RETRY_FWD feature bit and if not
set, that means it's connecting to an old MDS and will limit the
max retry to 256 times.

Fixes: https://tracker.ceph.com/issues/57854
Signed-off-by: Xiubo Li <xiubli@redhat.com>
The MDS will increase the forward count, and if the forward count
is less than the one saved in request in client side, that means
the MDS is old version and it was overflowed. Then just stop
forwarding.

Fixes: https://tracker.ceph.com/issues/57854
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@lxbsz
Copy link
Member Author

lxbsz commented Feb 16, 2023

This is good to merge. @lxbsz please rebase.

Done. The conflicts were introduced by one old commit, which is doing -EMULTIHOP --> -CEPHFS_EMULTIHOP

@vshankar vshankar dismissed stale reviews from gregsfortytwo and jtlayton February 22, 2023 09:12

Comments were addressed.

@vshankar vshankar merged commit 8ce3185 into main Feb 22, 2023
@vshankar vshankar deleted the header branch February 22, 2023 09:13
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Oct 23, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Oct 23, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Oct 24, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Nov 8, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Nov 8, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Nov 8, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
mihalicyn added a commit to mihalicyn/ceph that referenced this pull request Nov 8, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
dmick pushed a commit to ceph/ceph-ci that referenced this pull request Nov 21, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph/ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
yuriw pushed a commit to ceph/ceph-ci that referenced this pull request Nov 30, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph/ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
(cherry picked from commit 312bb5b)
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 6, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 26, 2023
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Jan 2, 2024
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
rishabh-d-dave pushed a commit to rishabh-d-dave/ceph that referenced this pull request Jan 29, 2024
…quest_head_legacy

When a client is too old and uses struct ceph_mds_request_head_legacy we must
fill new ext_num_retry and ext_num_fwd fields from an old num_retry and num_fwd.

Fixes: ceph#45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e30 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants