
mds: revert standby-replay trimming changes #48483

Merged
vshankar merged 4 commits into ceph:main from batrick:i48673 on Nov 29, 2023

Conversation

@batrick
Member

@batrick batrick commented Oct 13, 2022

I want to do some testing in Linode with this change-set to see if I can reproduce the issues with the up:rejoin phase coming from standby-replay [1], so I'm leaving this as a draft for now.

Once merged, I think this should also cook in main's QA for a month or so before we do any backports.

[1] https://tracker.ceph.com/issues/40213


@batrick batrick added the cephfs Ceph File System label Oct 13, 2022
@batrick
Member Author

batrick commented Oct 13, 2022

cc @lxbsz

@lxbsz
Member

lxbsz commented Oct 14, 2022

> cc @lxbsz

This looks reasonable and good to me.

@batrick
Member Author

batrick commented Nov 11, 2022

jenkins test make check

@vshankar
Contributor

@batrick I see you added your testing tag - do you plan to run this through tests anytime soon, or should I pick it up?

@vshankar
Contributor

vshankar commented Feb 1, 2023

@batrick planning to run this through tests anytime soon?

@batrick
Member Author

batrick commented Feb 2, 2023

I still plan to... waiting for sepia to stabilize : /

@vshankar
Contributor

> I still plan to... waiting for sepia to stabilize : /

@batrick Do you plan to pick this up anytime soon?

@vshankar
Contributor

> I still plan to... waiting for sepia to stabilize : /

> @batrick Do you plan to pick this up anytime soon?

@batrick - any update on this?

batrick added 4 commits June 22, 2023 14:33

Revert "mds: do not trim the inodes from the lru list in standby_replay"
This reverts commit 79bb44c.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Revert "mds: trim cache during standby replay"
This reverts commit c0fe25b.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Fixes: 138fea6
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

standby-replay daemons were changed to keep minimal metadata from the journal
in cache but the original intent of standby-replay was to have a cache that's
as warm as the rank itself.  This reverts the two commits which changed that
behavior.

Part of the reason for this is that the new rapid cache trimming behavior was
not correct at all: the trimming loop would break as soon as it ran into a
dentry with non-null linkage, which would nearly always be the case. It was
thought that this was a problem introduced by [2], since
MDCache::standby_trim_segment has a different trim check [4] from the normal
trim path [3], but the original issue (tracker 48673) is as old as [1],
indicating the problem predates [2].

So, this commit reverts all of that. I have lingering suspicions that the
standby-replay daemon is not pinning some dentries properly, which causes [5],
but this did not show up unless the MDS was rapidly evicting dentries. More
research needs to be done there.

[1] c0fe25b
[2] 79bb44c
[3] https://github.com/ceph/ceph/blob/84fba097049ec4f72549588eaacc64f30c7a88a8/src/mds/MDCache.cc#L6816-L6820
[4] https://github.com/ceph/ceph/blob/84fba097049ec4f72549588eaacc64f30c7a88a8/src/mds/MDCache.cc#L7476-L7481
[5] https://tracker.ceph.com/issues/50246

Fixes: https://tracker.ceph.com/issues/48673
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@vshankar
Contributor

@ifed01 FYI - this is being run through the fs suite by @batrick

@vshankar
Contributor

@batrick Do you have an update on the tests?

@batrick
Member Author

batrick commented Jul 17, 2023

Not yet; I ran this through QA and will update when I have time to go through the results.

@batrick
Member Author

batrick commented Jul 21, 2023

jenkins test make check

@batrick
Member Author

batrick commented Jul 21, 2023

jenkins test make check arm64

@kotreshhr
Contributor

looks good

@rishabh-d-dave
Contributor

@batrick Do you want me to test this PR?

@batrick
Member Author

batrick commented Sep 11, 2023

> @batrick Do you want me to test this PR?

You may.

@vshankar
Contributor

> @batrick Do you want me to test this PR?

> You may.

On it this week.

@vshankar vshankar marked this pull request as ready for review November 7, 2023 04:22
@vshankar
Contributor

vshankar commented Nov 7, 2023

Tests running here - https://pulpito.ceph.com/vshankar-2023-11-07_06:04:04-fs-wip-vshankar-testing-20231107.042705-testing-default-smithi/

@vshankar
Contributor

Test run looks good. I'll prepare the run wiki and merge this soon. Nice work @batrick

@vshankar vshankar left a comment

@microyahoo

hi @vshankar, @batrick, would you plan to backport this PR to pacific? I have met this issue frequently.

@vshankar
Contributor

vshankar commented Apr 8, 2024

> hi @vshankar, @batrick, would you plan to backport this PR to pacific? I have met this issue frequently.

Hi @microyahoo - pacific is EOL'd. See: https://docs.ceph.com/en/latest/releases/pacific/#v16-2-15-pacific

You should consider upgrading.

@microyahoo

> hi @vshankar, @batrick, would you plan to backport this PR to pacific? I have met this issue frequently.

> Hi @microyahoo - pacific is EOL'd. See: https://docs.ceph.com/en/latest/releases/pacific/#v16-2-15-pacific
>
> You should consider upgrading.

hi @vshankar, thanks for the kind reminder, I will upgrade to the latest release.
