mds: revert standby-replay trimming changes#48483
Conversation
|
cc @lxbsz |
This looks reasonable and good to me. |
|
jenkins test make check |
|
@batrick I see you added your testing tag - do you plan to run this through tests anytime soon or I can pick it up... |
|
@batrick planning to run this through tests anytime soon? |
|
I still plan to... waiting for sepia to stabilize : / |
@batrick Do you plan to pick this up anytime soon? |
Fixes: 138fea6 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Revert "mds: do not trim the inodes from the lru list in standby_replay" Revert "mds: trim cache during standby replay" This reverts commit 79bb44c. This reverts commit c0fe25b. standby-replay daemons were changed to keep minimal metadata from the journal in cache but the original intent of standby-replay was to have a cache that's as warm as the rank itself. This reverts the two commits which changed that behavior. Part of these reason for this is that the new rapid cache trimming behavior was not correct at all. The trimming loop would break when it runs into a dentry with non-null linkage. This would nearly always be the case. It was thought that this was a problem introduced by [2] as MDCache::standby_trim_segment has a different trim check [4] but the original issue (tracker 48673) is as old as [1], indicating the problem predates [2]. So, this commit reverts all of that. I have lingering suspicions that the standby-replay daemon is not pinning some dentries properly which causes [5] but this did not show up unless the MDS was rapidly evicting some dentries. More research needs done there. [1] c0fe25b [2] 79bb44c [3] https://github.com/ceph/ceph/blob/84fba097049ec4f72549588eaacc64f30c7a88a8/src/mds/MDCache.cc#L6816-L6820 [4] https://github.com/ceph/ceph/blob/84fba097049ec4f72549588eaacc64f30c7a88a8/src/mds/MDCache.cc#L7476-L7481 [5] https://tracker.ceph.com/issues/50246 Fixes: https://tracker.ceph.com/issues/48673 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
|
@batrick Do you have an update on the tests? |
|
Not yet, I ran this through QA. Will update when I have time to go through the results. |
|
jenkins test make check |
|
jenkins test make check arm64 |
|
looks good |
|
@batrick Do you want me to test this PR? |
You may. |
On it this week. |
|
Builds in progress - https://shaman.ceph.com/builds/ceph/wip-vshankar-testing-20231107.042705/ |
Test run looks good. I'll prepare the run wiki and merge this soon. Nice work @batrick |
Hi @microyahoo - pacific is EOL'd. See: https://docs.ceph.com/en/latest/releases/pacific/#v16-2-15-pacific You should consider upgrading. |
hi @vshankar, thanks for your kindly reminder, I will upgrade to latest. |
I want to do some testing in Linode with this change-set to see if I can reproduce the issues with up:rejoin phase coming from standby-replay [1]. So leaving this as a draft for now.
Once merged I think this should also cook in main's QA for a month or so before we do any backports.
[1] https://tracker.ceph.com/issues/40213
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows