mds/cache: defer trim() until no MDS is rejoining#57124
mds/cache: defer trim() until no MDS is rejoining#57124
Conversation
|
This is a new fix after #52648. |
leonid-s-usov
left a comment
There was a problem hiding this comment.
It's a little concerning that the rejoin_done callback is complete before we receive all rejoin acks. Is that a deliberate design? If so, I'd consider a better name for rejoin_done, such that would imply that it's an early internal event not yet confirmed by the peers
However, this code suggests that rejoin_done shouldn't be called unless the ack set is empty:
if (rejoin_gather.empty() && // make sure we've gotten our FULL inodes, too.
rejoin_ack_gather.empty()) {
// finally, kickstart past snap parent opens
open_snaprealms();
} else {
dout(7) << "still need rejoin from (" << rejoin_gather << ")"
<< ", rejoin_ack from (" << rejoin_ack_gather << ")" << dendl;
}
I'm afraid that this issue will need further investigation to find and eliminate a race between the call to open_snaprealms(), which completes rejoin_done, and the upkeep of the rejoin_ack_gather set which happens in the context of mds map updates
No, the 1), when the local For the 1) case the So this PR is trying to fix the 2) case. IMO the
The
Please note the
Not
|
Sure, I understand, though the I'm trying to say that your change shouldn't be needed unless we have a race somewhere. The Other places in the code look at the |
Oh ok, this is interesting.. So, we may have the acks to await while |
Yeah, correct. |
Thanks, @lxbsz , I see that However, I'm still uncomfortable with testing for the |
If I got you here. In the 2) case it's none busy with the |
Let me have a look whether could we just check the |
|
@leonid-s-usov Updated it and please take a look. Thanks |
|
@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoid trimming when some MDSs are in up:rejoin. Has that been though over with this change? |
@vshankar Thanks for pointing this out. @batrick With this change the local MDS will stop trimming the MDCaches whenever any of the MDSs in the cluster is rejoining. This time I am using the |
Yes, this is the type of devilry I think we need to do everything to avoid. I did some looking back at the code history. Most of this surrounds changes Zheng made about 10 years ago: (2013!) and what appeared to be the intended fix: except the ffcbcdd#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R6304-R6308 which, if I understood correctly, should have resolved this problem: 8a1114c#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R4670-R4672 except clearly it doesn't. I'd be very careful about assuming the comment for that At this point, my suggestion is to revert #52648 and aggressively test this to get fresh logs concerning the original problem (sadly, the mds logs for tracker 62036 are garbage collected). |
I can confirm that this is exactly the case we hit, more detail please see the RCA in the tracker: https://tracker.ceph.com/issues/50821#note-18 Let me try to find a better approach for this.
|
… received" This reverts commit dd78380.
Just before the last cache_rejoin ack being received the entire subtree, together with the inode subtree root belongs to, were trimmed the isolated_inodes list couldn't be correctly erased. We should defer calling the trim() until the last cache_rejoin ack being received. Fixes: https://tracker.ceph.com/issues/50821 Signed-off-by: Xiubo Li <xiubli@redhat.com>
| trim_client_leases(); | ||
| } | ||
| if (is_ready_to_trim_cache() || mds->is_standby_replay()) { | ||
| if (is_open() || mds->is_standby_replay()) { |
There was a problem hiding this comment.
Under what condition will we reach this function and not be open?
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
jenkins retest this please |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
do we need to prevent MDS joins when in trimming state ? |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
Just before the last cache_rejoin ack being received the entire subtree, together with the inode subtree root belongs to, were trimmed the isolated_inodes list couldn't be correctly erased. We should defer calling the trim() until the last cache_rejoin ack being received.
The "rejoin_done" will be set when the local MDS is rejoining, but "rejoin_ack_gather" will be always set when the other MDSs are rejoining.
Fixes: https://tracker.ceph.com/issues/50821
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e