
mds/cache: defer trim() until no MDS is rejoining#57124

Closed
lxbsz wants to merge 2 commits into ceph:main from lxbsz:wip-50821

Conversation


@lxbsz lxbsz commented Apr 29, 2024

If the entire subtree, together with the inode that the subtree root belongs to, is trimmed just before the last cache_rejoin ack is received, the corresponding entries in the isolated_inodes list cannot be erased. We should defer calling trim() until the last cache_rejoin ack has been received.

"rejoin_done" will be set when the local MDS is rejoining, while "rejoin_ack_gather" will always be set when other MDSs are rejoining.

Fixes: https://tracker.ceph.com/issues/50821


@lxbsz lxbsz requested review from a team, batrick and vshankar April 29, 2024 04:04
@github-actions github-actions bot added the cephfs Ceph File System label Apr 29, 2024

lxbsz commented Apr 29, 2024

This is a new fix after #52648.

@leonid-s-usov leonid-s-usov changed the title mds: new fix to defer trim() until after the last cache_rejoin ack be… mds/cache: defer trim() until after the last cache_rejoin ack Apr 29, 2024
@leonid-s-usov leonid-s-usov left a comment

It's a little concerning that the rejoin_done callback completes before we receive all rejoin acks. Is that a deliberate design? If so, I'd consider a better name for rejoin_done, one that implies it's an early internal event not yet confirmed by the peers.

However, this code suggests that rejoin_done shouldn't be called unless the ack set is empty:

    if (rejoin_gather.empty() &&     // make sure we've gotten our FULL inodes, too.
	rejoin_ack_gather.empty()) {
      // finally, kickstart past snap parent opens
      open_snaprealms();
    } else {
      dout(7) << "still need rejoin from (" << rejoin_gather << ")"
	      << ", rejoin_ack from (" << rejoin_ack_gather << ")" << dendl;
    }

I'm afraid that this issue will need further investigation to find and eliminate a race between the call to open_snaprealms(), which completes rejoin_done, and the upkeep of the rejoin_ack_gather set which happens in the context of mds map updates


lxbsz commented Apr 29, 2024

It's a little concerning that the rejoin_done callback is complete before we receive all rejoin acks. Is that a deliberate design?

No, rejoin_done will always be cleared just after we receive all the rejoin acks. There are two cases in which we need to do the rejoin:

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

In case 1) rejoin_done will be non-empty, but there is no guarantee that rejoin_ack_gather is set; in case 2) rejoin_ack_gather will always be set instead.

So this PR is trying to fix the 2) case.

IMO rejoin_done is for case 1) only and means the current MDS has successfully joined the fs cluster.

If so, I'd consider a better name for rejoin_done, such that would imply that it's an early internal event not yet confirmed by the peers

rejoin_done isn't an internal event; it needs to wait for the peers' acks before it is cleared.

However, this code suggests that rejoin_done shouldn't be called unless the ack set is empty:

    if (rejoin_gather.empty() &&     // make sure we've gotten our FULL inodes, too.
	rejoin_ack_gather.empty()) {
      // finally, kickstart past snap parent opens
      open_snaprealms();
    } else {
      dout(7) << "still need rejoin from (" << rejoin_gather << ")"
	      << ", rejoin_ack from (" << rejoin_ack_gather << ")" << dendl;
    }

Please note the rejoin_done here means:

mds/MDCache.h:1359: std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

I'm afraid that this issue will need further investigation to find and eliminate a race between the call to open_snaprealms(), which completes rejoin_done, and the upkeep of the rejoin_ack_gather set which happens in the context of mds map updates

@leonid-s-usov

Please note the rejoin_done here means:

   mds/MDCache.h:1359:  std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

Sure, I understand, though the rejoin_done (MDSContext*) is calling MDSRank::rejoin_done, so they are close relatives.

I'm trying to say that your change shouldn't be needed unless we have a race somewhere. open_snaprealms() should only be called when rejoin_ack_gather is empty, and that's the only place where rejoin_done is reset.

Other places in the code look at the rejoin_done pointer state to reason about the state of the rejoin process, so I think we should find and eliminate the race rather than just patch is_ready_to_trim_cache.

@leonid-s-usov

IMO the rejoin_done is for the 1) case only and means the current MDS is successfully joined to the fs cluster.

Oh, OK, this is interesting. So we may have acks to await while rejoin_done is empty?

@lxbsz

lxbsz commented Apr 29, 2024

IMO the rejoin_done is for the 1) case only and means the current MDS is successfully joined to the fs cluster.

Oh, OK, this is interesting. So we may have acks to await while rejoin_done is empty?

Yeah, correct.

@leonid-s-usov

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

Thanks, @lxbsz , I see that rejoin_send_rejoins is used both if we rejoin and if anyone else rejoins; the difference will be that we only call rejoin_gather_finish() if we are rejoining.

  if (mds->is_rejoin() && rejoin_gather.empty()) {
    dout(10) << "nothing to rejoin" << dendl;
    rejoin_gather_finish();
  }

However, I'm still uncomfortable with testing for the rejoin_ack_gather in the can_trim method. Could it be that what we are looking for in that method is a stable mds map state? That would hide implementation details like !rejoin_done and rejoin_ack_gather.empty(), and work well for any other transient cluster state? If that's too much, we could just verify that no MDS in the map is rejoining.


lxbsz commented Apr 29, 2024

Please note the rejoin_done here means:

   mds/MDCache.h:1359:  std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

Sure, I understand, though the rejoin_done (MDSContext*) is calling MDSRank::rejoin_done, so they are close relatives.

I'm trying to say that your change shouldn't be needed unless we have a race somewhere. open_snaprealms() should only be called when rejoin_ack_gather is empty, and that's the only place where rejoin_done is reset.

Other places in the code look at the rejoin_done pointer state to reason about the state of the rejoin process, so I think we should find and eliminate the race rather than just patch is_ready_to_trim_cache.

If I understand you correctly: in case 2) nothing is busy with rejoin_done and it will always be null. Case 2) happens while the local MDS is in up:active state and other MDSs are rejoining. When the local MDS detects this from the mdsmap it will call rejoin_joint_start(). Then rejoin_ack_gather will be set, and until all the acks have been received trim() should be skipped.


lxbsz commented Apr 29, 2024

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

Thanks, @lxbsz , I see that rejoin_send_rejoins is used both if we rejoin and if anyone else rejoins; the difference will be that we only call rejoin_gather_finish() if we are rejoining.

  if (mds->is_rejoin() && rejoin_gather.empty()) {
    dout(10) << "nothing to rejoin" << dendl;
    rejoin_gather_finish();
  }

However, I'm still uncomfortable with testing for the rejoin_ack_gather in the can_trim method. Could it be that what we are looking for in that method is a stable mds map state? That would hide implementation details like !rejoin_done and rejoin_ack_gather.empty(), and work well for any other transient cluster state? If that's too much, we could just verify that no MDS in the map is rejoining.

Let me have a look at whether we can just check the mdsmap instead; if we can, that sounds like a better approach. Thanks @leonid-s-usov


lxbsz commented Apr 30, 2024

@leonid-s-usov Updated it; please take a look. Thanks

@leonid-s-usov leonid-s-usov left a comment

Looks great!

@leonid-s-usov leonid-s-usov changed the title mds/cache: defer trim() until after the last cache_rejoin ack mds/cache: defer trim() until no MDS is rejoining Apr 30, 2024

vshankar commented May 8, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?


lxbsz commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.

@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?


batrick commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.

@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?

Yes, this is the type of devilry I think we need to do everything to avoid. I did some looking back at the code history. Most of this surrounds changes Zheng made about 10 years ago:

8a1114c

(2013!) and what appeared to be the intended fix:

ffcbcdd

except the FIXME was never resolved. In particular I'm interested in:

ffcbcdd#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R6304-R6308

which, if I understood correctly, should have resolved this problem:

8a1114c#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R4670-R4672

except clearly it doesn't.

I'd be very careful about assuming the comment for that FIXME is completely accurate. There may be other cases we haven't considered.

At this point, my suggestion is to revert #52648 and aggressively test this to get fresh logs concerning the original problem (sadly, the mds logs for tracker 62036 are garbage collected).


lxbsz commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.
@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?

Yes, this is the type of devilry I think we need to do everything to avoid. I did some looking back at the code history. Most of this surrounds changes Zheng made about 10 years ago:

8a1114c

(2013!) and what appeared to be the intended fix:

ffcbcdd

except the FIXME was never resolved. In particular I'm interested in:

ffcbcdd#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R6304-R6308

which, if I understood correctly, should have resolved this problem:

8a1114c#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R4670-R4672

except clearly it doesn't.

I'd be very careful about assuming the comment for that FIXME is completely accurate. There may be other cases we haven't considered.

I can confirm that this is exactly the case we hit; for more detail please see the RCA in the tracker:

https://tracker.ceph.com/issues/50821#note-18

Let me try to find a better approach for this.

At this point, my suggestion is to revert #52648 and aggressively test this to get fresh logs concerning the original problem (sadly, the mds logs for tracker 62036 are garbage collected).

lxbsz added 2 commits May 9, 2024 12:28
If the entire subtree, together with the inode that the subtree
root belongs to, is trimmed just before the last cache_rejoin
ack is received, the corresponding entries in the isolated_inodes
list cannot be erased. We should defer calling trim() until the
last cache_rejoin ack has been received.

Fixes: https://tracker.ceph.com/issues/50821
Signed-off-by: Xiubo Li <xiubli@redhat.com>
    trim_client_leases();
    }
    if (is_ready_to_trim_cache() || mds->is_standby_replay()) {
    if (is_open() || mds->is_standby_replay()) {
Member

Under what condition will we reach this function and not be open?

Member Author

This was added by @batrick in #48483. I went through the code and found that open == true only when the MDS is in up:active state.


github-actions bot commented Jul 9, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jul 9, 2024

lxbsz commented Jul 25, 2024

jenkins retest this please

@github-actions github-actions bot removed the stale label Jul 25, 2024

@github-actions github-actions bot added the stale label Sep 23, 2024
@mchangir

Do we need to prevent MDS joins when in trimming state?

@github-actions github-actions bot removed the stale label Sep 23, 2024

@github-actions github-actions bot added the stale label Nov 22, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 22, 2024

Labels

cephfs Ceph File System stale
