
mds/cache: defer trim() until no MDS is rejoining#57124

Closed
lxbsz wants to merge 2 commits into ceph:main from lxbsz:wip-50821

Conversation


@lxbsz lxbsz commented Apr 29, 2024

If the entire subtree, together with the inode that the subtree root belongs to, is trimmed just before the last cache_rejoin ack is received, the corresponding entries in the isolated_inodes list cannot be erased. We should defer calling trim() until the last cache_rejoin ack has been received.

"rejoin_done" will be set when the local MDS is rejoining, while "rejoin_ack_gather" will always be set when other MDSs are rejoining.

Fixes: https://tracker.ceph.com/issues/50821


@lxbsz lxbsz requested review from a team, batrick and vshankar April 29, 2024 04:04
@github-actions github-actions bot added the cephfs Ceph File System label Apr 29, 2024

lxbsz commented Apr 29, 2024

This is a new fix after #52648.

@leonid-s-usov leonid-s-usov changed the title mds: new fix to defer trim() until after the last cache_rejoin ack be… mds/cache: defer trim() until after the last cache_rejoin ack Apr 29, 2024
@leonid-s-usov leonid-s-usov left a comment

It's a little concerning that the rejoin_done callback completes before we receive all rejoin acks. Is that a deliberate design? If so, I'd consider a better name for rejoin_done, one that implies it's an early internal event not yet confirmed by the peers.

However, this code suggests that rejoin_done shouldn't be called unless the ack set is empty:

    if (rejoin_gather.empty() &&     // make sure we've gotten our FULL inodes, too.
	rejoin_ack_gather.empty()) {
      // finally, kickstart past snap parent opens
      open_snaprealms();
    } else {
      dout(7) << "still need rejoin from (" << rejoin_gather << ")"
	      << ", rejoin_ack from (" << rejoin_ack_gather << ")" << dendl;
    }

I'm afraid that this issue will need further investigation to find and eliminate a race between the call to open_snaprealms(), which completes rejoin_done, and the upkeep of the rejoin_ack_gather set which happens in the context of mds map updates


lxbsz commented Apr 29, 2024

It's a little concerning that the rejoin_done callback is complete before we receive all rejoin acks. Is that a deliberate design?

No, rejoin_done will always be cleared just after we receive all the rejoin acks. There are two cases in which we need to do the rejoin:

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

In case 1) rejoin_done will be non-empty, but there is no guarantee that rejoin_ack_gather is set; in case 2) rejoin_ack_gather will always be set instead.

So this PR is trying to fix the 2) case.

IMO rejoin_done is for case 1) only and means the current MDS has successfully joined the fs cluster.

If so, I'd consider a better name for rejoin_done, such that would imply that it's an early internal event not yet confirmed by the peers

rejoin_done isn't an internal event; it needs to wait for the peers' acks before it is cleared.

However, this code suggests that rejoin_done shouldn't be called unless the ack set is empty:

    if (rejoin_gather.empty() &&     // make sure we've gotten our FULL inodes, too.
	rejoin_ack_gather.empty()) {
      // finally, kickstart past snap parent opens
      open_snaprealms();
    } else {
      dout(7) << "still need rejoin from (" << rejoin_gather << ")"
	      << ", rejoin_ack from (" << rejoin_ack_gather << ")" << dendl;
    }

Please note the rejoin_done here means:

mds/MDCache.h:1359: std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

I'm afraid that this issue will need further investigation to find and eliminate a race between the call to open_snaprealms(), which completes rejoin_done, and the upkeep of the rejoin_ack_gather set which happens in the context of mds map updates

@leonid-s-usov

Please note the rejoin_done here means:

   mds/MDCache.h:1359:  std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

Sure, I understand, though the rejoin_done (MDSContext*) is calling MDSRank::rejoin_done, so they are close relatives.

I'm trying to say that your change shouldn't be needed unless we have a race somewhere. open_snaprealms() should only be called when rejoin_ack_gather is empty, and that's the only place where rejoin_done is reset.

Other places in the code look at the rejoin_done pointer state to reason about the state of the rejoin process, so I think we should find and eliminate the race rather than just patch is_ready_to_trim_cache.

@leonid-s-usov

IMO the rejoin_done is for the 1) case only and means the current MDS is successfully joined to the fs cluster.

Oh, OK, this is interesting. So we may have acks to await while rejoin_done is empty?

@lxbsz

lxbsz commented Apr 29, 2024

IMO the rejoin_done is for the 1) case only and means the current MDS is successfully joined to the fs cluster.

Oh, OK, this is interesting. So we may have acks to await while rejoin_done is empty?

Yeah, correct.

@leonid-s-usov

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

Thanks, @lxbsz , I see that rejoin_send_rejoins is used both if we rejoin and if anyone else rejoins; the difference will be that we only call rejoin_gather_finish() if we are rejoining.

  if (mds->is_rejoin() && rejoin_gather.empty()) {
    dout(10) << "nothing to rejoin" << dendl;
    rejoin_gather_finish();
  }

However, I'm still uncomfortable with testing for the rejoin_ack_gather in the can_trim method. Could it be that what we are looking for in that method is a stable mds map state? That would hide implementation details like !rejoin_done and rejoin_ack_gather.empty(), and work well for any other transient cluster state? If that's too much, we could just verify that no MDS in the map is rejoining.


lxbsz commented Apr 29, 2024

Please note the rejoin_done here means:

   mds/MDCache.h:1359:  std::unique_ptr<MDSContext> rejoin_done;

Not mds/MDSRank.h:563: void rejoin_done();.

Sure, I understand, though the rejoin_done (MDSContext*) is calling MDSRank::rejoin_done, so they are close relatives.

I'm trying to say that your change shouldn't be needed unless we have a race somewhere. open_snaprealms() should only be called when rejoin_ack_gather is empty, and that's the only place where rejoin_done is reset.

Other places in the code look at the rejoin_done pointer state to reason about the state of the rejoin process, so I think we should find and eliminate the race rather than just patch is_ready_to_trim_cache.

If I understand you correctly: in case 2) nothing is busy with rejoin_done and it will always be null. Case 2) happens while the local MDS is in up:active state and other MDSs are rejoining. When the local MDS detects this from the mdsmap it will call rejoin_joint_start(). Then rejoin_ack_gather will be set, and until all the acks have been received trim() should be skipped.


lxbsz commented Apr 29, 2024

1), when the local MDS is in up:rejoin state
2), when any other MDS is in up:rejoin state

Thanks, @lxbsz , I see that rejoin_send_rejoins is used both if we rejoin and if anyone else rejoins; the difference will be that we only call rejoin_gather_finish() if we are rejoining.

  if (mds->is_rejoin() && rejoin_gather.empty()) {
    dout(10) << "nothing to rejoin" << dendl;
    rejoin_gather_finish();
  }

However, I'm still uncomfortable with testing for the rejoin_ack_gather in the can_trim method. Could it be that what we are looking for in that method is a stable mds map state? That would hide implementation details like !rejoin_done and rejoin_ack_gather.empty(), and work well for any other transient cluster state? If that's too much, we could just verify that no MDS in the map is rejoining.

Let me have a look at whether we can just check the mdsmap instead; if we can, that sounds like a better approach. Thanks @leonid-s-usov


lxbsz commented Apr 30, 2024

@leonid-s-usov Updated it; please take a look. Thanks

@leonid-s-usov leonid-s-usov left a comment

Looks great!

@leonid-s-usov leonid-s-usov changed the title mds/cache: defer trim() until after the last cache_rejoin ack mds/cache: defer trim() until no MDS is rejoining Apr 30, 2024

vshankar commented May 8, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?


lxbsz commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.

@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?


batrick commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.

@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?

Yes, this is the type of devilry I think we need to do everything to avoid. I did some looking back at the code history. Most of this surrounds changes Zheng made about 10 years ago:

8a1114c

(2013!) and what appeared to be the intended fix:

ffcbcdd

except the FIXME was never resolved. In particular I'm interested in:

ffcbcdd#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R6304-R6308

which, if I understood correctly, should have resolved this problem:

8a1114c#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R4670-R4672

except clearly it doesn't.

I'd be very careful about assuming the comment for that FIXME is completely accurate. There may be other cases we haven't considered.

At this point, my suggestion is to revert #52648 and aggressively test this to get fresh logs concerning the original problem (sadly, the mds logs for tracker 62036 are garbage collected).


lxbsz commented May 9, 2024

@lxbsz I recall from #52648 (comment) that @batrick was a bit against avoiding trimming when some MDSs are in up:rejoin. Has that been thought over with this change?

@vshankar Thanks for pointing this out.
@batrick With this change the local MDS will stop trimming the MDCache whenever any of the MDSs in the cluster is rejoining. This time I am using mdsmap->is_rejoining instead of mds->is_cluster_degraded(). Does it still hit the same issue you pointed out?

Yes, this is the type of devilry I think we need to do everything to avoid. I did some looking back at the code history. Most of this surrounds changes Zheng made about 10 years ago:

8a1114c

(2013!) and what appeared to be the intended fix:

ffcbcdd

except the FIXME was never resolved. In particular I'm interested in:

ffcbcdd#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R6304-R6308

which, if I understood correctly, should have resolved this problem:

8a1114c#diff-ea7cb1a6ba9fa08363b14dd00a86bc6b79e01673e93af84ffdfcdbd0d3f26b19R4670-R4672

except clearly it doesn't.

I'd be very careful about assuming the comment for that FIXME is completely accurate. There may be other cases we haven't considered.

I can confirm that this is exactly the case we hit; for more detail please see the RCA in the tracker:

https://tracker.ceph.com/issues/50821#note-18

Let me try to find a better approach for this.

At this point, my suggestion is to revert #52648 and aggressively test this to get fresh logs concerning the original problem (sadly, the mds logs for tracker 62036 are garbage collected).

lxbsz added 2 commits May 9, 2024 12:28
If the entire subtree, together with the inode that the subtree
root belongs to, is trimmed just before the last cache_rejoin
ack is received, the corresponding entries in the isolated_inodes
list cannot be erased. We should defer calling trim() until the
last cache_rejoin ack has been received.

Fixes: https://tracker.ceph.com/issues/50821
Signed-off-by: Xiubo Li <xiubli@redhat.com>
    trim_client_leases();
    }
    if (is_ready_to_trim_cache() || mds->is_standby_replay()) {
    if (is_open() || mds->is_standby_replay()) {
Member

Under what condition will we reach this function and not be open?

Member Author

This was added by @batrick in #48483. I went through the code and found that open == true only when the MDS is in up:active state.


github-actions bot commented Jul 9, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jul 9, 2024

lxbsz commented Jul 25, 2024

jenkins retest this please

@github-actions github-actions bot removed the stale label Jul 25, 2024

@github-actions github-actions bot added the stale label Sep 23, 2024
@mchangir

Do we need to prevent MDS joins when in trimming state?

@github-actions github-actions bot removed the stale label Sep 23, 2024

@github-actions github-actions bot added the stale label Nov 22, 2024

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 22, 2024

Labels

cephfs Ceph File System stale
