mds: drop client metrics during recovery #57084

Merged
batrick merged 1 commit into ceph:main from batrick:i65660 on Jun 23, 2024
Conversation

@batrick
Member

@batrick batrick commented Apr 25, 2024

Fixes: https://tracker.ceph.com/issues/65660

Checklist

  • Tracker (select at least one)
    • References tracker ticket
  • Component impact
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • No doc update is appropriate
  • Tests (select at least one)
    • No tests

@batrick added the cephfs (Ceph File System) and needs-review labels Apr 25, 2024
@batrick requested a review from a team April 25, 2024 00:45
Fixes: https://tracker.ceph.com/issues/65660
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Member Author

batrick commented May 1, 2024

jenkins test make check

    {
      dout(1) << "active_start" << dendl;

      m_is_active = true;
Contributor

I don't see where you are resetting it. I'd suggest that you put the code at the end of handle_mds_map:

    m_is_active = is_active();

Member Author

resetting?

Member

IMO no need to reset it.

@batrick BTW, why not just check the MDS's state from the mdsmap instead of adding a new m_is_active? Is it for the lockless case?

Member Author

You need the mds_lock to look at the MDSMap. We don't want the metrics aggregator to be acquiring that lock generally.

Member

Makes sense.
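
The locking concern in this thread can be illustrated with a minimal sketch. It is not the code from this PR: `Rank`, `MetricsSink`, `big_lock`, `is_active_lockless()`, and `handle_client_metrics()` are hypothetical stand-ins, and only the names `m_is_active` and `active_start` come from the diff above. The assumption is that the state change happens under a coarse lock (playing the role of mds_lock), while the metrics path reads an atomic mirror of that state and drops updates until the rank is active.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a rank; NOT the real MDSRank.
class Rank {
public:
  // State transition, performed with the coarse lock held.
  void active_start() {
    std::lock_guard<std::mutex> guard(big_lock);
    state_active = true;  // lock-protected state (stand-in for the map)
    // Publish a lock-free mirror of the state for readers that must
    // not take big_lock.
    m_is_active.store(true, std::memory_order_release);
  }

  // Safe to call without holding big_lock.
  bool is_active_lockless() const {
    return m_is_active.load(std::memory_order_acquire);
  }

private:
  std::mutex big_lock;
  bool state_active = false;
  std::atomic<bool> m_is_active{false};
};

// Hypothetical metrics consumer; NOT the real metrics aggregator.
class MetricsSink {
public:
  explicit MetricsSink(const Rank& r) : rank(r) {}

  // Returns true if the update was recorded, false if it was dropped
  // because the rank has not become active yet.
  bool handle_client_metrics(int value) {
    if (!rank.is_active_lockless()) {
      ++dropped;
      return false;
    }
    accepted.push_back(value);
    return true;
  }

  std::size_t dropped_count() const { return dropped; }
  std::size_t accepted_count() const { return accepted.size(); }

private:
  const Rank& rank;
  std::vector<int> accepted;
  std::size_t dropped = 0;
};
```

In this sketch the metrics path never touches `big_lock`, so it cannot contend with state changes. The flag is never cleared here, which matches the argument later in the conversation that up:active is treated as a terminal state for a rank.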

@leonid-s-usov
Contributor

I appreciate that as of today an instance of MDSRank may never experience a transition from active -> inactive in a way that would affect this code. However, relying on that fact here creates an implicit dependency on how MDSRank instances should be managed that isn't unit-tested (as of today). We should try to minimize such implicit dependencies, IMO

@batrick
Member Author

batrick commented May 3, 2024

> I appreciate that as of today an instance of MDSRank may never experience a transition from active -> inactive in a way that would affect this code. However, relying on that fact here creates an implicit dependency on how MDSRank instances should be managed that isn't unit-tested (as of today). We should try to minimize such implicit dependencies, IMO

up:active has always been treated as a "terminal" state for a rank, and that is reflected everywhere in the code. I don't see a benefit in making this somehow resilient to that changing.

@lxbsz
Member

lxbsz commented May 9, 2024

@batrick I think this change will fix the case of connecting to old clients that don't include my previous fixes.

@rishabh-d-dave added the wip-rishabh-testing (Rishabh's testing label) label May 20, 2024
@rishabh-d-dave
Contributor

@batrick Picking this PR for QA

@rishabh-d-dave
Contributor

This PR is under test in https://tracker.ceph.com/issues/66125.

Contributor

@leonid-s-usov leonid-s-usov left a comment

I understand that MDS can't deactivate except for shutting down, hence I'll approve as-is.

To reduce ambiguity for future readers, who, like me, could be confused by the naming and why this state never gets reset, I'd suggest the name: m_did_activate. IMO this better encodes the one-way nature of the flag.
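
The one-way nature this suggestion refers to can also be encoded in a type, not only in a name. The following is a hypothetical sketch (`OneWayFlag` and `RankState` are invented for illustration and are not part of Ceph): the flag exposes no way to clear it, so a reader cannot mistake it for state that tracks transitions back out of active.

```cpp
#include <atomic>

// Hypothetical helper: a flag that can be set but never cleared.
// The absence of a reset() is the point of the design.
class OneWayFlag {
public:
  // Idempotent: calling set() more than once has no further effect.
  void set() noexcept { flag.store(true, std::memory_order_release); }

  bool is_set() const noexcept {
    return flag.load(std::memory_order_acquire);
  }

private:
  std::atomic<bool> flag{false};
};

// Hypothetical owner using the reviewer's suggested name.
struct RankState {
  OneWayFlag m_did_activate;
};
```

Whether a wrapper like this is worth the extra type is a judgment call; the PR as discussed here relies on the convention that a rank does not leave up:active except by shutting down.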

@rishabh-d-dave
Contributor

This PR is under test in https://tracker.ceph.com/issues/66162.

@batrick
Member Author

batrick commented Jun 4, 2024

@rishabh-d-dave status?

@rishabh-d-dave
Contributor

> @rishabh-d-dave status?

First, 4 of 5 builds failed due to infra issues. Then the runs had 39 infra-related failures that persisted even in new runs. After that, re-runs couldn't be launched due to an issue in the teuthology-suite command, and since the middle of last week I had OS issues that forced me to re-install it and set up everything again. I'll keep this at the top of my priority list and get it done.

@rishabh-d-dave
Contributor

Removing my testing label for now since we have infra issues again.

@rishabh-d-dave removed the wip-rishabh-testing (Rishabh's testing label) label Jun 6, 2024
@batrick
Member Author

batrick commented Jun 11, 2024

This PR is under test in https://tracker.ceph.com/issues/66433.

@batrick
Member Author

batrick commented Jun 13, 2024

This PR is under test in https://tracker.ceph.com/issues/66462.

batrick added a commit to batrick/ceph that referenced this pull request Jun 13, 2024
* refs/pull/57084/head:
	mds: drop client metrics during recovery
@batrick
Member Author

batrick commented Jun 22, 2024

This PR is under test in https://tracker.ceph.com/issues/66609.

@batrick
Member Author

batrick commented Jun 23, 2024

@batrick batrick merged commit 426beef into ceph:main Jun 23, 2024