mds: add ceph.dir.bal.mask vxattr for MDS Balancer #52373

Open
yongseokoh wants to merge 3 commits into ceph:main from yongseokoh:wip-61777

@yongseokoh
Contributor

tracker: https://tracker.ceph.com/issues/61777

Introduction
This PR introduces the ceph.dir.bal.mask vxattr, which restricts the balancing of a subtree to a specific set of active MDS ranks. Similar to a CPU mask, it lets the load of a specific directory be balanced across a chosen group of MDS ranks. Previously, the bal_rank_mask added in #43284 supported isolating unpinned subtrees under the root directory ('/') to specific MDS ranks. With the new vxattr, individual subdirectories can also be isolated to designated MDS ranks, giving Ceph administrators finer control when tuning the performance of their deployments.

Use Cases
The first is when it is difficult to pin a subdirectory to one MDS rank. Suppose a /home/images directory contains subdirectories /0 through /99, each holding 10 million image files. Pinning the entire images directory to a single MDS rank is impractical, and pinning each of the 100 huge directories manually or with ephemeral pinning is not easy either. In this case, ceph.dir.bal.mask makes efficient resource management possible.

Second, when there are several large directories such as /home/images, performance can be optimized by distributing them across different groups of MDS ranks using ceph.dir.bal.mask. Because the existing mdsmap bal_rank_mask isolates the entire '/' directory to specific ranks, directories that share those ranks can degrade each other's performance through migration overhead. For example, suppose the mdsmap bal_rank_mask is set to 0xf and two large directories, /home/images and /home/backups, exist. If the load on /home/images suddenly increases, metadata is redistributed across ranks 0 to 3, so users of /home/backups may be affected unnecessarily by a noisy neighbor. If the two directories are instead restricted to MDS ranks 0-1 (ceph.dir.bal.mask 0x3) and 2-3 (ceph.dir.bal.mask 0xC) respectively, their effect on each other is minimized. The same approach can be applied to other directories.
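
The partitioning in the second use case can be checked with simple bit arithmetic. The following is a minimal sketch and not part of the PR; the helper name `masks_are_disjoint` is hypothetical and only illustrates how the example masks 0x3 and 0xC split the mdsmap mask 0xf without sharing a rank.

```python
def masks_are_disjoint(*masks: int) -> bool:
    """Return True if no MDS rank bit is set in more than one mask."""
    seen = 0
    for mask in masks:
        if seen & mask:
            # At least one rank is already claimed by an earlier mask.
            return False
        seen |= mask
    return True


images_mask = 0x3   # ranks 0-1, as in the example above
backups_mask = 0xC  # ranks 2-3, as in the example above
mdsmap_mask = 0xF   # ranks 0-3

# The two directory masks share no rank ...
print(masks_are_disjoint(images_mask, backups_mask))
# ... and together they cover exactly the ranks of the mdsmap mask.
print((images_mask | backups_mask) == mdsmap_mask)
```

Both checks hold for the example values. Overlapping masks such as 0x3 and 0x6 would fail the first check because rank 1 appears in both.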

How to use

# / root is distributed within MDS rank 0 to 3
# it overrides mdsmap bal_rank_mask value
setfattr -n ceph.dir.bal.mask -v 0x0f /

# /home/images subdir is distributed within MDS rank 1 and 2
setfattr -n ceph.dir.bal.mask -v 0x06 /home/images

# /home/backups subdir is distributed within MDS rank 3 and 4
setfattr -n ceph.dir.bal.mask -v 0x18 /home/backups

# remove the values; the directories then follow the parent's value or the mdsmap bal_rank_mask
setfattr -n ceph.dir.bal.mask -v -1 /home/images
setfattr -n ceph.dir.bal.mask -v -1 /home/backups

# /home/images is placed only on MDS rank 1 (similar to ceph.dir.pin)
setfattr -n ceph.dir.bal.mask -v 0x02 /home/images
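
To read the masks above, each set bit N selects MDS rank N. The sketch below decodes a mask into ranks and models the fallback described in the comments: a directory without its own mask follows the nearest ancestor's value, then the mdsmap bal_rank_mask. It is an illustrative model only. `mask_to_ranks`, `effective_mask`, and the dictionary of per-directory masks are hypothetical and do not reflect the MDS implementation.

```python
import posixpath


def mask_to_ranks(mask: int) -> list[int]:
    """List the MDS ranks whose bits are set in the mask."""
    return [rank for rank in range(mask.bit_length()) if (mask >> rank) & 1]


def effective_mask(path: str, dir_masks: dict[str, int], mdsmap_mask: int) -> int:
    """Walk from path up to '/' and return the first mask found,
    falling back to the mdsmap-wide mask when no directory sets one."""
    current = posixpath.normpath(path)
    while True:
        if current in dir_masks:
            return dir_masks[current]
        if current in ("/", ""):
            return mdsmap_mask
        current = posixpath.dirname(current)


# Hypothetical state after the setfattr calls for images and backups.
dir_masks = {"/home/images": 0x06, "/home/backups": 0x18}

print(mask_to_ranks(effective_mask("/home/images/42", dir_masks, 0x0F)))  # [1, 2]
print(mask_to_ranks(effective_mask("/home/backups", dir_masks, 0x0F)))    # [3, 4]
print(mask_to_ranks(effective_mask("/home/other", dir_masks, 0x0F)))      # [0, 1, 2, 3]
```

In this model, removing an entry from `dir_masks` corresponds to setting the value to -1, after which the lookup continues to the parent.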

@yongseokoh yongseokoh requested a review from a team as a code owner July 10, 2023 06:18
@github-actions github-actions bot added cephfs Ceph File System core mon labels Jul 10, 2023
@vshankar vshankar self-assigned this Aug 17, 2023
Member

@batrick batrick left a comment

Few things:

  • Please add the PR discussion to your commit message.
  • Because the bitset is 256 bits, it's not generally easy to compute the xattr value for the caller. I think it would be helpful to have the MDS compute the value by allowing something like setfattr -n ceph.dir.bal.mask -v 0,1,3,15 dir/ such that the MDS will do the bitwise OR of those bits.
  • I'd like to see some tests in qa/tasks/cephfs/test_exports.py. You can use vstart_runner.py to test.
  • There should be some docs added to explain this and the MDSMap bal rank mask. Which should users prefer and when? Is it valid to set the rank mask on the root directory? Are values inherited or override-able?
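
The rank-list form suggested in the second bullet amounts to OR-ing one bit per rank. A minimal sketch of that conversion follows. `ranks_to_mask` is a hypothetical helper, and the 256-rank limit is taken from the bitset size mentioned in the review rather than from the MDS source.

```python
MAX_RANKS = 256  # assumed from the 256-bit bitset mentioned in the review


def ranks_to_mask(spec: str) -> str:
    """Convert a comma-separated rank list such as '0,1,3,15' to a hex mask."""
    mask = 0
    for token in spec.split(","):
        rank = int(token.strip())  # raises ValueError on non-numeric input
        if not 0 <= rank < MAX_RANKS:
            raise ValueError(f"rank {rank} out of range 0..{MAX_RANKS - 1}")
        mask |= 1 << rank
    return hex(mask)


print(ranks_to_mask("0,1,3,15"))  # 0x800b
```

Python integers are arbitrary precision, so the same helper works for rank 255, where a fixed-width 64-bit integer would overflow.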

if (r != 0) {
  return r;
}
std::bitset<MAX_MDS> rank_mask = std::bitset<MAX_MDS>(bin_string);
Contributor

Please carve this out into a separate commit, since it goes through the fs set interface (for the mdsmap) rather than the proposed vxattr interface.

Contributor Author

I will split the PR once the discussion about the restriction on rank 0 is resolved.
#52373 (comment)

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yongseokoh
Contributor Author

@batrick @vshankar Please review my changes.

Few things:

  • Please add the PR discussion to your commit message.

Done.

  • Because the bitset is 256 bits, it's not generally easy to compute the xattr value for the caller. I think it would be helpful to have the MDS compute the value by allowing something like setfattr -n ceph.dir.bal.mask -v 0,1,3,15 dir/ such that the MDS will do the bitwise OR of those bits.

I agree. It is not easy to calculate numerous bits and configure the bitfield.

  • I'd like to see some tests in qa/tasks/cephfs/test_exports.py. You can use vstart_runner.py to test.

Test cases for ceph.dir.bal.mask were implemented in test_exports.py.

  • There should be some docs added to explain this and the MDSMap bal rank mask. Which should users prefer and when? Is it valid to set the rank mask on the root directory? Are values inherited or override-able?

How to use ceph.dir.bal.mask is explained in the document.

@idryomov idryomov removed the request for review from a team September 16, 2023 13:21
@idryomov idryomov removed the rbd label Sep 16, 2023
@yongseokoh
Contributor Author

jenkins retest this please

@yongseokoh yongseokoh force-pushed the wip-61777 branch 2 times, most recently from 4aca9f4 to 53a786c Compare September 18, 2023 04:55
@yongseokoh
Contributor Author

jenkins test make check

@yongseokoh
Contributor Author

jenkins test make check arm64

@yongseokoh
Contributor Author

jenkins test make check

@batrick
Member

batrick commented Aug 28, 2024

I'll try again, thanks.


CInode *CInode::get_rank_mask_inode(bool inherit)
{
  if (!g_conf().get_val<bool>("mds_bal_export_pin"))
Member

This will be too expensive to run so frequently. Please cache the config variable in MDCache (as we do elsewhere).

Contributor Author

@batrick I fixed it.

@yongseokoh
Contributor Author

jenkins test make check

@yongseokoh
Contributor Author

@batrick Could you please review this updated PR when you get a moment?

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@yongseokoh yongseokoh closed this Dec 17, 2024
@batrick batrick reopened this Oct 31, 2025
Contributor

@anthonyeleven anthonyeleven left a comment

Various nitpicky docs suggestions.

Contributor

Does "nicely" have a specific meaning here? Since it's in a function name, I suspect so.

Contributor Author

@anthonyeleven Could you clarify which part of the code you're referring to?

Contributor

Odd that this doesn't appear to be anchored to the line in the file.

      dout(7) << "try to export nicely " << cd->get_path() << " auth " << cd->authority().first << " to " << target << " mask " << bitmask_to_str(rank_mask_bitset) << dendl;
      mds->mdcache->migrator->export_dir_nicely(cd, target);

What does it mean to export "nicely"?

Contributor Author

@anthonyeleven
“nicely” means performing a graceful export — the directory is transferred to the target MDS without forcing or interrupting ongoing operations.
I didn’t modify this function in this change.

Contributor

Should the error string here and below include more detail, like perhaps an encoded representation of the string?

Contributor Author

@anthonyeleven Could you clarify which part you’re referring to?

Contributor

I think you may have subsequently updated the commit. I saw something like a parsing error reported, without saying what the error was and what the actual string value was.

Contributor Author

@anthonyeleven Could you please point me to the specific line of code you’re referring to?

Could you please confirm if this is the line you were referring to:
https://github.com/ceph/ceph/pull/52373/files#diff-729d5135082091929c032d8d6a0552bd2a5c658006aa7cb409764d85e64f9431R631

Contributor

I no longer see the line to which I was referring, so never mind.

@yongseokoh
Contributor Author

@anthonyeleven Please feel free to add any further comments on the MDS code section or the code block formatting — I’ll make the updates accordingly.

Contributor

@anthonyeleven anthonyeleven left a comment

docs lgtm

@yongseokoh
Contributor Author

@anthonyeleven Changes applied as suggested. Let me know if I can rebase now.

@anthonyeleven
Contributor

Docs look good to me; others need to approve the code, and I see conflicts reported.

This introduces the ceph.dir.bal.mask vxattr, which
is an option to rebalance a subtree within
specific active MDSs. Similar to a CPU mask,
this feature enables load balancing of specific
directories across multiple MDS ranks. It is especially
useful for fine-tuning and improving performance
in various scenarios. Previously, the bal_rank_mask
in ceph#43284 supported isolating unpinned subtrees under
the root directory ('/') to specific MDS ranks.
With the new vxattr, it becomes
possible to isolate specific subdirectories to
designated MDS ranks, giving Ceph administrators
more control and flexibility for optimizing
performance and fine-tuning their deployments.

tracker: https://tracker.ceph.com/issues/61777
Signed-off-by: Yongseok Oh <yongseok.oh@linecorp.com>
@github-actions

github-actions bot commented Jan 8, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 8, 2026
@ceph-jenkins
Collaborator

Can one of the admins verify this patch?

@github-actions github-actions bot removed the stale label Jan 8, 2026
@github-actions

github-actions bot commented Mar 9, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

@github-actions github-actions bot added the stale label Mar 9, 2026