
balancer: refactor calc_pg_upmaps to allow for more simplicity#44002

Merged
ljflores merged 1 commit into ceph:master from JoshSalomon:wip-primary-balancer
Jan 13, 2022

Conversation

@JoshSalomon
Contributor

@JoshSalomon JoshSalomon commented Nov 18, 2021

balancer: Balancer code refactoring

This is the first PR in a series of PRs that aim at adding primary balancer to Ceph and improving the current upmap balancer functionality. This first PR focuses on simplifying (refactoring) the code of calc_pg_upmaps so it is easier to change in the future. This PR keeps the existing functionality as-is and does not change anything but the code structure.
See issue in https://bugzilla.redhat.com/show_bug.cgi?id=1870804
See more info in https://github.com/JoshSalomon/ceph/blob/wip-primary-balancer/doc/dev/prim-balancer-design.rst
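The refactoring pattern this PR describes (splitting a monolithic function into named helpers, and replacing opaque `pair.first`/`pair.second` accesses with meaningful names) can be sketched generically. All names below are hypothetical illustrations, not Ceph's actual code:

```cpp
#include <cassert>
#include <limits>
#include <map>

// Hypothetical types standing in for OSDMap internals.
using osd_id_t = int;
using pg_count_t = int;
using load_map_t = std::map<osd_id_t, pg_count_t>;

// Helper extracted from a (hypothetical) monolithic loop: find the most
// loaded OSD. Structured bindings give 'osd' and 'pgs' meaningful names
// instead of the opaque 'it->first' and 'it->second'.
osd_id_t find_most_loaded(const load_map_t& load) {
  osd_id_t worst = -1;
  pg_count_t worst_pgs = std::numeric_limits<pg_count_t>::min();
  for (const auto& [osd, pgs] : load) {
    if (pgs > worst_pgs) {
      worst = osd;
      worst_pgs = pgs;
    }
  }
  return worst;
}

// A second extracted helper: the symmetric query. Small, named functions
// like these are easier to test and change than one long loop body.
osd_id_t find_least_loaded(const load_map_t& load) {
  osd_id_t best = -1;
  pg_count_t best_pgs = std::numeric_limits<pg_count_t>::max();
  for (const auto& [osd, pgs] : load) {
    if (pgs < best_pgs) {
      best = osd;
      best_pgs = pgs;
    }
  }
  return best;
}
```

The point of the refactor is exactly this shape: behavior is unchanged, but each step of the algorithm gets a name and a narrow interface.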

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
  • Teuthology
    • Completed teuthology run
    • No teuthology test necessary (e.g., documentation)

@ljflores
Member

Hey @JoshSalomon I see the problem with the Signed-off-by failure. A few of your commits are missing the doc: prefix. The format of all your commit titles, for this PR, should be doc: ________.

Here is the error message:

raise AssertionError("\n".join([
                "The title/s of following commit/s is/are not started with 'doc', but they only touch files under 'doc/'. Please make sure the commit titles",
                "are started with 'doc'. See the 'Submitting Patches' guide:",
                "https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#commit-title",

If you fix the commits that are missing a prefix, you should be good.

@ljflores
Member

@JoshSalomon also, not sure if this will trigger the Signed-off-by test, but you might want to reconfigure your name to be "Josh Salomon" instead of your email, like so:

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>

That is generally the convention for contributions.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@JoshSalomon
Contributor Author

@JoshSalomon also, not sure if this will trigger the Signed-off-by test, but you might want to reconfigure your name to be "Josh Salomon" instead of your email, like so:

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>

That is generally the convention for contributions.

I know, and I already changed it; what is left is a commit from before the change.

@ljflores
Member

jenkins test signed

@ljflores
Member

ljflores commented Nov 19, 2021

@JoshSalomon the Signed-off-by check looks good! I see that now the branch needs a rebase. Let me know if you need any help resolving the conflicts.

Also, are the new commits from other people meant to be in this PR? Or were they unintentionally added after the force push? If it's the latter, let me know if you need help removing the irrelevant commits.

@ljflores
Member

@JoshSalomon I see now that you actually can remove/add labels on drafts. If you want to do that, hit the gear icon next to "Labels" in the upper right corner, and you should be able to edit as you need.

@ljflores
Member

ljflores commented Dec 6, 2021

Hey @JoshSalomon,

Your PR looks okay to me. If you ever see this in the make check details:

 Identified problems

  CTest Failure
  1 tests failed out of 258
  Total Test time (real) = 1304.34 sec
  The following tests FAILED:
      32 - run-rbd-unit-tests-61.sh (Failed)

Know that it is a known failure, and that usually a retest will solve the issue. You are not making any modifications to rbd, so it's clear that this failure is unrelated.

Same goes for docs. For whatever reason, there are failures in docs right now, but your modifications to the docs passed before, so they are likely unrelated. If you are ever curious about why the docs are failing, you can click on "Details" next to "Required" and inspect the error message. But I believe in your case it's unrelated.

Edit:
If you ever want to retest your PR, leave a comment that says jenkins retest this please. This comment will tell jenkins to retrigger the tests.

Here is a list of all available jenkins commands, which you can trigger in the same way as above: by leaving a comment. This list is available on all PRs when you click on the dropdown menu that says "Show available jenkins commands":

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

@ljflores
Member

ljflores commented Dec 6, 2021

jenkins retest this please

@ljflores
Member

ljflores commented Dec 7, 2021

jenkins test api

@JoshSalomon
Contributor Author

jenkins test docs

@JoshSalomon
Contributor Author

jenkins render docs

@ljflores
Member

ljflores commented Dec 7, 2021

@JoshSalomon are you ready to take this out of draft form? If so, you can select "Ready for review" next to "This pull request is still a work in progress."

Also, let me know when you are ready for reviews. You can add me as a reviewer, and also it might be a good idea to add @neha-ojha and @jdurgin.

@ljflores ljflores removed the crimson label Dec 7, 2021
@ljflores
Member

ljflores commented Dec 7, 2021

jenkins render docs

@ljflores
Member

ljflores commented Dec 7, 2021

This is the error message from the render docs test:

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-ffhm0q7k-build/

Doesn't seem to have anything to do with your doc commits, however, we should try to get the test to pass to confirm that.

@ljflores ljflores changed the title from "Wip primary balancer" to "osd: primary balancer" Dec 21, 2021
@ljflores
Member

Newest teuthology run:
http://pulpito.front.sepia.ceph.com/lflores-2021-12-21_07:21:04-rados-wip-primary-balancer-distro-default-smithi/

It's still not done yet, but I'll add a summary of failures when the jobs complete.


@ljflores ljflores left a comment


There is some spacing that looks a little off here, but it otherwise looks good to me! I've gone through everything and it seems like you've effectively translated the same logic over into the refactoring.

The main thing I want to focus on now is ensuring that this branch is not causing regressions. In the most recent Teuthology run of this branch (see previous comment), there were a lot of failures. Looking through them, they don't seem obviously related, but I'd like to try to get more of these tests to pass.

I re-ran the failed/dead jobs here. We'll see how this run goes.
http://pulpito.front.sepia.ceph.com/lflores-2022-01-04_06:21:44-rados-wip-primary-balancer-distro-default-smithi/

Beyond that, we might want to consider rebasing the branch again to test the changes on the most recent version of master to avoid a mass of unrelated failures.

@ljflores
Member

ljflores commented Jan 4, 2022

@JoshSalomon also we will want to squash commits that relate to the same concept. We can go over this in our meeting. :)

@ljflores
Member

@JoshSalomon how is the commit squashing going? Let me know if you need help

@wangnengjie

Hello, I wonder how you design the primary balance feature without data movement? There seems to be no related code in the commits. I think it's almost impossible to achieve this through the upmap feature (especially for an EC pool), so I'm really curious about your idea.

@JoshSalomon
Contributor Author

Hello, I wonder how you design the primary balance feature without data movement? There seems to be no related code in the commits. I think it's almost impossible to achieve this through the upmap feature (especially for an EC pool), so I'm really curious about your idea.

I will let @ljflores explain how we change primaries without data movement; in any case, that will be in a new PR, not this one. This PR is just the first in a series of PRs that will improve the current upmap balancer and add primary balancing. (I thought we could make it a single PR, but I was wrong.)
Regarding EC: a primary balancer has limited value for EC pools (almost no value when working in fast mode), and currently we plan to implement the primary balancer (actually, workload balancer is a better name) for replicated pools only. In the future we may add support for EC pools that do not use fast mode, but I am not sure it is worth the effort, since the value is limited in most cases. In any case, I believe it requires a separate implementation (a different algorithm) from the one for replicated pools.

@neha-ojha
Member

Can we please update the PR title to reflect the primary aim of this PR?

@wangnengjie

Thanks for your reply. I have two questions; as I'm new to Ceph, they may be a little naive.
1. What is fast mode? Is it a feature Ceph currently supports? I didn't find a doc or blog describing it. Is it a cache layer?
2. Why does the workload balancer have limited value in EC pools? Is that due to the algorithm design or something else? In a large cluster, k+m may be set to a big number (probably > 10 or 20), so there are only a few primary PGs on each OSD; in such a case the number of primary PGs is significantly unbalanced. Maybe the bottleneck is not the disk but the network and CPU (calculation during recovery).

@JoshSalomon
Contributor Author

Can we please update the PR title to reflect the primary aim of this PR?

Done - changed the title and the description.

@JoshSalomon
Contributor Author

JoshSalomon commented Jan 12, 2022

Thanks for your reply. I have two questions. As I'm new to ceph these questions may be emm... a little bit ridiculous... What is the fast mode? Is it a feature ceph currently support? I didn't find a doc or blog describe it. Is it a cache layer? Why workload balancer has limited value in EC pools? Due to the algorithm design or something else? With a large cluster that may set k+m a big number (probably > 10 or 20), there are only several primary pgs on each osd. In such a case the number of primary pg is significantly unbalanced. Maybe the bottleneck is not disk but network and cpu (calc during recover).

For fast mode, search for fast_read in https://docs.ceph.com/en/latest/rados/operations/pools/
In EC, a read request is sent to multiple OSDs, not only to the primary (and with fast_read, to all of them), so what matters is not how the primaries are distributed but how the K data chunks of all the PGs are distributed. Therefore the implementation of the workload balancer is different. Additionally, in EC you need to move data, because balancing almost by definition means changing roles between the K data chunks and the M coding chunks (unlike the replicated case), so the cost of balancing is much higher. I am not claiming that EC does not need workload balancing; I am saying it is a different type of balancer, the cost is higher, and I think in many cases the problem is smaller, since the number of OSDs to be balanced (K * #PGs) is higher and the statistical distribution is better. (Experiments show that unbalanced primaries impact the system less when the pool has more PGs.)

@wangnengjie

Okay, I get it! Thanks a lot!
Just to confirm: although sub-read requests are sent to multiple OSDs, the sender (and decoder) of those sub-reads is the primary OSD itself, am I right? The client's read request is still sent only to the primary if CEPH_OSD_FLAG_BALANCE_READS is not set.

@ljflores ljflores changed the title from "osd: primary balancer" to "osd: refactor calc_pg_upmaps to allow for more simplicity" Jan 12, 2022
@JoshSalomon
Contributor Author

Okay, I get it! Thanks a lot! Just to say that although sub read requests will be sent to multiple OSDs, the sender (also decoder) of these sub read requests is the primary OSD itself, am I right? The client's read request still only sends to primary if CEPH_OSD_FLAG_BALANCE_READS is not set.

You are right

@neha-ojha
Member

neha-ojha commented Jan 13, 2022

@JoshSalomon @ljflores Let's target the refactor commit for Quincy, and hold off on merging the developer documentation for primary balancing (create a separate PR for it) until the feature is implemented and ready for backport, to avoid any kind of confusion.

This is the first commit in a series of commits that aims at adding a primary balancer to Ceph and improving the current upmap balancer functionality. This first commit focuses on simplifying (refactoring) the code of `calc_pg_upmaps` so it is easier to change in the future. This PR keeps the existing functionality as-is and does not change anything but the code structure.

Since part of this work is a major refactoring of OSDMap::calc_pg_upmaps, the first step is adding an --upmap-seed parameter to osdmaptool so that test results can be compared without the random factor.

Other changes made:
    - Divided sections of `OSDMap::calc_pg_upmaps` into their own separate functions
    - Renamed `tmp` to `tmp_osd_map`
    - Changed all the occurrences of `first` and `second` in the function to more meaningful names

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>
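The reproducibility argument behind the --upmap-seed parameter can be illustrated generically: two runs of a randomized pass with the same seed make identical choices, so their outputs can be compared directly. This sketch is not the osdmaptool implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Stand-in for a randomized balancing pass: shuffle candidate PG ids.
// With a fixed seed, std::mt19937 produces the same sequence every run,
// so two invocations with the same seed yield byte-identical output and
// test results can be diffed without the random factor.
std::vector<int> randomized_pass(unsigned seed) {
  std::vector<int> pgs{0, 1, 2, 3, 4, 5, 6, 7};  // candidate PG ids
  std::mt19937 rng(seed);
  std::shuffle(pgs.begin(), pgs.end(), rng);
  return pgs;
}
```

A CLI seed flag simply exposes this: pass the same seed to two osdmaptool runs and any difference in the upmap output comes from the code change, not the RNG.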
@ljflores ljflores force-pushed the wip-primary-balancer branch from c9dcf1c to 86d6d11 Compare January 13, 2022 02:30
@ljflores ljflores changed the title from "osd: refactor calc_pg_upmaps to allow for more simplicity" to "balancer: refactor calc_pg_upmaps to allow for more simplicity" Jan 13, 2022
@ljflores
Member

ljflores commented Jan 13, 2022

@JoshSalomon @neha-ojha

The commit in this PR now contains all of the refactor changes for Quincy (changes that were previously spread across commits touching src/osd/OSDMap.h/.cc and src/tools/osdmaptool.cc). #44565 contains the doc commit, which can be merged after the Quincy feature freeze this Friday.

@ljflores
Member

@xiexingguo if you have a chance to review this, it would be great to have a +1.


@xiexingguo xiexingguo left a comment


I like it. Thanks!

@ljflores ljflores merged commit 9ebf397 into ceph:master Jan 13, 2022
@JoshSalomon JoshSalomon deleted the wip-primary-balancer branch January 25, 2022 15:27