
balancer: refactor calc_pg_upmaps to allow for more simplicity#44002

Merged
ljflores merged 1 commit into ceph:master from JoshSalomon:wip-primary-balancer
Jan 13, 2022

Conversation

@JoshSalomon
Contributor

@JoshSalomon JoshSalomon commented Nov 18, 2021

balancer: Balancer code refactoring

This is the first PR in a series of PRs that aim at adding primary balancer to Ceph and improving the current upmap balancer functionality. This first PR focuses on simplifying (refactoring) the code of calc_pg_upmaps so it is easier to change in the future. This PR keeps the existing functionality as-is and does not change anything but the code structure.
See issue in https://bugzilla.redhat.com/show_bug.cgi?id=1870804
See more info in https://github.com/JoshSalomon/ceph/blob/wip-primary-balancer/doc/dev/prim-balancer-design.rst
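The refactoring pattern this PR describes (splitting a monolithic function into named helpers, and replacing opaque `pair.first`/`pair.second` accesses with meaningful names) can be sketched generically. All names below are hypothetical illustrations, not Ceph's actual code:

```cpp
#include <cassert>
#include <limits>
#include <map>

// Hypothetical types standing in for OSDMap internals.
using osd_id_t = int;
using pg_count_t = int;
using load_map_t = std::map<osd_id_t, pg_count_t>;

// Helper extracted from a (hypothetical) monolithic loop: find the most
// loaded OSD. Structured bindings give 'osd' and 'pgs' meaningful names
// instead of the opaque 'it->first' and 'it->second'.
osd_id_t find_most_loaded(const load_map_t& load) {
  osd_id_t worst = -1;
  pg_count_t worst_pgs = std::numeric_limits<pg_count_t>::min();
  for (const auto& [osd, pgs] : load) {
    if (pgs > worst_pgs) {
      worst = osd;
      worst_pgs = pgs;
    }
  }
  return worst;
}

// A second extracted helper: the symmetric query. Small, named functions
// like these are easier to test and change than one long loop body.
osd_id_t find_least_loaded(const load_map_t& load) {
  osd_id_t best = -1;
  pg_count_t best_pgs = std::numeric_limits<pg_count_t>::max();
  for (const auto& [osd, pgs] : load) {
    if (pgs < best_pgs) {
      best = osd;
      best_pgs = pgs;
    }
  }
  return best;
}
```

The point of the refactor is exactly this shape: behavior is unchanged, but each step of the algorithm gets a name and a narrow interface.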

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
  • Teuthology
    • Completed teuthology run
    • No teuthology test necessary (e.g., documentation)

@ljflores
Member

Hey @JoshSalomon I see the problem with the Signed-off-by failure. A few of your commits are missing the doc: prefix. The format of all your commit titles, for this PR, should be doc: ________.

Here is the error message:

raise AssertionError("\n".join([
                "The title/s of following commit/s is/are not started with 'doc', but they only touch files under 'doc/'. Please make sure the commit titles",
                "are started with 'doc'. See the 'Submitting Patches' guide:",
                "https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#commit-title",

If you fix the commits that are missing a prefix, you should be good.

@ljflores
Member

@JoshSalomon also, not sure if this will trigger the Signed-off-by test, but you might want to reconfigure your name to be "Josh Salomon" instead of your email, like so:

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>

That is generally the convention for contributions.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@JoshSalomon
Contributor Author

@JoshSalomon also, not sure if this will trigger the Signed-off-by test, but you might want to reconfigure your name to be "Josh Salomon" instead of your email, like so:

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>

That is generally the convention for contributions.

I know, and I already changed it; what is left is a commit from before the change.

@ljflores
Member

jenkins test signed

@ljflores
Member

ljflores commented Nov 19, 2021

@JoshSalomon the Signed-off-by check looks good! I see that now the branch needs a rebase. Let me know if you need any help resolving the conflicts.

Also, are the new commits from other people meant to be in this PR? Or were they unintentionally added after the force push? If it's the latter, let me know if you need help removing the irrelevant commits.

@ljflores
Member

@JoshSalomon I see now that you actually can remove/add labels on drafts. If you want to do that, hit the gear icon next to "Labels" in the upper right corner, and you should be able to edit as you need.

@ljflores
Member

ljflores commented Dec 6, 2021

Hey @JoshSalomon,

Your PR looks okay to me. If you ever see this in the make check details:

 Identified problems

  CTest Failure
  1 tests failed out of 258
  Total Test time (real) = 1304.34 sec
  The following tests FAILED:
      32 - run-rbd-unit-tests-61.sh (Failed)

Know that it is a known failure, and that usually a retest will solve the issue. You are not making any modifications to rbd, so it's clear that this failure is unrelated.

Same goes for docs. For whatever reason, there are failures in docs right now, but your modifications to the docs passed before, so they are likely unrelated. If you are ever curious about why the docs are failing, you can click on "Details" next to "Required" and inspect the error message. But I believe in your case it's unrelated.

Edit:
If you ever want to retest your PR, leave a comment that says jenkins retest this please. This comment will tell jenkins to retrigger the tests.

Here is a list of all available jenkins commands, which you can trigger in the same way as above: by leaving a comment. This list is available on all PRs when you click on the dropdown menu that says "Show available jenkins commands":

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

@ljflores
Member

ljflores commented Dec 6, 2021

jenkins retest this please

@ljflores
Member

ljflores commented Dec 7, 2021

jenkins test api

@JoshSalomon
Contributor Author

jenkins test docs

@JoshSalomon
Contributor Author

jenkins render docs

@ljflores
Member

ljflores commented Dec 7, 2021

@JoshSalomon are you ready to take this out of draft form? If so, you can select "Ready for review" next to "This pull request is still a work in progress."

Also, let me know when you are ready for reviews. You can add me as a reviewer, and also it might be a good idea to add @neha-ojha and @jdurgin.

@ljflores ljflores removed the crimson label Dec 7, 2021
@ljflores
Member

ljflores commented Dec 7, 2021

jenkins render docs

@ljflores
Member

ljflores commented Dec 7, 2021

This is the error message from the render docs test:

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-ffhm0q7k-build/

Doesn't seem to have anything to do with your doc commits, however, we should try to get the test to pass to confirm that.

@ljflores ljflores changed the title from "Wip primary balancer" to "osd: primary balancer" Dec 21, 2021
@ljflores
Member

Newest teuthology run:
http://pulpito.front.sepia.ceph.com/lflores-2021-12-21_07:21:04-rados-wip-primary-balancer-distro-default-smithi/

It's still not done yet, but I'll add a summary of failures when the jobs complete.


@ljflores ljflores left a comment


There is some spacing that looks a little off here, but it otherwise looks good to me! I've gone through everything and it seems like you've effectively translated the same logic over into the refactoring.

The main thing I want to focus on now is ensuring that this branch is not causing regressions. In the most recent Teuthology run of this branch (see previous comment), there were a lot of failures. Looking through them, they don't seem obviously related, but I'd like to try to get more of these tests to pass.

I re-ran the failed/dead jobs here. We'll see how this run goes.
http://pulpito.front.sepia.ceph.com/lflores-2022-01-04_06:21:44-rados-wip-primary-balancer-distro-default-smithi/

Beyond that, we might want to consider rebasing the branch again to test the changes on the most recent version of master to avoid a mass of unrelated failures.

@ljflores
Member

ljflores commented Jan 4, 2022

@JoshSalomon also we will want to squash commits that relate to the same concept. We can go over this in our meeting. :)

@ljflores
Member

@JoshSalomon how is the commit squashing going? Let me know if you need help

@wangnengjie

Hello, I wonder how you design the primary balance feature without data movement? There seems to be no related code in the commits. I think it's almost impossible to achieve this through the upmap feature (especially for an EC pool), so I'm really curious about your idea.

@JoshSalomon
Contributor Author

Hello, I wonder how you design the primary balance feature without data movement? There seems to be no related code in the commits. I think it's almost impossible to achieve this through the upmap feature (especially for an EC pool), so I'm really curious about your idea.

I will let @ljflores explain how we change primaries without data movement; in any case, that will be in a new PR, not this one. This PR is just the first in a series of PRs that will improve the current upmap balancer and add primary balancing. (I thought we could make it a single PR, but I was wrong.)
Regarding EC: a primary balancer has limited value for EC pools (almost no value when working in fast mode), and currently we plan to implement the primary balancer (actually, workload balancer is a better name) for replicated pools only. In the future we may add support for EC pools that do not use fast mode, but I am not sure it is worth the effort, since the value is limited in most cases. In any case, I believe it requires a separate implementation (a different algorithm) from the one for replicated pools.

@neha-ojha
Member

Can we please update the PR title to reflect the primary aim of this PR?

@wangnengjie

Thanks for your reply. I have two questions; as I'm new to Ceph, they may be a little naive.
1. What is fast mode? Is it a feature Ceph currently supports? I didn't find a doc or blog describing it. Is it a cache layer?
2. Why does the workload balancer have limited value in EC pools? Is that due to the algorithm design or something else? In a large cluster, k+m may be set to a big number (probably > 10 or 20), so there are only a few primary PGs on each OSD; in such a case the number of primary PGs is significantly unbalanced. Maybe the bottleneck is not the disk but the network and CPU (calculation during recovery).

@JoshSalomon
Contributor Author

Can we please update the PR title to reflect the primary aim of this PR?

Done - changed the title and the description.

@JoshSalomon
Contributor Author

JoshSalomon commented Jan 12, 2022

Thanks for your reply. I have two questions. As I'm new to ceph these questions may be emm... a little bit ridiculous... What is the fast mode? Is it a feature ceph currently support? I didn't find a doc or blog describe it. Is it a cache layer? Why workload balancer has limited value in EC pools? Due to the algorithm design or something else? With a large cluster that may set k+m a big number (probably > 10 or 20), there are only several primary pgs on each osd. In such a case the number of primary pg is significantly unbalanced. Maybe the bottleneck is not disk but network and cpu (calc during recover).

For fast mode, search for fast_read in https://docs.ceph.com/en/latest/rados/operations/pools/
In EC, a read request is sent to multiple OSDs, not only to the primary (and with fast_read, to all of them), so what matters is not how the primaries are distributed but how the K data chunks of all the PGs are distributed. Therefore the implementation of the workload balancer is different. Additionally, in EC you need to move data, because balancing almost by definition means changing roles between the K data chunks and the M coding chunks (unlike the replicated case), so the cost of balancing is much higher. I am not claiming that EC does not need workload balancing; I am saying it is a different type of balancer, the cost is higher, and I think in many cases the problem is smaller, since the number of OSDs to be balanced (K * #PGs) is higher and the statistical distribution is better. (Experiments show that unbalanced primaries impact the system less when the pool has more PGs.)

@wangnengjie

Okay, I get it! Thanks a lot!
Just to confirm: although sub-read requests are sent to multiple OSDs, the sender (and decoder) of those sub-reads is the primary OSD itself, am I right? The client's read request is still sent only to the primary if CEPH_OSD_FLAG_BALANCE_READS is not set.

@ljflores ljflores changed the title from "osd: primary balancer" to "osd: refactor calc_pg_upmaps to allow for more simplicity" Jan 12, 2022
@JoshSalomon
Contributor Author

Okay, I get it! Thanks a lot! Just to say that although sub read requests will be sent to multiple OSDs, the sender (also decoder) of these sub read requests is the primary OSD itself, am I right? The client's read request still only sends to primary if CEPH_OSD_FLAG_BALANCE_READS is not set.

You are right

@neha-ojha
Member

neha-ojha commented Jan 13, 2022

@JoshSalomon @ljflores Let's target the refactor commit for Quincy, and hold off on merging the developer documentation for primary balancing (create a separate PR for it) until the feature is implemented and ready for backport, to avoid any kind of confusion.

This is the first commit in a series of commits that aims at adding a primary balancer to Ceph and improving the current upmap balancer functionality. This first commit focuses on simplifying (refactoring) the code of `calc_pg_upmaps` so it is easier to change in the future. This PR keeps the existing functionality as-is and does not change anything but the code structure.

Since part of this work is a major refactoring of OSDMap::calc_pg_upmaps, the first step is adding an --upmap-seed parameter to osdmaptool so that test results can be compared without the random factor.

Other changes made:
    - Divided sections of `OSDMap::calc_pg_upmaps` into their own separate functions
    - Renamed `tmp` to `tmp_osd_map`
    - Changed all the occurrences of `first` and `second` in the function to more meaningful names

Signed-off-by: Josh Salomon <josh.salomon@gmail.com>
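The reproducibility argument behind the --upmap-seed parameter can be illustrated generically: two runs of a randomized pass with the same seed make identical choices, so their outputs can be compared directly. This sketch is not the osdmaptool implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Stand-in for a randomized balancing pass: shuffle candidate PG ids.
// With a fixed seed, std::mt19937 produces the same sequence every run,
// so two invocations with the same seed yield byte-identical output and
// test results can be diffed without the random factor.
std::vector<int> randomized_pass(unsigned seed) {
  std::vector<int> pgs{0, 1, 2, 3, 4, 5, 6, 7};  // candidate PG ids
  std::mt19937 rng(seed);
  std::shuffle(pgs.begin(), pgs.end(), rng);
  return pgs;
}
```

A CLI seed flag simply exposes this: pass the same seed to two osdmaptool runs and any difference in the upmap output comes from the code change, not the RNG.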
@ljflores ljflores force-pushed the wip-primary-balancer branch from c9dcf1c to 86d6d11 Compare January 13, 2022 02:30
@ljflores ljflores changed the title from "osd: refactor calc_pg_upmaps to allow for more simplicity" to "balancer: refactor calc_pg_upmaps to allow for more simplicity" Jan 13, 2022
@ljflores
Member

ljflores commented Jan 13, 2022

@JoshSalomon @neha-ojha

The commit in this PR now contains all of the refactor changes for Quincy (changes that were previously spread across commits touching src/osd/OSDMap.h/.cc and src/tools/osdmaptool.cc). #44565 contains the doc commit, which can be merged after the Quincy feature freeze this Friday.

@ljflores
Member

@xiexingguo if you have a chance to review this, it would be great to have a +1.


@xiexingguo xiexingguo left a comment


I like it. Thanks!

@ljflores ljflores merged commit 9ebf397 into ceph:master Jan 13, 2022
@JoshSalomon JoshSalomon deleted the wip-primary-balancer branch January 25, 2022 15:27