osd: Apply randomly selected scheduler type across all OSD shards #53524

Merged
yuriw merged 4 commits into ceph:main from sseshasa:wip-fix-osd-op-queue-rndmize-62171 on Dec 19, 2023

Conversation

@sseshasa
Contributor

@sseshasa sseshasa commented Sep 19, 2023

Originally, the choice of 'debug_random' for osd_op_queue and
osd_op_queue_cut_off resulted in the selection of a random scheduler
type and cut-off for each OSD shard. A more realistic scenario for testing
is selecting the random scheduler type and cut-off globally and applying
them to all shards of an OSD. In other words, all OSD shards would
employ the same scheduler type and cut-off. For example, this scenario
would arise during upgrades when the scheduler type has changed
between releases.

The following changes are made as part of this change:

  1. Determine the scheduler type and the op queue cut-off before initializing
    the OSD shards in the OSD class constructor.
  2. Pass the scheduler type string and the op queue cut-off value to the
    OSDShard's make_scheduler() method for each shard. This ensures all
    shards of the OSD are initialized with the same scheduler type and
    op queue cut-off.
  3. Modify the unused OSDShard::get_scheduler_type() method to return
    the scheduler type set for the queue.
  4. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
    virtual functions that are defined by the respective queue
    implementations. These return the string pertaining to the
    scheduler type and are called by OSDShard::get_scheduler_type().
  5. Add OSD::osd_scheduler_type() method for determining the scheduler
    type set on the OSD shards. Since all OSD shards are set to use
    the same scheduler type, the shard with the lowest id is used to
    get the scheduler type via OSDShard::get_scheduler_type().

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee sseshasa@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from 302425f to e0527b2 on September 21, 2023 09:36
@sseshasa sseshasa changed the title from "(Draft) osd, crimson/osd: Remove 'debug_random' option for osd_op_queue" to "(Draft) osd: Apply randomly selected scheduler type across all OSD shards" on Sep 21, 2023
@sseshasa sseshasa removed the crimson label Sep 21, 2023
@sseshasa sseshasa changed the title from "(Draft) osd: Apply randomly selected scheduler type across all OSD shards" to "osd: Apply randomly selected scheduler type across all OSD shards" on Sep 21, 2023
@sseshasa sseshasa removed the common label Sep 21, 2023
@sseshasa sseshasa marked this pull request as ready for review September 21, 2023 13:14
@sseshasa sseshasa requested a review from a team as a code owner September 21, 2023 13:14
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch 2 times, most recently from 2cbf1ce to c8c19d6 on September 25, 2023 15:34
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from c8c19d6 to c381c85 on September 25, 2023 17:51
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from c381c85 to d69c167 on October 6, 2023 16:38
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from d69c167 to 9130e0b on October 10, 2023 05:54
@sseshasa
Contributor Author

@athanatos PTAL at the latest changes. I addressed most of the review comments.

@athanatos athanatos self-requested a review October 25, 2023 04:56
mClockPriorityQueue (mClockQueue class) is an older mClock implementation
of the OpQueue abstraction. This was replaced by a simpler implementation
of the OpScheduler abstraction as part of
ceph#30650.

The simpler mClockScheduler implementation is currently in use.
This commit removes the unused src/common/mClockPriorityQueue.h along
with the associated unit test file: test_mclock_priority_queue.cc.

Other miscellaneous changes:
 - Remove the cmake references to the unit test file
 - Remove the inclusion of the header file in mClockScheduler.h

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…ards

Originally, the choice of 'debug_random' for osd_op_queue resulted in the
selection of a random scheduler type for each OSD shard. A more realistic
scenario for testing is selecting the random scheduler type globally and
applying it to all shards of an OSD. In other words, all OSD shards
would employ the same scheduler type. For example, this scenario would
arise during upgrades when the scheduler type has changed between
releases.

The following changes are made as part of the commit:
 1. Introduce enum class op_queue_type_t within osd_types.h that holds the
    various supported op queue types. This header is included by OpQueue.h.
    Add helper functions in osd_types.cc to return the op_queue_type_t as
    an enum or as a string representing the enum member.
 2. Determine the scheduler type before initializing the OSD shards in
    the OSD class constructor.
 3. Pass the determined op_queue_type_t to the OSDShard's make_scheduler()
    method for each shard. This ensures all shards of the OSD are
    initialized with the same scheduler type.
 4. Rename & modify the unused OSDShard::get_scheduler_type() method to
    return the op_queue_type_t set for the queue.
 5. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
    virtual functions and define them within the respective queue
    implementations. These return a value pertaining to the op queue type
    and are called by OSDShard::get_op_queue_type().
 6. Add OSD::osd_op_queue_type() method for determining the scheduler
    type set on the OSD shards. Since all OSD shards are set to use
    the same scheduler type, the shard with the lowest id is used to
    get the scheduler type via OSDShard::get_op_queue_type().
 7. Improve the comment description of the 'osd_op_queue' option in
    common/options/osd.yaml.in.

Call Flow
---------
OSD                     OSDShard                 OpScheduler/OpQueue
---                     --------                 -------------------
osd_op_queue_type() ->
                        get_op_queue_type() ->
                                                 get_type()
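The call flow above can be sketched with stand-in classes. This is an illustrative sketch, not Ceph's actual class definitions: OpSchedulerSketch, MClockSketch, OSDShardSketch, and OSDSketch are invented names modeling the OpScheduler/OSDShard/OSD chain the commit describes.

```cpp
#include <memory>
#include <string>
#include <vector>

enum class op_queue_type_t { mclock_scheduler, wpq };

std::string op_queue_type_name(op_queue_type_t t) {
  return t == op_queue_type_t::wpq ? "wpq" : "mclock_scheduler";
}

// Pure virtual get_type(), defined by each queue implementation.
struct OpSchedulerSketch {
  virtual ~OpSchedulerSketch() = default;
  virtual op_queue_type_t get_type() const = 0;
};

struct MClockSketch : OpSchedulerSketch {
  op_queue_type_t get_type() const override {
    return op_queue_type_t::mclock_scheduler;
  }
};

// Shard delegates to its scheduler.
struct OSDShardSketch {
  std::unique_ptr<OpSchedulerSketch> scheduler;
  op_queue_type_t get_op_queue_type() const { return scheduler->get_type(); }
};

struct OSDSketch {
  std::vector<OSDShardSketch> shards;
  // All shards share one type, so the lowest-id shard answers for all.
  op_queue_type_t osd_op_queue_type() const {
    return shards.front().get_op_queue_type();
  }
};
```

Querying only the lowest-id shard is safe precisely because the constructor hands every shard the same type, as the commit message explains.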

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…system

All OSD shards are guaranteed to use the same scheduler type. Therefore,
OSD::osd_op_queue_type() is used where applicable to determine the
scheduler type. This results in the appropriate setting of other config
options based on the randomly selected scheduler type when the global
'osd_op_queue' config option is set to 'debug_random' (e.g., in CI
tests).

Note: If 'osd_op_queue' is set to 'debug_random', the PG-specific code
(PGPeering, PrimaryLogPG) continues to use the existing mechanism of
querying the config option key (osd_op_queue) via get_val(), as before.

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa
Contributor Author

sseshasa commented Nov 14, 2023

Teuthology RADOS Suite Test Result

Original Run:
https://pulpito.ceph.com/sseshasa-2023-11-03_06:54:32-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/

Re-Run Failed Jobs:
https://pulpito.ceph.com/sseshasa-2023-11-08_05:53:03-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/

Overall Result:
Passed: 282/300 Failed: 16 Dead: 2

Failures Unrelated:

  1. https://tracker.ceph.com/issues/62449 - test/cls_2pc_queue: TestCls2PCQueue.MultiProducer and TestCls2PCQueue.AsyncConsumer failure.
  2. https://tracker.ceph.com/issues/61774 - centos 9 testing reveals rocksdb "Leak_StillReachable" memory leak in mons
  3. https://tracker.ceph.com/issues/61940 - "test_cephfs_mirror" fails from stray cephadm daemon
  4. https://tracker.ceph.com/issues/59142 - mgr/dashboard: fix e2e for dashboard v3
  5. https://tracker.ceph.com/issues/59196 - ceph_test_lazy_omap_stats segfault while waiting for active+clean
  6. https://tracker.ceph.com/issues/62776 - rados: cluster [WRN] overall HEALTH_WARN - do not have an application enabled
  7. https://tracker.ceph.com/issues/63192 - Fix the POOL_APP_NOT_ENABLED warning to only be generated after the new pool has no application for some amount of time

I looked into one test failure, singleton/{all/ec-inconsistent-hinfo}, shown below:
https://pulpito.ceph.com/sseshasa-2023-11-08_05:53:03-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/7451781

The above failure was due to a random "osd_op_queue_cut_off" value being set for each OSD shard. Ideally, the cut-off must be the same across all the OSD shards. After fixing this (see commit: "osd: Apply randomly determined IO priority cutoff across all OSD shards") and ensuring that all OSD shards are assigned the same cut-off value, the above test passed as shown in the runs below (test run 10 times):

  1. https://pulpito.ceph.com/sseshasa-2023-11-14_03:11:20-rados:singleton-wip-sseshasa-testing-2023-11-14-distro-default-smithi/
  2. https://pulpito.ceph.com/sseshasa-2023-11-14_08:58:45-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/
  3. https://pulpito.ceph.com/sseshasa-2023-11-14_08:59:36-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from 9130e0b to be931cd on November 14, 2023 04:25
@sseshasa
Contributor Author

sseshasa commented Nov 14, 2023

@athanatos PTAL at the latest commit (Title: osd: Apply randomly determined IO priority cutoff across all OSD shards), which determines and applies the osd_op_queue_cut_off (when set to 'debug_random') across OSD shards. I had overlooked this parameter in my earlier attempt. Except for the new commit mentioned above, all other changes remain the same.

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from be931cd to d2b5a33 on November 14, 2023 04:51
Determine the op priority cutoff for an OSD and apply it to all the OSD
shards, which is a more realistic scenario. Previously, the cut-off value
was randomized between OSD shards, leading to issues in testing. The IO
priority cut-off is now determined before initializing the OSD shards.
The cut-off value is then passed to the OpScheduler implementations,
which are modified to apply the value during initialization.
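A minimal sketch of the fix described above, with illustrative names (resolve_cutoff and cutoff_t are not Ceph's identifiers): osd_op_queue_cut_off is resolved once, before the shards are created, so a 'debug_random' setting is rolled a single time for the whole OSD rather than once per shard.

```cpp
#include <random>
#include <string>

enum class cutoff_t { low, high };

// Map the configured string to a concrete cutoff. For 'debug_random',
// one roll decides the value applied to every shard's scheduler.
cutoff_t resolve_cutoff(const std::string& configured, std::mt19937& rng) {
  if (configured == "low") return cutoff_t::low;
  if (configured == "high") return cutoff_t::high;
  return (rng() % 2 == 0) ? cutoff_t::low : cutoff_t::high;
}
```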

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from d2b5a33 to bfbc6b6 on November 14, 2023 05:59
@sseshasa
Contributor Author

jenkins test make check

@athanatos
Contributor

@sseshasa Has the new commit been tested? If so, feel free to merge with my review.

@sseshasa
Contributor Author

@sseshasa Has the new commit been tested? If so, feel free to merge with my review.

@athanatos Yes, the following teuthology runs were performed with the new commit included:

  1. https://pulpito.ceph.com/sseshasa-2023-11-14_03:11:20-rados:singleton-wip-sseshasa-testing-2023-11-14-distro-default-smithi/
  2. https://pulpito.ceph.com/sseshasa-2023-11-14_08:58:45-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/
  3. https://pulpito.ceph.com/sseshasa-2023-11-14_08:59:36-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/

Additionally, I will get this PR tested again as part of Yuri's main runs to ensure nothing's broken.

@sseshasa
Contributor Author

@yuriw @ljflores Can you please include this in one of the next main runs? Thanks!

@yuriw yuriw merged commit 33f2b8c into ceph:main Dec 19, 2023
sseshasa referenced this pull request Dec 21, 2023
With the osd_delete_sleep_ssd and osd_delete_sleep_hdd options disabled with mClock, it was noticed that PG deletion completed much faster with the mClock scheduler.
In order to give mClock a more accurate cost for the PG deletion operation, we calculate it by taking into account how many objects are being deleted.

Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
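The costing idea in that commit can be sketched as follows. This is a hypothetical sketch, not Ceph's implementation: PgDeleteCostSketch and its fields are invented here to show the shape of an object-count-scaled cost, as opposed to a flat per-operation charge.

```cpp
#include <cstdint>

struct PgDeleteCostSketch {
  uint64_t base_cost;        // fixed overhead per delete transaction
  uint64_t cost_per_object;  // incremental cost per object removed

  // Cost grows with the number of objects the transaction deletes, so
  // mClock throttles large deletes more than small ones.
  uint64_t cost(uint64_t num_objects) const {
    return base_cost + cost_per_object * num_objects;
  }
};
```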
sseshasa added a commit to sseshasa/ceph that referenced this pull request Feb 26, 2024
…orelist

The changes introduced in PR: ceph#53524
made the randomized values of osd_op_queue and osd_op_queue_cut_off
consistent across all OSD shards.

Due to the above, ec-inconsistent-hinfo test could fail with the following
cluster warning (benign) depending on the randomly selected scheduler type.

"cluster [WRN] Error(s) ignored for 2:ad551702:::test:head enough copies
available"

In summary, the warning is generated due to the difference in PG
deletion rates between the WPQ and mClock schedulers. The warning
therefore shows up when mClock is the op queue scheduler randomly
chosen for the test. PG deletion with the mClock scheduler is quicker
than with WPQ, since mClock does not sleep between delete transactions
and instead relies on the deletion cost, which is proportional to the
average size of the objects in the PG.

For a more detailed analysis, see the associated tracker.

Fixes: https://tracker.ceph.com/issues/64573
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
sseshasa added a commit to sseshasa/ceph that referenced this pull request Mar 13, 2024
bill-scales pushed a commit to bill-scales/ceph that referenced this pull request Mar 13, 2024
sseshasa added a commit to sseshasa/ceph that referenced this pull request Mar 26, 2024
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Aug 7, 2024