osd: Apply randomly selected scheduler type across all OSD shards #53524

Merged
yuriw merged 4 commits into ceph:main from sseshasa:wip-fix-osd-op-queue-rndmize-62171 on Dec 19, 2023

Conversation

@sseshasa
Contributor

@sseshasa sseshasa commented Sep 19, 2023

Originally, the choice of 'debug_random' for osd_op_queue and
osd_op_queue_cut_off resulted in the selection of a random scheduler
type and cut-off for each OSD shard. A more realistic scenario for testing
is selecting the random scheduler type and cut-off globally and applying
them to all shards of an OSD. In other words, all OSD shards would
employ the same scheduler type and cut-off. For example, this scenario
would arise during upgrades when the scheduler type has changed
between releases.

The following changes are made as part of this change:

  1. Determine the scheduler type and the op queue cut-off before initializing
    the OSD shards in the OSD class constructor.
  2. Pass the scheduler type string and the op queue cut-off value to the
    OSDShard's make_scheduler() method for each shard. This ensures all
    shards of the OSD are initialized with the same scheduler type and
    op queue cut-off.
  3. Modify the unused OSDShard::get_scheduler_type() method to return
    the scheduler type set for the queue.
  4. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
    virtual functions that are defined by the respective queue
    implementations. These return the string pertaining to the
    scheduler type and are called by OSDShard::get_scheduler_type().
  5. Add OSD::osd_scheduler_type() method for determining the scheduler
    type set on the OSD shards. Since all OSD shards are set to use
    the same scheduler type, the shard with the lowest id is used to
    get the scheduler type via OSDShard::get_scheduler_type().

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee sseshasa@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from 302425f to e0527b2 on September 21, 2023 09:36
@sseshasa sseshasa changed the title from "(Draft) osd, crimson/osd: Remove 'debug_random' option for osd_op_queue" to "(Draft) osd: Apply randomly selected scheduler type across all OSD shards" on Sep 21, 2023
@sseshasa sseshasa removed the crimson label Sep 21, 2023
@sseshasa sseshasa changed the title from "(Draft) osd: Apply randomly selected scheduler type across all OSD shards" to "osd: Apply randomly selected scheduler type across all OSD shards" on Sep 21, 2023
@sseshasa sseshasa removed the common label Sep 21, 2023
@sseshasa sseshasa marked this pull request as ready for review September 21, 2023 13:14
@sseshasa sseshasa requested a review from a team as a code owner September 21, 2023 13:14
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch 2 times, most recently from 2cbf1ce to c8c19d6 on September 25, 2023 15:34
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from c8c19d6 to c381c85 on September 25, 2023 17:51
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from c381c85 to d69c167 on October 6, 2023 16:38
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from d69c167 to 9130e0b on October 10, 2023 05:54
@sseshasa
Contributor Author

@athanatos PTAL at the latest changes. I addressed most of the review comments.

@athanatos athanatos self-requested a review October 25, 2023 04:56
mClockPriorityQueue (mClockQueue class) is an older mClock implementation
of the OpQueue abstraction. This was replaced by a simpler implementation
of the OpScheduler abstraction as part of
ceph#30650.

The simpler mClockScheduler implementation is currently in use.
This commit removes the unused src/common/mClockPriorityQueue.h along
with the associated unit test file: test_mclock_priority_queue.cc.

Other miscellaneous changes:
 - Remove the cmake references to the unit test file
 - Remove the inclusion of the header file in mClockScheduler.h

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…ards

Originally, the choice of 'debug_random' for osd_op_queue resulted in the
selection of a random scheduler type for each OSD shard. A more realistic
scenario for testing is selecting the random scheduler type globally and
applying it to all shards of an OSD. In other words, all OSD shards
would employ the same scheduler type. For example, this scenario would
arise during upgrades when the scheduler type has changed between
releases.

The following changes are made as part of the commit:
 1. Introduce enum class op_queue_type_t within osd_types.h that holds the
    various supported op queue types. This header is included by OpQueue.h.
    Add helper functions in osd_types.cc to return the op_queue_type_t as
    an enum or as a string representing the enum member.
 2. Determine the scheduler type before initializing the OSD shards in
    the OSD class constructor.
 3. Pass the determined op_queue_type_t to the OSDShard's make_scheduler()
    method for each shard. This ensures all shards of the OSD are
    initialized with the same scheduler type.
 4. Rename & modify the unused OSDShard::get_scheduler_type() method to
    return the op_queue_type_t set for the queue.
 5. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
    virtual functions and define them within the respective queue
    implementations. These return a value pertaining to the op queue type
    and are called by OSDShard::get_op_queue_type().
 6. Add OSD::osd_op_queue_type() method for determining the scheduler
    type set on the OSD shards. Since all OSD shards are set to use
    the same scheduler type, the shard with the lowest id is used to
    get the scheduler type via OSDShard::get_op_queue_type().
 7. Improve the comment description of the 'osd_op_queue' option in
    common/options/osd.yaml.in.

Call Flow
---------
OSD                     OSDShard                 OpScheduler/OpQueue
---                     --------                 -------------------
osd_op_queue_type() ->
                        get_op_queue_type() ->
                                                 get_type()
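The call flow above can be sketched with stand-in classes. This is an illustrative sketch, not Ceph's actual class definitions: OpSchedulerSketch, MClockSketch, OSDShardSketch, and OSDSketch are invented names modeling the OpScheduler/OSDShard/OSD chain the commit describes.

```cpp
#include <memory>
#include <string>
#include <vector>

enum class op_queue_type_t { mclock_scheduler, wpq };

std::string op_queue_type_name(op_queue_type_t t) {
  return t == op_queue_type_t::wpq ? "wpq" : "mclock_scheduler";
}

// Pure virtual get_type(), defined by each queue implementation.
struct OpSchedulerSketch {
  virtual ~OpSchedulerSketch() = default;
  virtual op_queue_type_t get_type() const = 0;
};

struct MClockSketch : OpSchedulerSketch {
  op_queue_type_t get_type() const override {
    return op_queue_type_t::mclock_scheduler;
  }
};

// Shard delegates to its scheduler.
struct OSDShardSketch {
  std::unique_ptr<OpSchedulerSketch> scheduler;
  op_queue_type_t get_op_queue_type() const { return scheduler->get_type(); }
};

struct OSDSketch {
  std::vector<OSDShardSketch> shards;
  // All shards share one type, so the lowest-id shard answers for all.
  op_queue_type_t osd_op_queue_type() const {
    return shards.front().get_op_queue_type();
  }
};
```

Querying only the lowest-id shard is safe precisely because the constructor hands every shard the same type, as the commit message explains.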

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…system

All OSD shards are guaranteed to use the same scheduler type. Therefore,
OSD::osd_op_queue_type() is used where applicable to determine the
scheduler type. This results in the appropriate setting of other config
options based on the randomly selected scheduler type when the global
'osd_op_queue' config option is set to 'debug_random' (e.g., in CI
tests).

Note: If 'osd_op_queue' is set to 'debug_random', the PG-specific code
(PGPeering, PrimaryLogPG) continues to use the existing mechanism of
querying the config option key (osd_op_queue) via get_val(), as before.

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa
Contributor Author

sseshasa commented Nov 14, 2023

Teuthology RADOS Suite Test Result

Original Run:
https://pulpito.ceph.com/sseshasa-2023-11-03_06:54:32-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/

Re-Run Failed Jobs:
https://pulpito.ceph.com/sseshasa-2023-11-08_05:53:03-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/

Overall Result:
Passed: 282/300 Failed: 16 Dead: 2

Failures Unrelated:

  1. https://tracker.ceph.com/issues/62449 - test/cls_2pc_queue: TestCls2PCQueue.MultiProducer and TestCls2PCQueue.AsyncConsumer failure.
  2. https://tracker.ceph.com/issues/61774 - centos 9 testing reveals rocksdb "Leak_StillReachable" memory leak in mons
  3. https://tracker.ceph.com/issues/61940 - "test_cephfs_mirror" fails from stray cephadm daemon
  4. https://tracker.ceph.com/issues/59142 - mgr/dashboard: fix e2e for dashboard v3
  5. https://tracker.ceph.com/issues/59196 - ceph_test_lazy_omap_stats segfault while waiting for active+clean
  6. https://tracker.ceph.com/issues/62776 - rados: cluster [WRN] overall HEALTH_WARN - do not have an application enabled
  7. https://tracker.ceph.com/issues/63192 - Fix the POOL_APP_NOT_ENABLED warning to only be generated after the new pool has no application for some amount of time

I looked into one test failure, singleton/{all/ec-inconsistent-hinfo}, shown below:
https://pulpito.ceph.com/sseshasa-2023-11-08_05:53:03-rados-wip-sseshasa-testing-2023-11-02-1957-distro-default-smithi/7451781

The above failure was due to a random "osd_op_queue_cut_off" value being set for each OSD shard. Ideally, the cut-off must be the same across all the OSD shards. After fixing this (see commit: "osd: Apply randomly determined IO priority cutoff across all OSD shards") and ensuring that all OSD shards are assigned the same cut-off value, the above test passed as shown in the runs below (test run 10 times):

  1. https://pulpito.ceph.com/sseshasa-2023-11-14_03:11:20-rados:singleton-wip-sseshasa-testing-2023-11-14-distro-default-smithi/
  2. https://pulpito.ceph.com/sseshasa-2023-11-14_08:58:45-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/
  3. https://pulpito.ceph.com/sseshasa-2023-11-14_08:59:36-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from 9130e0b to be931cd on November 14, 2023 04:25
@sseshasa
Contributor Author

sseshasa commented Nov 14, 2023

@athanatos PTAL at the latest commit (Title: osd: Apply randomly determined IO priority cutoff across all OSD shards), which determines and applies the osd_op_queue_cut_off (when set to 'debug_random') across OSD shards. I had overlooked this parameter in my earlier attempt. Except for the new commit mentioned above, all other changes remain the same.

@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from be931cd to d2b5a33 on November 14, 2023 04:51
Determine the op priority cutoff for an OSD and apply it to all the OSD
shards, which is a more realistic scenario. Previously, the cut-off value
was randomized between OSD shards, leading to issues in testing. The IO
priority cut-off is now determined before initializing the OSD shards.
The cut-off value is then passed to the OpScheduler implementations,
which are modified to apply the value during initialization.
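A minimal sketch of the fix described above, with illustrative names (resolve_cutoff and cutoff_t are not Ceph's identifiers): osd_op_queue_cut_off is resolved once, before the shards are created, so a 'debug_random' setting is rolled a single time for the whole OSD rather than once per shard.

```cpp
#include <random>
#include <string>

enum class cutoff_t { low, high };

// Map the configured string to a concrete cutoff. For 'debug_random',
// one roll decides the value applied to every shard's scheduler.
cutoff_t resolve_cutoff(const std::string& configured, std::mt19937& rng) {
  if (configured == "low") return cutoff_t::low;
  if (configured == "high") return cutoff_t::high;
  return (rng() % 2 == 0) ? cutoff_t::low : cutoff_t::high;
}
```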

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa sseshasa force-pushed the wip-fix-osd-op-queue-rndmize-62171 branch from d2b5a33 to bfbc6b6 on November 14, 2023 05:59
@sseshasa
Contributor Author

jenkins test make check

@athanatos
Contributor

@sseshasa Has the new commit been tested? If so, feel free to merge with my review.

@sseshasa
Contributor Author

@sseshasa Has the new commit been tested? If so, feel free to merge with my review.

@athanatos Yes, the following teuthology runs were performed with the new commit included:

  1. https://pulpito.ceph.com/sseshasa-2023-11-14_03:11:20-rados:singleton-wip-sseshasa-testing-2023-11-14-distro-default-smithi/
  2. https://pulpito.ceph.com/sseshasa-2023-11-14_08:58:45-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/
  3. https://pulpito.ceph.com/sseshasa-2023-11-14_08:59:36-rados:singleton-wip-sseshasa-testing-2023-11-14_1-distro-default-smithi/

Additionally, I will get this PR tested again as part of Yuri's main runs to ensure nothing's broken.

@sseshasa
Contributor Author

@yuriw @ljflores Can you please include this in one of the next main runs? Thanks!

@yuriw yuriw merged commit 33f2b8c into ceph:main Dec 19, 2023
sseshasa referenced this pull request Dec 21, 2023
With the osd_delete_sleep_ssd and osd_delete_sleep_hdd options disabled with mClock, it was noticed that PG deletion completed much faster with the mClock scheduler.
In order to give mClock a more accurate cost for the PG deletion operation, we calculate it by taking into account how many objects are being deleted.

Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
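The costing idea in that commit can be sketched as follows. This is a hypothetical sketch, not Ceph's implementation: PgDeleteCostSketch and its fields are invented here to show the shape of an object-count-scaled cost, as opposed to a flat per-operation charge.

```cpp
#include <cstdint>

struct PgDeleteCostSketch {
  uint64_t base_cost;        // fixed overhead per delete transaction
  uint64_t cost_per_object;  // incremental cost per object removed

  // Cost grows with the number of objects the transaction deletes, so
  // mClock throttles large deletes more than small ones.
  uint64_t cost(uint64_t num_objects) const {
    return base_cost + cost_per_object * num_objects;
  }
};
```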
sseshasa added a commit to sseshasa/ceph that referenced this pull request Feb 26, 2024
…orelist

The changes introduced in PR: ceph#53524
made the randomized values of osd_op_queue and osd_op_queue_cut_off
consistent across all OSD shards.

Due to the above, ec-inconsistent-hinfo test could fail with the following
cluster warning (benign) depending on the randomly selected scheduler type.

"cluster [WRN] Error(s) ignored for 2:ad551702:::test:head enough copies
available"

In summary, the warning is generated due to the difference in PG
deletion rates between the WPQ and mClock schedulers. The warning
therefore shows up when mClock is the op queue scheduler randomly
chosen for the test. PG deletion with the mClock scheduler is quicker
than with WPQ, since mClock does not sleep between delete transactions
and instead relies on the deletion cost, which is proportional to the
average size of the objects in the PG.

For a more detailed analysis, see the associated tracker.

Fixes: https://tracker.ceph.com/issues/64573
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
sseshasa added a commit to sseshasa/ceph that referenced this pull request Mar 13, 2024
bill-scales pushed a commit to bill-scales/ceph that referenced this pull request Mar 13, 2024
sseshasa added a commit to sseshasa/ceph that referenced this pull request Mar 26, 2024
mkogan1 pushed a commit to mkogan1/ceph that referenced this pull request Aug 7, 2024