osd: Apply randomly selected scheduler type across all OSD shards #53524
Conversation
force-pushed from 302425f to e0527b2
force-pushed from 2cbf1ce to c8c19d6
force-pushed from c8c19d6 to c381c85
force-pushed from c381c85 to d69c167
force-pushed from d69c167 to 9130e0b
@athanatos PTAL at the latest changes. I addressed most of the review comments.
mClockPriorityQueue (the mClockQueue class) is an older mClock implementation of the OpQueue abstraction. It was replaced by a simpler implementation of the OpScheduler abstraction as part of ceph#30650, and that simpler mClockScheduler implementation is currently in use. This commit removes the unused src/common/mClockPriorityQueue.h along with the associated unit test file, test_mclock_priority_queue.cc.

Other miscellaneous changes:
- Remove the cmake references to the unit test file
- Remove the inclusion of the header file in mClockScheduler.h

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: Apply randomly selected scheduler type across all OSD shards
Originally, the choice of 'debug_random' for osd_op_queue resulted in the
selection of a random scheduler type for each OSD shard. A more realistic
scenario for testing would be the selection of the random scheduler type
applied globally for all shards of an OSD. In other words, all OSD shards
would employ the same scheduler type. For example, this scenario would
be possible during upgrades when the scheduler type has changed between
releases.
The following changes are made as part of the commit:
1. Introduce enum class op_queue_type_t within osd_types.h that holds the
various op queue types supported. This header is included by OpQueue.h.
Add helper functions in osd_types.cc to return the op_queue_type_t as an
enum or a string representing the enum member.
2. Determine the scheduler type before initializing the OSD shards in
OSD class constructor.
3. Pass the determined op_queue_type_t to the OSDShard's make_scheduler()
method for each shard. This ensures all shards of the OSD are
initialized with the same scheduler type.
4. Rename & modify the unused OSDShard::get_scheduler_type() method to
return op_queue_type_t set for the queue.
5. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
virtual functions and define them within the respective queue
implementations. Each returns a value identifying the op queue type
and is called by OSDShard::get_op_queue_type().
6. Add OSD::osd_op_queue_type() method for determining the scheduler
type set on the OSD shards. Since all OSD shards are set to use
the same scheduler type, the shard with the lowest id is used to
get the scheduler type using OSDShard::get_op_queue_type().
7. Improve comment description related to 'osd_op_queue' option in
common/options/osd.yaml.in.
Call Flow
--------
OSD OSDShard OpScheduler/OpQueue
--- -------- -------------------
osd_op_queue_type() ->
get_op_queue_type() ->
get_type()
Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
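The enum and helper described in changes (1) and (5) above can be sketched roughly as follows. This is an illustrative sketch only: the exact enumerators, values, and helper name in src/osd/osd_types.h and osd_types.cc may differ from what is shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of the enum holding the supported op queue
// types, as described in change (1). Names/values are assumptions.
enum class op_queue_type_t : uint8_t {
  WeightedPriorityQueue = 0,
  mClockScheduler,
  PrioritizedQueue
};

// Hypothetical helper mirroring the osd_types.cc helpers that map
// the enum to the string used by the 'osd_op_queue' config option.
std::string get_op_queue_type_name(const op_queue_type_t &q)
{
  switch (q) {
  case op_queue_type_t::WeightedPriorityQueue:
    return "wpq";
  case op_queue_type_t::mClockScheduler:
    return "mclock_scheduler";
  case op_queue_type_t::PrioritizedQueue:
    return "PrioritizedQueue";
  }
  return "unknown";
}
```

With something like this in place, OSD::osd_op_queue_type() can walk the chain shown in the call flow above: it asks the lowest-id shard via OSDShard::get_op_queue_type(), which in turn calls the queue's get_type() override.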
…system

All OSD shards are guaranteed to use the same scheduler type. Therefore, OSD::osd_op_queue_type() is used where applicable to determine the scheduler type. This results in the appropriate setting of other config options based on the randomly selected scheduler type when the global 'osd_op_queue' config option is set to 'debug_random' (for example, in CI tests).

Note: If 'osd_op_queue' is set to 'debug_random', the PG-specific code (PGPeering, PrimaryLogPG) continues to use the existing mechanism of querying the config option key (osd_op_queue) via get_val().

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Teuthology RADOS Suite Test Result
- Original Run:
- Re-Run Failed Jobs:
- Overall Result: Failures Unrelated
I looked into one test failure, singleton/{all/ec-inconsistent-hinfo}, shown below. The failure was caused by a random "osd_op_queue_cut_off" value being set for each OSD shard; ideally, the cutoff must be the same across all the OSD shards. After fixing this (see commit: osd: Apply randomly determined IO priority cutoff across all OSD shards) and ensuring that all OSD shards are assigned the same cutoff value, the above test passed as shown in the runs below (test run 10 times):
force-pushed from 9130e0b to be931cd
@athanatos PTAL at the latest commit (Title: osd: Apply randomly determined IO priority cutoff across all OSD shards) that determines and applies the osd_op_queue_cut_off (when set to 'debug_random') across OSD shards. I had overlooked this parameter in my earlier attempt. Except for the new commit mentioned above, all other changes remain the same.
force-pushed from be931cd to d2b5a33
Determine the op priority cutoff for an OSD and apply it on all the OSD shards, which is a more realistic scenario. Previously, the cutoff value was randomized between OSD shards, leading to issues in testing.

The IO priority cutoff is first determined before initializing the OSD shards. The cutoff value is then passed to the OpScheduler implementations, which are modified to apply the value during initialization.

Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
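The idea in the commit above can be sketched as a small helper that resolves the cutoff exactly once, before the shards are constructed, so every shard receives the same value. This is a hedged sketch, not the actual Ceph code: the function name and its call site are assumptions, though the CEPH_MSG_PRIO_LOW/HIGH values (64 and 196) do come from Ceph's include/msgr.h.

```cpp
#include <cassert>
#include <random>
#include <string>

// Priority constants as defined in Ceph's include/msgr.h.
static constexpr unsigned CEPH_MSG_PRIO_LOW  = 64;
static constexpr unsigned CEPH_MSG_PRIO_HIGH = 196;

// Hypothetical helper: resolve 'osd_op_queue_cut_off' to a concrete
// priority once per OSD. With 'debug_random', the coin flip happens
// here (once), not per shard, so all shards share the result.
unsigned get_io_prio_cut(const std::string &cutoff_opt)
{
  if (cutoff_opt == "debug_random") {
    std::random_device rd;
    return (rd() % 2) ? CEPH_MSG_PRIO_HIGH : CEPH_MSG_PRIO_LOW;
  }
  return (cutoff_opt == "high") ? CEPH_MSG_PRIO_HIGH : CEPH_MSG_PRIO_LOW;
}
```

The resolved value would then be handed to each shard's scheduler constructor, rather than letting every shard re-evaluate the option independently.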
force-pushed from d2b5a33 to bfbc6b6
jenkins test make check
@sseshasa Has the new commit been tested? If so, feel free to merge with my review.
@athanatos Yes, the following teuthology runs were with the new commit included:
Additionally, I will get this PR tested again as part of Yuri's main runs to ensure nothing's broken.
With the osd_delete_sleep_ssd and osd_delete_sleep_hdd options disabled under mClock, it was noticed that PG deletion completed much faster with the mClock scheduler. In order to give mClock a more accurate cost for the PG deletion operation, we calculate it by taking into account how many objects are being deleted.

Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
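The costing idea above can be sketched as follows. The function name and parameters are illustrative, not the actual Ceph symbols; the point is simply that the cost charged to mClock scales with the number of objects removed per delete transaction instead of being a fixed (near-zero) value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: charge mClock a PG-deletion cost proportional
// to the number of objects removed in one delete transaction.
uint32_t pg_deletion_cost(uint32_t num_objects_in_txn,
                          uint32_t per_object_cost)
{
  // Linear in the objects deleted: a transaction removing more
  // objects is charged more, slowing the deletion rate accordingly.
  return num_objects_in_txn * per_object_cost;
}
```

Under this scheme a transaction deleting 30 objects costs 30x the per-object cost, so mClock throttles large deletions without needing the osd_delete_sleep_* sleeps.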
…orelist

The changes introduced in PR ceph#53524 made the randomized values of osd_op_queue and osd_op_queue_cut_off consistent across all OSD shards. As a result, the ec-inconsistent-hinfo test could fail with the following (benign) cluster warning, depending on the randomly selected scheduler type:

"cluster [WRN] Error(s) ignored for 2:ad551702:::test:head enough copies available"

In summary, the warning is generated due to the difference in PG deletion rates between the WPQ and mClock schedulers; it therefore shows up when the mClock scheduler is randomly chosen as the op queue scheduler for the test. PG deletion with the mClock scheduler is quicker than with the WPQ scheduler, since mClock doesn't sleep between delete transactions and instead relies on the deletion cost, which is proportional to the average size of the objects in the PG. For a more detailed analysis, see the associated tracker.

Fixes: https://tracker.ceph.com/issues/64573
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
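A fix along these lines would typically add the benign warning to the test's log ignorelist. The fragment below is an illustrative teuthology YAML sketch, not the exact contents of the suite file:

```yaml
# Hypothetical suite fragment: ignore the benign warning regardless
# of which scheduler 'debug_random' selects for the test.
log-ignorelist:
  - 'Error\(s\) ignored for .* enough copies available'
```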
Originally, the choice of 'debug_random' for osd_op_queue and
osd_op_queue_cut_off resulted in the selection of a random scheduler
type and cut-off for each OSD shard. A more realistic scenario for
testing would be the selection of the random scheduler type and cut-off
applied globally for all shards of an OSD. In other words, all OSD
shards would employ the same scheduler type and cut-off. For example,
this scenario would be possible during upgrades when the scheduler type
has changed between releases.
The following changes are made as part of this change:
1. Determine the scheduler type and op queue cut-off before initializing
the OSD shards in OSD class constructor.
2. Pass the determined scheduler type and cut-off to the OSDShard's
make_scheduler() method for each shard. This ensures all
shards of the OSD are initialized with the same scheduler type and the
op queue cut-off.
3. Rename & modify the unused OSDShard::get_scheduler_type() method to return
the scheduler type set for the queue.
4. Introduce OpScheduler::get_type() and OpQueue::get_type() pure
virtual functions that are defined by the respective queue
implementation. This returns the string pertaining to the
scheduler type. This is called by OSDShard::get_scheduler_type().
5. Add OSD::osd_op_queue_type() method for determining the scheduler
type set on the OSD shards. Since all OSD shards are set to use
the same scheduler type, the shard with the lowest id is used to
get the scheduler type using OSDShard::get_scheduler_type().
Fixes: https://tracker.ceph.com/issues/62171
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>