
osd: Add mechanism to avoid running OSD bench on every OSD init when mclock_scheduler is enabled #42133

Merged
neha-ojha merged 10 commits into ceph:master from sseshasa:wip-persist-osd-iops-cap-mclock
Jul 30, 2021

@sseshasa (Contributor) commented Jul 1, 2021

The change-set implements the following to help avoid running OSD benchmark tests on every OSD init
with mclock_scheduler enabled:

  • Implement helper method to store config option key/value to the MON store.
  • Implement ConfigProxy and md_config_t methods to return the default value of a config option.
  • Use the methods introduced above to implement the logic that avoids running the benchmark every time.
  • Introduce a new config option to help force-run the OSD benchmark when required.
  • A set of commits to address the standalone failures observed with mclock_scheduler. (Added July 12, 2021)

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@sseshasa sseshasa marked this pull request as ready for review July 1, 2021 11:00
@sseshasa sseshasa force-pushed the wip-persist-osd-iops-cap-mclock branch from 486d0d9 to 4da7f99 Compare July 1, 2021 11:31
@sseshasa sseshasa removed the needs-qa label Jul 1, 2021
@sseshasa sseshasa force-pushed the wip-persist-osd-iops-cap-mclock branch from 4da7f99 to 7ac4b8b Compare July 12, 2021 12:54
@sseshasa sseshasa force-pushed the wip-persist-osd-iops-cap-mclock branch from 7ac4b8b to 8f8266d Compare July 14, 2021 07:42
@neha-ojha (Member) left a comment:

overall looks good, left some minor comments

In future, we can introduce a dev option to skip the benchmarking step completely for some teuthology tests or configurations (like filestore, which is not being considered for QoS).

Also, it would be good to document what a user (perhaps, only advanced) needs to do to skip the osd benchmark and override the osd_mclock_max_capacity_iops with a value they would like to use.

@sseshasa (Contributor Author) commented:

> overall looks good, left some minor comments
>
> In future, we can introduce a dev option to skip the benchmarking step completely for some teuthology tests or configurations (like filestore, which is not being considered for QoS).
>
> Also, it would be good to document what a user (perhaps, only advanced) needs to do to skip the osd benchmark and override the osd_mclock_max_capacity_iops with a value they would like to use.

Yes, I will make a note of this and implement it in a follow-up PR. Thanks!

@sseshasa sseshasa force-pushed the wip-persist-osd-iops-cap-mclock branch 2 times, most recently from 6b55bba to b9e7c8a Compare July 16, 2021 10:50
@sseshasa (Contributor Author) commented:

Rados Suite Results From Teuthology:
https://pulpito.ceph.com/sseshasa-2021-07-14_10:37:09-rados-wip-sseshasa-testing-2021-07-14-1320-distro-basic-smithi/

Re-Run failed and dead jobs after latest push (July 16, 2021):
https://pulpito.ceph.com/sseshasa-2021-07-16_13:48:47-rados-wip-sseshasa-testing-2021-07-16-1632-distro-basic-smithi/
NOTE: Teuthology didn't schedule a small subset of the failed jobs.

Log analysis summary of the failures from first run (sseshasa-2021-07-14_10:37:09-rados-wip-sseshasa-testing-2021-07-14-1320-distro-basic-smithi):
Failed Jobs:

  1. JobID:6269914 - One OSD didn't come up after attempting to bring it up with valgrind options. Ran the test twice more and both attempts passed.

  2. JobID:6269984 - Cephadm related. Raised https://tracker.ceph.com/issues/51713

  3. JobID:6269996 - Mon Thrash Test - Failed due to quorum loss. Passed in the re-run above (See JobID: 6275978)

  4. JobID:6270005 - Assertion failure on osd.3 similar to https://tracker.ceph.com/issues/45702. Didn't get scheduled by teuthology in the re-run.
    2021-07-14T11:45:06.365 INFO:tasks.ceph.osd.3.smithi039.stderr:/build/ceph-17.0.0-6002-gf81345ca/src/osd/PGLog.h: 1553: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

  5. JobID:6270035 - Similar to https://tracker.ceph.com/issues/47025. Failed again in the re-run.
    (rados/test.sh: api_watch_notify_pp LibRadosWatchNotifyECPP.WatchNotify failed)

  6. JobID:6270096 - Failed due to SELinux denials issue. Didn't get scheduled by teuthology in the re-run.

  7. JobID:6270109 - Failed due to "No module named 'tasks'". Didn't get scheduled by teuthology in the re-run.

  8. JobID:6270189 - Cephadm related - "qa/workunits/cephadm/test_dashboard_e2e.sh". Didn't get scheduled by teuthology in the re-run.

  9. JobID:6270208 - Similar to https://tracker.ceph.com/issues/45423. Didn't get scheduled by teuthology in the re-run.
    (api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite)

  10. JobID:6270238 - standalone/scrub test failures - similar to https://tracker.ceph.com/issues/49961
    Passed in re-run - See https://pulpito.ceph.com/sseshasa-2021-07-16_14:10:48-rados:standalone-wip-sseshasa-testing-2021-07-16-1632-distro-basic-smithi/

All other failures related to "No module named 'tasks.ceph' " passed in the re-run.

Dead Job:

  1. JobID:6270179 - Assertion failure in /src/mon/LogMonitor.cc:457: void LogMonitor::log_external_backlog(): Assertion `external_log_to <= get_last_committed()' failed. Didn't get scheduled by teuthology in the re-run.
    (Known issue. Tracker unknown. Understand that a PR is out to fix this.)

Add method mon_cmd_set_config() to save config option key and
value to the MON store. The ConfigMonitor command, 'config set' is
used to achieve this.

A corresponding get method is unnecessary since any config option
found on the MON store is loaded during OSD boot-up and set using
the md_config_t::set_mon_vals() method. Therefore, the existing
versions of ConfigProxy::get_val() method are sufficient to get
the latest value for the config option.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
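
As a rough illustration of what a helper like mon_cmd_set_config() has to assemble, the sketch below builds a JSON payload for the MON 'config set' command. The function name build_config_set_cmd is hypothetical, and the field names ("prefix", "who", "name", "value") are an assumption modelled on the CLI form `ceph config set <who> <name> <value>`; the real method also has to send the command to the monitor, which is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the method added by this PR. It only builds the
// command text. The JSON field names are an assumption based on the
// 'config set' CLI arguments. No escaping is performed, so key and val must
// be plain tokens without quotes or backslashes.
std::string build_config_set_cmd(int osd_id,
                                 const std::string& key,
                                 const std::string& val) {
  return std::string("{")
      + "\"prefix\": \"config set\", "
      + "\"who\": \"osd." + std::to_string(osd_id) + "\", "
      + "\"name\": \"" + key + "\", "
      + "\"value\": \"" + val + "\""
      + "}";
}
```

Because values found on the MON store are applied to the OSD's config at boot, a sketch like this needs no matching "get" helper, which is the point made in the commit message above.
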
sseshasa added 9 commits July 30, 2021 18:16
…tion

Add wrapper method "get_val_default()" to the ConfigProxy class that takes
the config option key to search. This method in turn calls another method
with the same name added to md_config_t class that does the actual work of
searching for the config option. If the option is valid, _get_val_default()
is used to get the default value. Otherwise, the wrapper method returns
std::nullopt.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
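
The lookup contract described above (return the default for a valid option, std::nullopt otherwise) can be sketched with a toy class. ToyConfig and its string-valued map are assumptions made for illustration; they are not Ceph's ConfigProxy or md_config_t, and the default value used in any example is made up.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for an option schema. Not Ceph's md_config_t.
class ToyConfig {
 public:
  explicit ToyConfig(std::map<std::string, std::string> defaults)
      : defaults_(std::move(defaults)) {}

  // Returns the default value registered for `key`, or std::nullopt when
  // `key` is not a known option.
  std::optional<std::string> get_val_default(const std::string& key) const {
    auto it = defaults_.find(key);
    if (it == defaults_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  std::map<std::string, std::string> defaults_;
};
```

Returning std::optional lets the caller distinguish "unknown option" from "option whose default happens to be empty", which a plain string return could not do.
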
Use "mon_cmd_set_config()" to store the OSD's max iops capacity to
the MON store during the first bring-up. Don't run the OSD benchmark
test on subsequent boot-ups if a previously persisted iops capacity is
available on the MON store and is different from the default iops
capacity.

Add the 'force_run_benchmark' flag to force a run of the benchmark
in case the default iops capacity cannot be determined.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
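
One way to read the decision described above is as a small predicate. The sketch below is an interpretation of the commit message, not the code from the PR: should_run_benchmark and its parameter names are hypothetical, and exact floating-point comparison is used only to keep the example short.

```cpp
#include <cassert>
#include <optional>

// default_iops: result of a get_val_default()-style lookup, empty if the
//               default could not be determined.
// current_iops: value in effect after MON-store values were applied at boot.
// force_run_benchmark: explicit request to run the benchmark regardless.
bool should_run_benchmark(std::optional<double> default_iops,
                          double current_iops,
                          bool force_run_benchmark) {
  if (!default_iops) {
    // Without a default there is nothing to compare against, so run.
    return true;
  }
  if (force_run_benchmark) {
    return true;
  }
  // A value equal to the default means nothing was persisted earlier.
  return current_iops == *default_iops;
}
```

With a made-up default of 10000, a first boot (current value still 10000) runs the benchmark, while a later boot that sees a persisted 315 skips it.
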
The new config option "osd_mclock_force_run_benchmark_on_init" is
introduced to allow a user to force run the OSD benchmark test on every
OSD boot-up even if the historical data about the OSD's iops capacity is
available on the MON config store. The 'force_run_benchmark' flag is set
to the value indicated by the new config option.

By default this new config option is set to false.

The utility of this option is to help refresh the OSD iops capacity
when the underlying device's performance characteristics have changed
significantly. In such cases, the OSD can be restarted with this option
enabled temporarily. Once the new iops capacity is updated to the MON
store, this option can be removed from the OSD's start-up config.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
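
The refresh workflow described above can be walked through with a toy model of successive boots. Everything in this sketch is an assumption for illustration: ToyOsd, its in-memory mon_store map, the made-up default of 10000, and the measured_iops argument that stands in for a benchmark result. Only the two option names come from this PR's discussion.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of one OSD and its slice of the MON config store.
struct ToyOsd {
  std::map<std::string, double> mon_store;  // persisted values
  double default_iops = 10000.0;            // made-up default capacity
  int bench_runs = 0;                       // number of benchmark runs so far

  // Simulates one boot. `measured_iops` stands in for the benchmark result.
  void boot(bool osd_mclock_force_run_benchmark_on_init,
            double measured_iops) {
    const std::string key = "osd_mclock_max_capacity_iops";
    auto it = mon_store.find(key);
    const bool have_persisted =
        it != mon_store.end() && it->second != default_iops;
    if (osd_mclock_force_run_benchmark_on_init || !have_persisted) {
      ++bench_runs;
      mon_store[key] = measured_iops;  // persist the fresh measurement
    }
  }
};
```

In this model, booting with the option off, off, on, off runs the benchmark on the first and third boots only, and the value stored by the forced run is the one that later boots reuse.
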
List of changes:

1. Remove the enforcement to use osd_op_queue=wpq when an osd is brought
   up in the following functions:
   - run_osd()
   - run_osd_filestore() and
   - activate_osd()

2. New functions:
   - get_op_scheduler() - Get the current osd_op_queue for an osd.

3. Modified test cases:
   - test_run_osd() - Add check for osd_max_backfills count.
     The mclock scheduler overrides the count to 1000.

4. New test cases:
   - test_activate_osd_after_mark_down()
   - test_get_op_scheduler()

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:
1. osd-recovery-prio.sh:
   Set osd_op_queue = wpq for all tests since mclock
   doesn't consider recovery priority as part of its
   scheduling algorithm.

2. osd-recovery-stats.sh:
   a. TEST_recovery_undersized():
     - Set osd_mclock_profile to high_recovery_ops profile.
     - Increase wait for recovery timeout to 300 secs.

3. osd-rep-recov-eio.sh:
   a. TEST_rep_backfill_unfound():
     - Set osd_mclock_profile to high_recovery_ops profile.
     - Increase wait for backfill_unfound to 360 secs.

4. repeer-on-acting-back.sh:
   a. TEST_repeer_on_down_act():
     - Set osd_mclock_profile to high_recovery_ops profile.
       (To improve the test duration)

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:

1. osd-backfill-prio.sh:
  Set osd_op_queue = wpq for all tests since mclock doesn't
  consider recovery priority as part of its scheduling algorithm.

2. osd-backfill-space.sh:
  Set osd_mclock_profile to high_recovery_ops and increase the wait
  for backfills timeout to 1200 secs for the following tests:
  - TEST_backfill_test_simple()
  - TEST_backfill_test_multi()
  - TEST_backfill_test_sametarget()
  - TEST_backfill_multi_partial()
  - TEST_ec_backfill_simple()
  - TEST_ec_backfill_multi()
  - SKIP_TEST_ec_backfill_multi_partial()

3. osd-backfill-stats.sh:
  - TEST_backfill_ec_down_all_out():
   Set osd_mclock_profile to high_recovery_ops and increase the wait
   for recovery timeout to 240 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…duler

Modified test cases:

1. test-erasure-eio.sh:
  a. TEST_ec_backfill_unfound():
    - Set osd_mclock_profile to high_recovery_ops profile.
    - Increase the wait for backfill_unfound timeout to 240 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…uler

The following tests in the test files mentioned below use the
"osd_scrub_sleep" option to introduce delays during scrubbing to help
determine scrubbing states, validate reservations during scrubbing, etc.
This works when using the "wpq" scheduler.

But when the "mclock_scheduler" is enabled, the "osd_scrub_sleep" is
disabled and overridden to 0. This is done to delegate the scheduling of
the background scrubs to the "mclock_scheduler" based on the set QoS
parameters. Due to this, the checks to verify the scrub states,
reservations etc. fail since the window to check them is very short
due to scrubs completing very quickly. This affects a small subset of
scrub tests mentioned below:

1. osd-scrub-dump.sh -> TEST_recover_unexpected()
2. osd-scrub-repair.sh -> TEST_auto_repair_bluestore_tag()
3. osd-scrub-test.sh -> TEST_scrub_abort(), TEST_deep_scrub_abort()

Only for the above tests, until there's a reliable way to query scrub
states with "--osd-scrub-sleep" set to 0, the "osd_op_queue" config
option is set to "wpq".

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
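
The overrides mentioned in these commit messages (scrub sleep forced to 0 and the backfill count raised to 1000 when mclock_scheduler is active) can be pictured as a function from configured values to effective values. This is a toy model: ToyOsdOptions, effective_options, and the struct's initial values are assumptions for illustration and do not reflect where or how Ceph applies the overrides.

```cpp
#include <cassert>
#include <string>

// Toy subset of OSD options. The initial values here are placeholders.
struct ToyOsdOptions {
  std::string osd_op_queue = "wpq";
  double osd_scrub_sleep = 0.0;
  int osd_max_backfills = 1;
};

// Returns the values in effect once scheduler-specific overrides are applied.
ToyOsdOptions effective_options(ToyOsdOptions configured) {
  if (configured.osd_op_queue == "mclock_scheduler") {
    configured.osd_scrub_sleep = 0.0;     // sleep-based throttling is disabled
    configured.osd_max_backfills = 1000;  // count check added to test_run_osd()
  }
  return configured;
}
```

A test that sets a scrub sleep of several seconds therefore sees an effective sleep of 0 under mclock_scheduler, which is why the affected scrub tests pin osd_op_queue to wpq.
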
…timeout

Modified test cases:

1. ver-health.sh:
  a. TEST_check_version_health_1():
    To avoid intermittent timeouts observed in wait_for_health_string(),
    increase the wait time to 20 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa sseshasa force-pushed the wip-persist-osd-iops-cap-mclock branch from b9e7c8a to 464e9ea Compare July 30, 2021 12:46
@sseshasa (Contributor Author) commented:

jenkins test make check

@sseshasa (Contributor Author) commented:

jenkins test api

@sseshasa (Contributor Author) commented:

jenkins test make check arm64

@sseshasa (Contributor Author) commented:

jenkins test make check

@sseshasa (Contributor Author) commented:

jenkins test make check arm64

@neha-ojha neha-ojha merged commit bd309c2 into ceph:master Jul 30, 2021
@sseshasa sseshasa deleted the wip-persist-osd-iops-cap-mclock branch August 2, 2021 05:02