Skip to content

objecter: request OSDMap after idle ticks#63425

Merged
NitzanMordhai merged 2 commits intoceph:mainfrom
NitzanMordhai:wip-nitzan-osdmap-incremental-process-idle
Jun 23, 2025
Merged

objecter: request OSDMap after idle ticks#63425
NitzanMordhai merged 2 commits intoceph:mainfrom
NitzanMordhai:wip-nitzan-osdmap-incremental-process-idle

Conversation

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai commented May 22, 2025

If the objecter goes two ticks without receiving an OSDMap, it may fall behind on incremental updates. Call maybe_request_osdmap() in the tick handler whenever we detect that two consecutive ticks have elapsed without a fresh map, so we minimize the time needed to catch up on missed changes.

Fixes: https://tracker.ceph.com/issues/71261

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds logic to re-request an OSDMap if the objecter goes two consecutive ticks without receiving a fresh map by tracking when the last map was requested.

  • Introduce last_osdmap_received timestamp in Objecter
  • Update timestamp in _maybe_request_map()
  • In tick(), check for stale map after two intervals and re-request if needed

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/osdc/Objecter.h Added last_osdmap_received member to track map timing
src/osdc/Objecter.cc Set timestamp on map request and stale-check in tick()
Comments suppressed due to low confidence (2)

src/osdc/Objecter.cc:2248

  • [nitpick] The name window is generic and may be unclear in context. Consider a more descriptive name like stale_window or request_threshold.
auto window = ceph::make_timespan(cct->_conf->objecter_tick_interval) * 2;

src/osdc/Objecter.cc:2245

  • The new stale-check path in tick() isn't covered by existing tests. Consider adding unit or integration tests to verify that an OSDMap is re-requested after two idle ticks.
} else if (last_osdmap_received != ceph::coarse_mono_clock::time_point()) {

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-osdmap-incremental-process-idle branch from 95d9df6 to 025a984 Compare May 22, 2025 11:30
@cbodley
Copy link
Contributor

cbodley commented May 22, 2025

thanks @NitzanMordhai!

this bug was initially caught by an api test timeout, and we temporarily disabled the failing test case in #63126. can you please reenable it in this pr so we can verify that it reduces the request latency?

-@skip_unless_dashboard_pr
 class RgwSiteTest(RgwTestCase):

cc @nizamial09 @epuertat

double thresh_s = std::chrono::duration<double>(stale_window).count();
ldout(cct, 10) << __func__ << ": osdmap stale: " << elapsed_s
<< "s > " << thresh_s << "s, maybe requesting map" << dendl;
_maybe_request_map();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very like the high-level idea of spreading the work across time, even if (slightly) more needs to be done.

@NitzanMordhai
Copy link
Contributor Author

thanks @NitzanMordhai!

this bug was initially caught by an api test timeout, and we temporarily disabled the failing test case in #63126. can you please reenable it in this pr so we can verify that it reduces the request latency?

-@skip_unless_dashboard_pr
 class RgwSiteTest(RgwTestCase):

cc @nizamial09 @epuertat

sure, will revert the commit

@nizamial09
Copy link
Member

nizamial09 commented May 26, 2025

sure, will revert the commit

@NitzanMordhai, thanks for the fix. Btw, reverting the dashboard commit will make the decorator go away which we don't want because that's being reused in other parts as well. So let's just only remove the decorator above the class RgwSiteTest(RgwTestCase): which will enable the tests.

If the objecter goes two ticks without receiving an OSDMap, it may
fall behind on incremental updates.  Call `maybe_request_osdmap()` in
the tick handler whenever we detect that two consecutive ticks have
elapsed without a fresh map, so we minimize the time needed to catch
up on missed changes.

Fixes: https://tracker.ceph.com/issues/71261
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-osdmap-incremental-process-idle branch from ae84da7 to 9e35235 Compare May 27, 2025 05:36
@NitzanMordhai
Copy link
Contributor Author

@nizamial09 done

@cbodley
Copy link
Contributor

cbodley commented May 27, 2025

radosgw.8000.log shows very low latency👍

2025-05-27T06:50:19.055+0000 7f11e016b640 1 ====== req done req=0x7f10cc723248 op=list_realms bucket= status=0 http_status=200 latency=0.040000789s request_id=tx00000b4787514cebe5d60-00683560ab-4550-default ======

will try again a couple times to make sure it's consistent

@cbodley
Copy link
Contributor

cbodley commented May 27, 2025

jenkins test api

@cbodley
Copy link
Contributor

cbodley commented May 27, 2025

https://jenkins.ceph.com/job/ceph-api/96543/artifact/build/out/radosgw.8000.log

2025-05-27T16:35:35.899+0000 7f8e462d6640 1 ====== req done req=0x7f8d450b3248 op=list_realms bucket= status=0 http_status=200 latency=0.043998569s request_id=tx00000ff47f5009d302d47-006835e9d7-4554-default ======

@NitzanMordhai
Copy link
Contributor Author

jenkins test make check

@NitzanMordhai
Copy link
Contributor Author

jenkins test make check arm64

@nizamial09
Copy link
Member

@NitzanMordhai dashboard lint error is a valid one. Looks like the unused import

CHAR-0x1B][1;32m1[NON-XML-CHAR-0x1B][0m: [NON-XML-CHAR-0x1B][1;31mF401[NON-XML-CHAR-0x1B][0m '.helper.skip_unless_dashboard_pr' imported but unused
1     F401 '.helper.skip_unless_dashboard_pr' imported but unused
lint: exit 1 (2.73 seconds) /home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard> flake8 --config=tox.ini 

@nizamial09
Copy link
Member

@NitzanMordhai ping^

@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai ping^

thanks, will handle it now

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-osdmap-incremental-process-idle branch from 9e35235 to 54df886 Compare June 3, 2025 10:03
@NitzanMordhai
Copy link
Contributor Author

@nizamial09 done

@github-project-automation github-project-automation bot moved this from New to Reviewer approved in Ceph-Dashboard Jun 3, 2025
@nizamial09
Copy link
Member

nizamial09 commented Jun 4, 2025

make check is failing on a random error I guess but the arm64 is still failing on lint

lint: commands[3] /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/pybind/mgr/dashboard> isort ../../../../qa/tasks/mgr/dashboard --check
ERROR: /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/qa/tasks/mgr/dashboard/test_rgw.py Imports are incorrectly sorted and/or formatted.
lint: exit 1 (6.38 seconds) /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/pybind/mgr/dashboard> isort 

might be easier if you run tox -e fix from pybind/mgr/dashboard or simply isort qa/tasks/mgr/dashboard/test_rgw.py from the ceph directory which will fix it.

You can verify it by running tox -e lint from the dashboard folder.

Fixes: https://tracker.ceph.com/issues/71261
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-osdmap-incremental-process-idle branch from 54df886 to adaab09 Compare June 4, 2025 11:33
@github-actions
Copy link

github-actions bot commented Jun 4, 2025

Config Diff Tool Output

- removed: breakpad
! changed: osd_deep_scrub_interval: old: Deep scrub each PG (i.e., verify data checksums) at least this often. Note that this option must be set at ``global`` scope, or for both ``mgr`` and``osd``.
! changed: osd_deep_scrub_interval: new: Deep scrub each PG (i.e., verify data checksums) at least this often
! changed: osd_scrub_min_interval: old: The desired interval between scrubs of a specific PG. Note that this option must be set at ``global`` scope, or for both ``mgr`` and``osd``.
! changed: osd_scrub_min_interval: new: The desired interval between scrubs of a specific PG.
! changed: osd_scrub_max_interval: old: Scrub each PG no less often than this interval. Note that this option must be set at ``global`` scope, or for both ``mgr`` and``osd``.
! changed: osd_scrub_max_interval: new: Scrub each PG no less often than this interval
! changed: osd_deep_scrub_interval_cv: old: deep scrub intervals are varied by a random amount to prevent stampedes. This parameter determines the amount of variation. Technically ``osd_deep_scrub_interval_cv`` is the coefficient of variation for the deep scrub interval.
! changed: osd_deep_scrub_interval_cv: new: deep scrub intervals are varied by a random amount to prevent stampedes. This parameter determines the amount of variation. Technically - osd_deep_scrub_interval_cv is the coefficient of variation for the deep scrub interval.
! changed: osd_deep_scrub_interval_cv: old: The coefficient of variation for the deep scrub interval, specified as a ratio. On average, the next deep scrub for a PG is scheduled osd_deep_scrub_interval after the last deep scrub . The actual time is randomized to a normal distribution with a standard deviation of osd_deep_scrub_interval * osd_deep_scrub_interval_cv (clamped to within 2 standard deviations). The default value guarantees that 95% of deep scrubs will be scheduled in the range [0.8 * osd_deep_scrub_interval, 1.2 * osd_deep_scrub_interval].
! changed: osd_deep_scrub_interval_cv: new: The coefficient of variation for the deep scrub interval, specified as a ratio. On average, the next deep scrub for a PG is scheduled osd_deep_scrub_interval after the last deep scrub . The actual time is randomized to a normal distribution with a standard deviation of osd_deep_scrub_interval * osd_deep_scrub_interval_cv (clamped to within 2 standard deviations). The default value guarantees that 95% of the deep scrubs will be scheduled in the range [0.8 * osd_deep_scrub_interval, 1.2 * osd_deep_scrub_interval].

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.

@NitzanMordhai
Copy link
Contributor Author

@nizamial09 thanks for your review and pointing all the issues, the PR is ready now, I don't think there are any errors related to it

Copy link
Member

@nizamial09 nizamial09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you too @NitzanMordhai !

@nizamial09
Copy link
Member

can we merge this or is it waiting on teuthology?

@NitzanMordhai
Copy link
Contributor Author

can we merge this or is it waiting on teuthology?

i added needs-qa, will ask that it will be picked and tested

@NitzanMordhai
Copy link
Contributor Author

@NitzanMordhai NitzanMordhai merged commit 47c922c into ceph:main Jun 23, 2025
14 of 15 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-osdmap-incremental-process-idle branch June 23, 2025 09:31
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in Ceph-Dashboard Jun 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants