OSD: PG stat is not synchronized between osds after deep-scrub#57582
sajibreadd merged 1 commit into ceph:main
Conversation
@ronen-fr Can you take a look?
@athanatos - I'll be back near a laptop only on May 30. But I'm glad we now have an explanation for this bug.
@ronen-fr: just a friendly mention to keep this on your radar.
src/osd/PeeringState.cc (outdated)
      dirty_big_info = true;
    }
    info.stats.stats.sum = oinfo.stats.stats.sum;
    pl->publish_stats_to_osd();
Not sure I understand how publish_stats_to_osd(), which starts with
    void PG::publish_stats_to_osd()
    {
      if (!is_primary())
        return;
has any effect when called after line 3031 above.
You are right, we don't need to publish it immediately (publishing only works on the primary OSD). Whenever the primary OSD goes down and a secondary OSD takes over, the new primary will automatically publish the stats from the recovery_state. I adjusted the code so that the old code stays intact.
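To make the point of this thread concrete, here is a minimal sketch of the guard being discussed. ToyPG and ToyStats are hypothetical stand-ins, not Ceph's real classes; only the early return on !is_primary() is taken from the snippet quoted above. It shows that a publish call on a replica is a no-op, and that the same call takes effect once that replica has become primary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a PG's stat counters (not Ceph's real type).
struct ToyStats {
  int64_t num_objects;
  int64_t num_bytes;
};

// Hypothetical stand-in for a PG; only the primary guard mirrors the
// code quoted in the review comment.
class ToyPG {
public:
  ToyPG(bool primary, ToyStats stats)
    : primary_(primary), stats_(stats), has_published_(false), published_{0, 0} {}

  // Same shape as the quoted guard: replicas return before publishing.
  void publish_stats_to_osd() {
    if (!is_primary())
      return;
    published_ = stats_;
    has_published_ = true;
  }

  bool is_primary() const { return primary_; }
  void set_primary(bool p) { primary_ = p; }  // models a takeover
  bool has_published() const { return has_published_; }
  const ToyStats& published() const { return published_; }

private:
  bool primary_;
  ToyStats stats_;
  bool has_published_;
  ToyStats published_;
};
```

In this toy model, what matters on a replica is that its local copy of the stats is correct at the moment it becomes primary; calling publish earlier changes nothing.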
ronen-fr left a comment:
Please see my question in the comment.
jenkins test make check
jenkins test make check arm64
ljflores left a comment:
@sajibreadd, the commit should have src/osd: in front of the title.
Hi @sajibreadd, this PR was tested in teuthology, and I found a few regressions. PTAL when you can.
/a/yuriw-2024-07-17_13:32:02-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7806002
/a/yuriw-2024-07-17_13:32:02-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7805995
See this link for the full test run:
See this link for example successful test runs (w/o your PR included):
Testing ref: https://tracker.ceph.com/issues/66706
@yuriw please do not merge this until the author can take another look.
Probably could be the reason, as I
…primary osd is killed, next primary osd has wrong stats. Reason behind it is PeeringState::proc_primary_info does not process or update any pg stats.
Fixes: https://tracker.ceph.com/issues/66059
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
jenkins test api
@ljflores - can we have another QA round, please?
@sajibreadd yeah this will go into a future batch.
@sajibreadd, @ljflores: tests continue to fail in the same pattern you have mentioned above. Seems we made a mistake in approving this PR.
Sure
@ronen-fr Should we call
Ignore this comment,
PG stats are not synced between OSDs after a deep-scrub. So if the primary OSD is killed, the next primary OSD has wrong stats. The reason is that PeeringState::proc_primary_info does not process or update the PG stats of secondary OSDs after a deep-scrub.
Fixes: https://tracker.ceph.com/issues/66059
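The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the merged patch: ToyObjectSum, ToyPGInfo, ToyReplica, on_primary_info and the sync_stats flag are all hypothetical names. The one line borrowed from the diff under review is the idea of copying the primary's stat sum into the replica's local info.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stat sum (Ceph's real structure has many more fields).
struct ToyObjectSum {
  int64_t num_objects = 0;
  int64_t num_bytes = 0;
};

inline bool operator==(const ToyObjectSum& a, const ToyObjectSum& b) {
  return a.num_objects == b.num_objects && a.num_bytes == b.num_bytes;
}

// Hypothetical stand-in for the info a primary sends to its replicas.
struct ToyPGInfo {
  ToyObjectSum sum;
};

// Hypothetical replica-side state.
struct ToyReplica {
  ToyPGInfo info;
  bool dirty_info = false;

  // Toy handler for an info message from the primary.
  // sync_stats == false models the behaviour described in the PR text:
  // the replica ignores the primary's stats and keeps its stale copy.
  // sync_stats == true models the proposed idea: adopt the primary's sum.
  void on_primary_info(const ToyPGInfo& oinfo, bool sync_stats) {
    if (!sync_stats)
      return;
    if (!(info.sum == oinfo.sum)) {
      info.sum = oinfo.sum;  // analogous to info.stats.stats.sum = oinfo.stats.stats.sum
      dirty_info = true;     // mark local info as needing to be persisted
    }
  }
};
```

With sync_stats off, a replica that later takes over as primary would report whatever stale sum it held before the deep-scrub corrected the primary's numbers; with it on, the replica's sum matches the primary's at takeover time.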