mds: scrub repair does not clear earlier damage health status#48895

Merged
vshankar merged 2 commits into ceph:main from neesingh-rh:issue_54557
Dec 13, 2023

Conversation

@neesingh-rh
Contributor

@neesingh-rh neesingh-rh commented Nov 15, 2022

Fixes: https://tracker.ceph.com/issues/54557
Signed-off-by: Neeraj Pratap Singh neesingh@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@neesingh-rh neesingh-rh marked this pull request as draft November 15, 2022 13:02
@github-actions github-actions bot added the cephfs (Ceph File System) and tests labels Nov 15, 2022
@neesingh-rh neesingh-rh marked this pull request as ready for review November 16, 2022 06:32
@neesingh-rh neesingh-rh requested a review from a team November 16, 2022 12:10
@dparmar18
Contributor

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@neesingh-rh
Contributor Author

neesingh-rh commented Nov 16, 2022

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@dparmar18 Setting the repaired flag to true makes ScrubStack.cc realise that the backtrace is repaired, and hence the logs show "Inode Repaired", but it does not clear the earlier entry from the damage table. That's why we currently have to erase it manually with the damage rm command, which shouldn't be necessary. That's the issue this PR fixes.

@dparmar18
Contributor

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@dparmar18 Setting the repaired flag to true makes ScrubStack.cc realise that the backtrace is repaired, and hence the logs show "Inode Repaired", but it does not clear the earlier entry from the damage table. That's why we currently have to erase it manually with the damage rm command, which shouldn't be necessary. That's the issue this PR fixes.

Ah, that comment is a bit misleading. Thanks for the explanation!

@dparmar18
Contributor

I'm fine with the current patch too, but I was just thinking we could simplify this:

DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@neesingh-rh
Contributor Author

neesingh-rh commented Nov 17, 2022

I'm fine with the current patch too, but I was just thinking we could simplify this:

DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@dparmar18 Yeah, it looks simpler. I had thought about this, but went with the current approach just to avoid confusion between two erase-like functions in CInode.cc. (I don't have a strong opinion.) If you think the second approach is better, I'll update. What do you say?

@dparmar18
Contributor

I'm fine with the current patch too, but I was just thinking we could simplify this:
DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@dparmar18 Yeah, it looks simpler. I had thought about this, but went with the current approach just to avoid confusion between two erase-like functions in CInode.cc. (I don't have a strong opinion.) If you think the second approach is better, I'll update. What do you say?

@neesingh-rh This would introduce one more erase() call in CInode.cc, right? Doesn't that lead to more confusion? Or did I misunderstand?

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 2 times, most recently from 469397f to 08e05a3 Compare December 5, 2022 09:17
@kotreshhr
Contributor

@neesingh-rh No test for this?

@neesingh-rh
Contributor Author

@neesingh-rh No test for this?

Will update with the tests soon.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 4 times, most recently from 38d714e to ada891a Compare December 6, 2022 07:08
@neesingh-rh neesingh-rh requested a review from vshankar November 23, 2023 12:20
@vshankar
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

@neesingh-rh
Contributor Author

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing.
There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.
  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.
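To illustrate the second point, here is a small hypothetical helper (invented for illustration; the flag names follow the MDS scrub options, but the function itself is not part of the test suite) showing how the flag set passed to scrub start changes between the two passes:

```python
# Hypothetical helper for illustration: build the argument string for an
# MDS "scrub start" command from a set of scrub options.
def scrub_start_args(path="/", repair=False, force=False):
    flags = ["recursive"]
    if repair:
        flags.append("repair")
    if force:
        # "force" makes the MDS re-scrub even if nothing has changed since
        # the last scrub, which is what the second repair pass needs.
        flags.append("force")
    return f"scrub start {path} {','.join(flags)}"

# First pass repairs the backtrace; the second pass needs "force" to run at all.
first_pass = scrub_start_args(repair=True)             # scrub start / recursive,repair
second_pass = scrub_start_args(repair=True, force=True)  # scrub start / recursive,repair,force
```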

@dparmar18
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.

@neesingh-rh
Contributor Author

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

We can, but if there's no harm, let's stick with this.

  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.

@vshankar
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

We can, but if there's no harm, let's stick with this.

I'm with @dparmar18 on this one. Using wait_until_true() is preferred over sleep.
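For reference, a generic polling helper along these lines could look like the following minimal sketch (the actual qa-suite helper may differ in signature and behaviour):

```python
import time

def wait_until_true(condition, timeout, interval=1):
    """Poll `condition` until it returns True, or raise after `timeout` seconds."""
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
        elapsed += interval

# Example: wait for a health flag that another task sets asynchronously
# (simulated here with a plain dict instead of a real cluster health report).
health = {"MDS_DAMAGE": False}
health["MDS_DAMAGE"] = True  # in the real test, the MDS would raise this warning
wait_until_true(lambda: health["MDS_DAMAGE"], timeout=30)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly with a clear error if it never does.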

@vshankar
Contributor

Otherwise LGTM.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 3 times, most recently from df1c33b to e67b674 Compare November 28, 2023 10:42
@neesingh-rh
Contributor Author

Wait, it's failing after the latest changes.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 2 times, most recently from 17c45c8 to 3ae9941 Compare November 28, 2023 12:37
@vshankar
Contributor

jenkins test windows

@neesingh-rh
Contributor Author

Run link: https://pulpito.ceph.com/neesingh-2023-11-29_05:06:25-fs-wip-neesingh-testing-231123-distro-default-smithi/

Contributor

@dparmar18 dparmar18 left a comment

LGTM

@dparmar18
Contributor

jenkins test make check arm64

@vshankar
Contributor

vshankar commented Dec 7, 2023

Contributor

@vshankar vshankar left a comment

