mds: scrub repair does not clear earlier damage health status#48895

Merged
vshankar merged 2 commits into ceph:main from neesingh-rh:issue_54557
Dec 13, 2023

Conversation

@neesingh-rh
Contributor

@neesingh-rh neesingh-rh commented Nov 15, 2022

Fixes: https://tracker.ceph.com/issues/54557
Signed-off-by: Neeraj Pratap Singh neesingh@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@neesingh-rh neesingh-rh marked this pull request as draft November 15, 2022 13:02
@github-actions github-actions bot added the cephfs (Ceph File System) and tests labels Nov 15, 2022
@neesingh-rh neesingh-rh marked this pull request as ready for review November 16, 2022 06:32
@neesingh-rh neesingh-rh requested a review from a team November 16, 2022 12:10
@dparmar18
Contributor

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@neesingh-rh
Contributor Author

neesingh-rh commented Nov 16, 2022

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@dparmar18 Setting the repaired flag to true makes ScrubStack.cc realise that the backtrace is repaired, and hence the logs show "Inode Repaired", but it does not clear the earlier entry from the damage table. That's why we currently have to erase it manually with the damage rm command, which shouldn't be necessary. That's the issue this PR fixes.

@dparmar18
Contributor

@neesingh-rh L4781 says that setting the repaired flag to true would prevent its entry in the damage table. It seems someone addressed this issue in the past, but the inode still sneaked into the damage table?

@dparmar18 Setting the repaired flag to true makes ScrubStack.cc realise that the backtrace is repaired, and hence the logs show "Inode Repaired", but it does not clear the earlier entry from the damage table. That's why we currently have to erase it manually with the damage rm command, which shouldn't be necessary. That's the issue this PR fixes.

Ah, that comment is a bit misleading. Thanks for the explanation!

@dparmar18
Contributor

I'm fine with the current patch too, but I was just thinking we could simplify this:

DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@neesingh-rh
Contributor Author

neesingh-rh commented Nov 17, 2022

I'm fine with the current patch too, but I was just thinking we could simplify this:

DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@dparmar18 Yeah, it looks simpler. I had thought about this, but went with the current approach just to avoid confusion between two erase-like functions in CInode.cc. (I don't have a strong opinion.) If you think the second approach is better, I'll update. What do you say?

@dparmar18
Contributor

I'm fine with the current patch too, but I was just thinking we could simplify this:
DamageTable

void DamageTable::remove_damaged_entry(inodeno_t ino) {
	// guard against the inode having no damage entry
	if (auto it = remotes.find(ino); it != remotes.end()) {
		erase(it->second->id);
	}
}

CInode

mdcache->mds->damage_table.remove_damaged_entry(in->ino());

What do you think? @neesingh-rh

@dparmar18 Yeah, it looks simpler. I had thought about this, but went with the current approach just to avoid confusion between two erase-like functions in CInode.cc. (I don't have a strong opinion.) If you think the second approach is better, I'll update. What do you say?

@neesingh-rh This would introduce one more erase() call in CInode.cc, right? Doesn't that lead to more confusion? Or did I misunderstand?

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 2 times, most recently from 469397f to 08e05a3 Compare December 5, 2022 09:17
@kotreshhr
Contributor

@neesingh-rh No test for this?

@neesingh-rh
Contributor Author

@neesingh-rh No test for this?

Will update with the tests soon.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 4 times, most recently from 38d714e to ada891a Compare December 6, 2022 07:08
@neesingh-rh neesingh-rh requested a review from vshankar November 23, 2023 12:20
@vshankar
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

@neesingh-rh
Contributor Author

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing.
There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.
  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.
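To illustrate the second point, here is a small hypothetical helper (invented for illustration; the flag names follow the MDS scrub options, but the function itself is not part of the test suite) showing how the flag set passed to scrub start changes between the two passes:

```python
# Hypothetical helper for illustration: build the argument string for an
# MDS "scrub start" command from a set of scrub options.
def scrub_start_args(path="/", repair=False, force=False):
    flags = ["recursive"]
    if repair:
        flags.append("repair")
    if force:
        # "force" makes the MDS re-scrub even if nothing has changed since
        # the last scrub, which is what the second repair pass needs.
        flags.append("force")
    return f"scrub start {path} {','.join(flags)}"

# First pass repairs the backtrace; the second pass needs "force" to run at all.
first_pass = scrub_start_args(repair=True)             # scrub start / recursive,repair
second_pass = scrub_start_args(repair=True, force=True)  # scrub start / recursive,repair,force
```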

@dparmar18
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.

@neesingh-rh
Contributor Author

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

We can, but if there's no harm, let's stick with this.

  2. The second run of scrub start repair, which is responsible for removing the damage from the damage list, was being skipped. Looking at ScrubStack.cc, scrubbing is skipped when there has been no change since the last scrub, unless the header contains the force flag. Since we need to run scrub start repair a second time, I added force to the scrub start repair in test_health_status_after_backtrace_repair.

@vshankar
Contributor

I ran teuthology after doing the cleanups; here's the link: https://pulpito.ceph.com/neesingh-2023-11-23_10:04:20-fs-wip-neesingh-testing-231123-distro-default-smithi/

Could you explain which changes fix the test case failure reported in #48895 (review)?

As we discussed earlier, after debugging the code many times it seemed there was no problem in the code itself; only the test case needed fixing. There were two failures:

  1. The assert was failing because MDS_DAMAGE was not found in the get_mon_health() dict. This was solved by adding a short sleep, since adding MDS_DAMAGE to the dict takes some time.

Maybe you can use wait_until_true() with a timeout?

We can, but if there's no harm, let's stick with this.

I'm with @dparmar18 on this one. Using wait_until_true() is preferred over sleep.
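For reference, a generic polling helper along these lines could look like the following minimal sketch (the actual qa-suite helper may differ in signature and behaviour):

```python
import time

def wait_until_true(condition, timeout, interval=1):
    """Poll `condition` until it returns True, or raise after `timeout` seconds."""
    elapsed = 0
    while not condition():
        if elapsed >= timeout:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
        elapsed += interval

# Example: wait for a health flag that another task sets asynchronously
# (simulated here with a plain dict instead of a real cluster health report).
health = {"MDS_DAMAGE": False}
health["MDS_DAMAGE"] = True  # in the real test, the MDS would raise this warning
wait_until_true(lambda: health["MDS_DAMAGE"], timeout=30)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly with a clear error if it never does.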

@vshankar
Contributor

Otherwise LGTM.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 3 times, most recently from df1c33b to e67b674 Compare November 28, 2023 10:42
@neesingh-rh
Contributor Author

Wait, it's failing after the latest changes.

@neesingh-rh neesingh-rh force-pushed the issue_54557 branch 2 times, most recently from 17c45c8 to 3ae9941 Compare November 28, 2023 12:37
@vshankar
Contributor

jenkins test windows

@neesingh-rh
Contributor Author

Run link: https://pulpito.ceph.com/neesingh-2023-11-29_05:06:25-fs-wip-neesingh-testing-231123-distro-default-smithi/

Contributor

@dparmar18 dparmar18 left a comment

LGTM

@dparmar18
Contributor

jenkins test make check arm64

@vshankar
Contributor

vshankar commented Dec 7, 2023

Contributor

@vshankar vshankar left a comment

