
pacific: test: fix TierFlushDuringFlush to wait until dedup_tier is set on base pool #46748

Merged

yuriw merged 2 commits into ceph:pacific from myoungwon:pacific-53855 on Aug 9, 2022
Conversation

@myoungwon (Member) commented Jun 20, 2022

backport tracker: https://tracker.ceph.com/issues/56099, https://tracker.ceph.com/issues/56656


backport of #45035, #46866
parent tracker: https://tracker.ceph.com/issues/53855, https://tracker.ceph.com/issues/53294

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

test: fix TierFlushDuringFlush to wait until dedup_tier is set on base pool

When start_dedup() is called before the dedup_tier has been set on the base pool,
it is not possible to know the target pool of the chunk object.

1. The user sets the dedup_tier on a base pool via mon_command().
2. The user issues tier_flush on an object that has a manifest (base pool)
  before the dedup_tier is applied on the base pool.
3. The OSD calls start_dedup() to flush the chunk objects to the chunk pool.
4. The OSD calls get_dedup_tier() to obtain the chunk pool of the base pool,
  but the chunk pool cannot be determined.
5. get_dedup_tier() returns 0 because the setting is not yet applied on the base pool.
6. This makes refcount_manifest() lose its way to the chunk pool.

To prevent this issue, start_dedup() has to be called after dedup_tier is set
on the base pool. To do so, this commit prohibits getting chunk pool id if
dedup_tier is not set.

Fixes: http://tracker.ceph.com/issues/53855

Signed-off-by: Sungmin Lee <sung_min.lee@samsung.com>
(cherry picked from commit 66ad91e)
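The guard described in the commit message above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the actual Ceph code: `PoolOpts`, `get_dedup_tier()`, and `start_dedup()` here are stand-ins that only model the "distinguish unset from a real pool id" behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative stand-in for the pool options: dedup_tier defaults to 0
// until the user's mon_command() has actually been applied.
struct PoolOpts {
  int64_t dedup_tier = 0;  // 0 means "not set yet"
  bool is_set() const { return dedup_tier > 0; }
};

// Before the fix, an unset tier was returned as pool id 0 and callers
// treated it as a valid target. After the fix, "no tier" is reported
// explicitly so dedup cannot start against a nonexistent pool.
std::optional<int64_t> get_dedup_tier(const PoolOpts& opts) {
  if (!opts.is_set()) {
    return std::nullopt;  // caller must not start dedup yet
  }
  return opts.dedup_tier;
}

bool start_dedup(const PoolOpts& opts) {
  auto tier = get_dedup_tier(opts);
  if (!tier) {
    return false;  // bail out; the op can be retried once dedup_tier is applied
  }
  // ... flush chunk objects to pool *tier ...
  return true;
}
```

With this shape, the test-side fix in the PR title amounts to waiting until `is_set()` holds before issuing tier-flush.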
@myoungwon myoungwon requested a review from a team as a code owner June 20, 2022 07:37
@github-actions github-actions bot added this to the pacific milestone Jun 20, 2022
@ljflores (Member) left a comment
@myoungwon please have a look. This was caught on a test branch that contains your fix.

/a/yuriw-2022-07-05_17:38:43-rados-wip-yuri4-testing-2022-07-05-0719-pacific-distro-default-smithi/6915020

2022-07-06T02:26:55.749 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:55.748+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:56.777 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:56.763+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:57.748 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:57.747+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ cleanup
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ pkill -P 104256
2022-07-06T02:26:58.780 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-06T02:26:58.781 INFO:tasks.workunit.client.0.smithi191.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 104457 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-07-06T02:26:58.782 INFO:tasks.workunit.client.0.smithi191.stderr:++ true
2022-07-06T02:26:58.783 INFO:tasks.workunit.client.0.smithi191.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.CacheP
2022-07-06T02:26:58.783 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-06T02:26:58.785 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-06T02:26:58.797 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:58.796+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:59.053 ERROR:teuthology.run_tasks:Saw exception from tasks.

This tracker is fixing an issue with LibRadosTwoPoolsPP.TierFlushDuringFlush, but I see that this failure was caught on LibRadosTwoPoolsPP.CachePin. We have been treating the failures on TierFlushDuringFlush and CachePin as the same, but I wonder if these are two separate issues (see https://tracker.ceph.com/issues/53855 for both examples).

All in all, we should make sure that this PR is truly fixing the issue before we move forward with the backport.

@ljflores ljflores requested a review from neha-ojha July 7, 2022 20:45
@myoungwon (Member, Author)

@ljflores I observed the same issue as this before, then fixed it in #46866. It seems that #46866 needs to pass QA first, then it should be backported to pacific and quincy.

@ljflores (Member), quoting the above:

> @ljflores I observed the same issue as this before, then fixed it in #46866. It seems that #46866 needs to pass QA first, then it should be backported to pacific and quincy.

I see, thanks @myoungwon. I have approved #46866 and tagged it for needs-qa.

@ljflores (Member)

#46866 still needs a pacific backport. Then this one and the pacific backport for #46866 can be tested together.

During tier-flush, the OSD sends a reference-increase message to the target OSD.
At this point, sending the message with invalid pool information (e.g., a deleted pool)
causes unexpected behavior.

Therefore, this commit returns -ENOENT early, before sending the message.

Fixes: https://tracker.ceph.com/issues/53294

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
(cherry picked from commit 3de27b2)
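The early-out described in this commit message can be sketched as follows. This is an illustrative sketch under assumed names: `pool_exists()` stands in for an OSDMap lookup, and `refcount_manifest()` here only models the check-before-send behavior, not the real Ceph function.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Stand-in for looking the pool up in the current OSDMap; a deleted or
// never-created pool would not be found.
bool pool_exists(int64_t pool_id) {
  return pool_id > 0;  // illustrative: real code consults the OSDMap
}

// Validate the target pool *before* building and sending the
// reference-increase message, instead of letting an op addressed to an
// invalid pool cause unexpected behavior downstream.
int refcount_manifest(int64_t target_pool) {
  if (!pool_exists(target_pool)) {
    return -ENOENT;  // early out: never send toward a deleted/invalid pool
  }
  // ... send the reference-count op to the target OSD ...
  return 0;
}
```

The design point is simply that the validity check moves ahead of the network send, so the error surfaces as a clean -ENOENT to the caller.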
@myoungwon (Member, Author) commented Jul 21, 2022

@ljflores I added a commit regarding #46866 here because it causes a conflict without this PR (it depends on this PR). Is that OK?

@ljflores (Member)

@myoungwon Yes, this is okay. I meant for those two backports to be tested together, and adding the commit here fulfills that. Thanks!

@yuriw yuriw merged commit bf8ad0d into ceph:pacific Aug 9, 2022