pacific: test: fix TierFlushDuringFlush to wait until dedup_tier is set on base pool #46748
yuriw merged 2 commits into ceph:pacific
Conversation
When start_dedup() is called before dedup_tier is set on the base pool, it is not possible to know the target pool of the chunk object:

1. The user sets dedup_tier on a base pool via mon_command().
2. The user issues tier-flush on an object that has a manifest (base pool) before dedup_tier is applied to the base pool.
3. The OSD calls start_dedup() to flush the chunk objects to the chunk pool.
4. The OSD calls get_dedup_tier() to get the chunk pool of the base pool, but the chunk pool cannot be determined.
5. get_dedup_tier() returns 0 because the setting has not yet been applied to the base pool.
6. This makes refcount_manifest() lose its way to the chunk pool.

To prevent this issue, start_dedup() has to be called after dedup_tier is set on the base pool. To do so, this commit prohibits getting the chunk pool id if dedup_tier is not set.

Fixes: http://tracker.ceph.com/issues/53855
Signed-off-by: Sungmin Lee <sung_min.lee@samsung.com>
(cherry picked from commit 66ad91e)
@myoungwon please have a look. This was caught on a test branch that contains your fix.
/a/yuriw-2022-07-05_17:38:43-rados-wip-yuri4-testing-2022-07-05-0719-pacific-distro-default-smithi/6915020
2022-07-06T02:26:55.749 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:55.748+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:56.777 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:56.763+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:57.748 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:57.747+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ cleanup
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ pkill -P 104256
2022-07-06T02:26:58.780 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-06T02:26:58.781 INFO:tasks.workunit.client.0.smithi191.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 104457 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-07-06T02:26:58.782 INFO:tasks.workunit.client.0.smithi191.stderr:++ true
2022-07-06T02:26:58.783 INFO:tasks.workunit.client.0.smithi191.stdout: api_tier_pp: [ OK ] LibRadosTwoPoolsPP.CacheP
2022-07-06T02:26:58.783 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-06T02:26:58.785 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-06T02:26:58.797 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:58.796+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:59.053 ERROR:teuthology.run_tasks:Saw exception from tasks.
This tracker fixes an issue with LibRadosTwoPoolsPP.TierFlushDuringFlush, but I see that this one was caught on LibRadosTwoPoolsPP.CachePin. We have been treating the failures on TierFlushDuringFlush and CachePin as the same, but I wonder if these are two separate issues (see https://tracker.ceph.com/issues/53855 for examples of both).
All in all, we should make sure that this PR truly fixes the issue before we move forward with the backport.
I see, thanks @myoungwon. I have approved #46866 and tagged it for
During tier-flush, the OSD sends a reference-increase message to the target OSD. At this point, sending the message with invalid pool information (e.g., a deleted pool) causes unexpected behavior. Therefore, this commit returns ENOENT early, before sending the message.

Fixes: https://tracker.ceph.com/issues/53294
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
(cherry picked from 3de27b2)
@myoungwon Yes, this is okay. I meant for those two backports to be tested together, and adding the commit here fulfills that. Thanks!
backport tracker: https://tracker.ceph.com/issues/56099, https://tracker.ceph.com/issues/56656
backport of #45035, #46866
parent tracker: https://tracker.ceph.com/issues/53855, https://tracker.ceph.com/issues/53294
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>