
pacific: test: fix TierFlushDuringFlush to wait until dedup_tier is set on base pool #46748

Merged

yuriw merged 2 commits into ceph:pacific from myoungwon:pacific-53855 on Aug 9, 2022
Conversation

@myoungwon (Member) commented Jun 20, 2022

backport tracker: https://tracker.ceph.com/issues/56099, https://tracker.ceph.com/issues/56656


backport of #45035, #46866
parent tracker: https://tracker.ceph.com/issues/53855, https://tracker.ceph.com/issues/53294

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

test: fix TierFlushDuringFlush to wait until dedup_tier is set on base pool

When start_dedup() is called before the dedup_tier has been set on the base pool,
it is not possible to know the target pool of the chunk object.

1. The user sets the dedup_tier on a base pool via mon_command().
2. The user issues tier_flush on an object that has a manifest (base pool)
  before the dedup_tier is applied on the base pool.
3. The OSD calls start_dedup() to flush the chunk objects to the chunk pool.
4. The OSD calls get_dedup_tier() to obtain the chunk pool of the base pool,
  but the chunk pool cannot be determined.
5. get_dedup_tier() returns 0 because the setting is not yet applied on the base pool.
6. This makes refcount_manifest() lose its way to the chunk pool.

To prevent this issue, start_dedup() has to be called after dedup_tier is set
on the base pool. To do so, this commit prohibits getting chunk pool id if
dedup_tier is not set.

Fixes: http://tracker.ceph.com/issues/53855

Signed-off-by: Sungmin Lee <sung_min.lee@samsung.com>
(cherry picked from commit 66ad91e)
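The guard described in the commit message above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the actual Ceph code: `PoolOpts`, `get_dedup_tier()`, and `start_dedup()` here are stand-ins that only model the "distinguish unset from a real pool id" behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative stand-in for the pool options: dedup_tier defaults to 0
// until the user's mon_command() has actually been applied.
struct PoolOpts {
  int64_t dedup_tier = 0;  // 0 means "not set yet"
  bool is_set() const { return dedup_tier > 0; }
};

// Before the fix, an unset tier was returned as pool id 0 and callers
// treated it as a valid target. After the fix, "no tier" is reported
// explicitly so dedup cannot start against a nonexistent pool.
std::optional<int64_t> get_dedup_tier(const PoolOpts& opts) {
  if (!opts.is_set()) {
    return std::nullopt;  // caller must not start dedup yet
  }
  return opts.dedup_tier;
}

bool start_dedup(const PoolOpts& opts) {
  auto tier = get_dedup_tier(opts);
  if (!tier) {
    return false;  // bail out; the op can be retried once dedup_tier is applied
  }
  // ... flush chunk objects to pool *tier ...
  return true;
}
```

With this shape, the test-side fix in the PR title amounts to waiting until `is_set()` holds before issuing tier-flush.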
@myoungwon myoungwon requested a review from a team as a code owner June 20, 2022 07:37
@github-actions github-actions bot added this to the pacific milestone Jun 20, 2022
@ljflores (Member) left a comment
@myoungwon please have a look. This was caught on a test branch that contains your fix.

/a/yuriw-2022-07-05_17:38:43-rados-wip-yuri4-testing-2022-07-05-0719-pacific-distro-default-smithi/6915020

2022-07-06T02:26:55.749 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:55.748+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:56.777 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:56.763+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:57.748 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:57.747+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ cleanup
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ pkill -P 104256
2022-07-06T02:26:58.780 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-06T02:26:58.781 INFO:tasks.workunit.client.0.smithi191.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 104457 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-07-06T02:26:58.782 INFO:tasks.workunit.client.0.smithi191.stderr:++ true
2022-07-06T02:26:58.783 INFO:tasks.workunit.client.0.smithi191.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.CacheP
2022-07-06T02:26:58.783 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-06T02:26:58.785 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-06T02:26:58.797 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:58.796+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:59.053 ERROR:teuthology.run_tasks:Saw exception from tasks.

This tracker is fixing an issue with LibRadosTwoPoolsPP.TierFlushDuringFlush, but I see that this failure was caught on LibRadosTwoPoolsPP.CachePin. We have been treating the failures on TierFlushDuringFlush and CachePin as the same, but I wonder if these are two separate issues (see https://tracker.ceph.com/issues/53855 for both examples).

All in all, we should make sure that this PR is truly fixing the issue before we move forward with the backport.

@ljflores ljflores requested a review from neha-ojha July 7, 2022 20:45
@myoungwon (Member, Author)

@ljflores I observed the same issue as this before, then fixed it in #46866. It seems that #46866 needs to pass QA first, then it should be backported to pacific and quincy.

@ljflores (Member), quoting the above:

> @ljflores I observed the same issue as this before, then fixed it in #46866. It seems that #46866 needs to pass QA first, then it should be backported to pacific and quincy.

I see, thanks @myoungwon. I have approved #46866 and tagged it for needs-qa.

@ljflores (Member)

#46866 still needs a pacific backport. Then this one and the pacific backport for #46866 can be tested together.

During tier-flush, the OSD sends a reference-increase message to the target OSD.
At this point, sending the message with invalid pool information (e.g., a deleted pool)
causes unexpected behavior.

Therefore, this commit returns -ENOENT early, before sending the message.

Fixes: https://tracker.ceph.com/issues/53294

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
(cherry picked from commit 3de27b2)
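The early-out described in this commit message can be sketched as follows. This is an illustrative sketch under assumed names: `pool_exists()` stands in for an OSDMap lookup, and `refcount_manifest()` here only models the check-before-send behavior, not the real Ceph function.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Stand-in for looking the pool up in the current OSDMap; a deleted or
// never-created pool would not be found.
bool pool_exists(int64_t pool_id) {
  return pool_id > 0;  // illustrative: real code consults the OSDMap
}

// Validate the target pool *before* building and sending the
// reference-increase message, instead of letting an op addressed to an
// invalid pool cause unexpected behavior downstream.
int refcount_manifest(int64_t target_pool) {
  if (!pool_exists(target_pool)) {
    return -ENOENT;  // early out: never send toward a deleted/invalid pool
  }
  // ... send the reference-count op to the target OSD ...
  return 0;
}
```

The design point is simply that the validity check moves ahead of the network send, so the error surfaces as a clean -ENOENT to the caller.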
@myoungwon (Member, Author) commented Jul 21, 2022

@ljflores I added a commit regarding #46866 here because it causes a conflict without this PR (it depends on this PR). Is that OK?

@ljflores (Member)

@myoungwon Yes, this is okay. I meant for those two backports to be tested together, and adding the commit here fulfills that. Thanks!

@yuriw yuriw merged commit bf8ad0d into ceph:pacific Aug 9, 2022