
osd: add has_manifest_chunk to count chunks in snapshot#38767

Merged
athanatos merged 4 commits into ceph:master from myoungwon:wip-manifest-chunk-scrub
Jan 29, 2021

Conversation

@myoungwon
Member

cls_has_chunk does not cover snapshotted manifest objects.
This leads to unexpected behavior during chunk scrub.

Signed-off-by: Myoungwon Oh myoungwon.oh@samsung.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


{
  if (!obc->obs.oi.has_manifest()) {
    return false;
  }
Contributor

Probably want to remove this check from cls_has_chunk as well.

@athanatos
Contributor

What is cls_has_chunk supposed to do? Why do we want to return true if any snapshot has that chunk? Do we use it to validate whether a particular chunk reference is live? Wouldn't it actually have to return the number of references to a particular chunk (an object might reference the same chunk twice at different offsets within the object)?

@myoungwon
Member Author

cls_has_chunk is used for chunk scrub by ceph_dedup_tool in order to find reference mismatches.
Assume there are a manifest object and a chunk object in two tiers, like:
base tier: manifest object A (has a chunk object CHUNKA)
low tier: chunk object CHUNKA (has a reference to the object A)

Because of the false-positive design of reference counting,

[From https://github.com//pull/24230] As you know, the key idea behind reference counting is being false-positive, which means (manifest object (no ref), chunk object (has ref)) is possible instead of (manifest object (has ref), chunk 1 (no ref)). So, if we want to add chunk scrub, it would be enough to add a scrub operation in the chunk pool.

what we should do here is examine whether CHUNKA exists in the low tier, and then check that the manifest object A contains a reference to CHUNKA. If not, we should remove CHUNKA.

@myoungwon myoungwon force-pushed the wip-manifest-chunk-scrub branch from c2865cc to fddc6a1 on January 8, 2021 01:17
@athanatos
Contributor

Ok, what about the multiple-reference case I mentioned above?

@myoungwon
Member Author

As you commented, there is a case as below.
Because the same chunk appears at two different offsets within the object A, CHUNKA has two references to the object A.
But the object A is not updated correctly.
Base tier: manifest object A (has chunk object CHUNKA)
Low tier: chunk object CHUNKA (has a reference to the object A and the object A)

In this case, cls_has_chunk() should return the number of references so that the reference count in CHUNKA can be decreased.
Did you point out this case?

@athanatos
Contributor

Yeah, that's what I'm asking about. Do we record how many references each backref holds?

@myoungwon myoungwon requested a review from a team as a code owner January 8, 2021 07:13
@myoungwon myoungwon removed the request for review from a team January 8, 2021 07:15
@myoungwon
Member Author

Done. Can you take a look?

@myoungwon myoungwon force-pushed the wip-manifest-chunk-scrub branch 2 times, most recently from 5bc4aea to 0cc6aa0 on January 8, 2021 13:18
}
}

int PrimaryLogPG::has_manifest_chunk(ObjectContextRef obc, std::string& fp_oid)
Contributor

get_manifest_ref_count

@myoungwon myoungwon force-pushed the wip-manifest-chunk-scrub branch from 0cc6aa0 to 78ed661 on January 9, 2021 06:11
  obc_g ? &(obc_g->obs.oi.manifest) : nullptr,
  nullptr,
  refs);
if (!refs.is_empty()) {
Contributor

You don't need to do this check, the for loop before would just immediately terminate.

@myoungwon
Member Author

@athanatos I addressed your comment.

@athanatos
Contributor

Needs a rados suite run prior to merge.

cls_has_chunk does not cover snapshotted manifest object.
This leads to unexpected behavior during chunk scrub.

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
@myoungwon myoungwon force-pushed the wip-manifest-chunk-scrub branch from a860d7f to 1109a59 on January 25, 2021 09:56
@myoungwon
Member Author

myoungwon commented Jan 25, 2021

@athanatos
Contributor

@myoungwon There are a ton of dead and failed runs there.

@myoungwon
Member Author

https://pulpito.ceph.com/myoungwon-2021-01-25_12:40:07-rados-wip-manifest-chunk-scrub-distro-basic-smithi/

Hm... Are the dead and failed cases related to this PR?
It seems that all dead cases did not finish their jobs due to the test environment, as shown below.

2021-01-24T06:06:15.487 ERROR:teuthology.orchestra.daemon.state:Failed to send signal 1: None
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/daemon/state.py", line 108, in signal
self.proc.stdin.write(struct.pack('!b', sig))
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/file.py", line 405, in write
self._write_all(data)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/file.py", line 522, in _write_all
count = self._write(data)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 1364, in _write
self.channel.sendall(data)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 846, in sendall
sent = self.send(s)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 801, in send
return self._send(s, m)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/paramiko/channel.py", line 1198, in _send
raise socket.error("Socket is closed")
OSError: Socket is closed

In addition, the failed case is already tracked by https://tracker.ceph.com/issues/48786 and https://tracker.ceph.com/issues/48915.
Another failed case is caused by _wctx_finish in bluestore.
