Conversation
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of replicating on every write in `_write_bdev_label`, we replicate on mkfs and on mount. `_replicate_bdev_label` will try to replicate the label, checking first that the allocator didn't allocate anything at the offset we want to replicate to. `read_bdev_label` will read all offsets where we might find labels, as it expects at least one of them to be ok. If the label wasn't replicated and we try to read an offset apart from 0, we assume it will fail the crc check. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
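The replicate-only-into-free-space rule described in this commit message can be sketched as follows. All names here (`can_replicate_at`, `first_valid_copy`, the toy `FreeMap`) are illustrative stand-ins under assumed semantics, not the actual BlueStore code:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Toy free-space map: offset -> length of each free extent.
using FreeMap = std::map<uint64_t, uint64_t>;

// Candidate label locations; only offset 0 is guaranteed to exist on old OSDs.
const std::vector<uint64_t> kLabelOffsets = {0, 1ull << 30, 100ull << 30, 1ull << 40};
const uint64_t kLabelSize = 4096;  // assumed on-disk label footprint

// A label copy may be written at `off` only if the allocator never handed that
// range out, i.e. it still lies entirely inside a single free extent.
bool can_replicate_at(const FreeMap& free_extents, uint64_t off) {
  auto it = free_extents.upper_bound(off);
  if (it == free_extents.begin()) return false;
  --it;  // last free extent starting at or before `off`
  return it->first <= off && off + kLabelSize <= it->first + it->second;
}

// Reading tries every candidate offset and accepts the first copy whose CRC
// checks out, mirroring read_bdev_label's "at least one must be ok" rule.
std::optional<uint64_t> first_valid_copy(const std::vector<bool>& crc_ok) {
  for (size_t i = 0; i < crc_ok.size() && i < kLabelOffsets.size(); ++i)
    if (crc_ok[i]) return kLabelOffsets[i];
  return std::nullopt;
}
```

An un-replicated copy at a non-zero offset simply shows up as `crc_ok == false` and is skipped, which matches the "assume it will fail on crc check" behavior above.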
Always run init_rm_free on the expected offsets at mount and fsck so that alloc doesn't try to use those chunks. Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
int BlueStore::_replicate_bdev_label(const string &path, bluestore_bdev_label_t label, bool alloc_available) {
Why is _replicate_bdev_label separate from _write_bdev_label?
All of them should always contain the same data.
I think the additional label locations should be handled inside _write_bdev_label().
_write_bdev_label is static, in contrast to _replicate_bdev_label, which isn't because it uses alloc.
IMO it would be better then for _write_bdev_label to be responsible specifically for writing labels to multiple destinations, and to introduce another function to tag the label's extents as unallocated.
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
jenkins test make check
// Label offsets where labels might be replicated. On previous versions these
// offsets may already have been used, so labels won't exist there.
const vector<uint64_t> bdev_label_offsets = {0, 1073741824, 107374182400, 1099511627776};
What about making this configurable through regular Ceph settings?
Ceph settings can change, but to locate bdev label copies it is best to have them in fixed, predictable locations.
I do not think that making this configurable brings additional value.
I feel like there will be a problem when we upgrade from a previous version.
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
jenkins test make check
@pereman2 - could you please provide some design overview behind your implementation so we can discuss the approach. I've been thinking about something like the following but as far as I can see your approach is rather different:
Alternatively (and apparently more in line with your approach) we can fail on step 3 if some label is invalid and request fsck to fix that. But there is a drawback in this approach: some label locations could be permanently damaged due to e.g. disk errors, which will effectively forbid the OSD from starting as the loop (read - error - fsck - read - error - ...) occurs.
void BlueStore::_unalloc_bdev_label_offset(uint64_t offset) {
  ceph_assert(alloc);
  bool existing_allocation_in_offset = true;
  alloc->foreach([&](uint64_t extent_offset, uint64_t extent_length) {
This could be quite a time-consuming operation on a highly fragmented modern HDD.
Suggest introducing a virtual bool Allocator::is_allocated(offset [, len]) method and implementing it in the descendants to perform this efficiently.
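A minimal sketch of the suggested `is_allocated` hook, using a plain `std::map` of allocated extents as a stand-in for a real allocator's internal structure (the actual Ceph `Allocator` interface and its descendants differ; names here are illustrative):

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Abstract interface with the proposed query.
class Allocator {
public:
  virtual ~Allocator() = default;
  // Return true if any byte of [offset, offset+len) is currently allocated.
  virtual bool is_allocated(uint64_t offset, uint64_t len) = 0;
};

// Toy descendant backed by a sorted map of allocated extents
// (offset -> length), answering the query with one ordered lookup
// instead of a full foreach() scan over every extent.
class MapAllocator : public Allocator {
  std::map<uint64_t, uint64_t> allocated_;  // offset -> length
public:
  void mark_allocated(uint64_t off, uint64_t len) { allocated_[off] = len; }
  bool is_allocated(uint64_t offset, uint64_t len) override {
    auto it = allocated_.upper_bound(offset);
    // An extent starting before `offset` may still reach into the range.
    if (it != allocated_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second > offset) return true;
    }
    // An extent starting inside the queried range overlaps it.
    return it != allocated_.end() && it->first < offset + len;
  }
};
```

The point of the design is that ordered structures (maps, bitmaps, b-trees) can answer "is this one fixed label extent free?" in logarithmic or constant time, where `foreach` is linear in the number of extents.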
And most likely we can avoid this altogether if we rely on bdev content to determine the new multi-label mode and unconditionally mark the relevant locations as non-free on mount.
Agree, I missed a procedure like that. How hard would it be to implement in all allocators? Are there some that we could skip?
This is doable but for sure requires some effort to maintain this functionality in every allocator (and cover it with proper testing). Let's postpone this though until we agree on the overall design. Maybe we won't need it at all.
In general, what you are proposing looks more or less like the first of our many ideas. The key idea behind the committed changes is to only deal with multiple labels on upgraded OSDs. If you upgraded and mount again: right now when we mount we read the first label (
ifed01 left a comment
I'm still skeptical about the overall design - primarily I don't like the extent at offset zero being a keystone: a persistent disk error there will make the OSD unrecoverable, which IMO is a design flaw. And that's an issue which I have really observed a few times in the field.
Persistent disk errors at other label locations aren't well handled for now either, IMO. That could be fixed by a proper implementation though.
Apart from that, there are a few label corruption/recovery scenarios which aren't properly implemented either - an undecodable label or mismatched label fields are examples. Definitely more thorough unit testing is required.
And finally I'd really like to have more options to control the feature:
- Enable/disable label replication on mkfs. This will help in both working around some potential issues in the field and have better testing coverage - e.g. we can't implement "upgrade" test cases at store_test right now.
- Enable/disable label replication in _write_bdev_label/_replicate_bdev_label even when it's been originally enabled on mkfs. Might be helpful to work around potential implementation issues in the field.
if (recovery_mode) {
  for (auto &label1 : labels) {
    for (auto &label2 : labels) {
      if (label1.osd_uuid != label2.osd_uuid) {
Let's have a bluestore_bdev_label_t::compare method or something to compare labels, just to keep this logic close to the class definition; otherwise any change to the label content will be error-prone.
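A sketch of what such a `compare` method could look like, on a hypothetical cut-down version of the struct (the real `bluestore_bdev_label_t` carries more fields, e.g. a uuid type, a utime_t and a metadata map; the field set below is illustrative only):

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical reduced label struct with the comparison kept next to the
// class definition, as the review suggests.
struct bluestore_bdev_label_t {
  std::string osd_uuid;
  uint64_t size = 0;
  std::string birth_time;
  std::string description;

  // Single point of truth for "are two label replicas the same?", so that
  // adding a field later only requires touching this one method instead of
  // every ad-hoc field-by-field check scattered over the recovery code.
  bool compare(const bluestore_bdev_label_t& o) const {
    return std::tie(osd_uuid, size, birth_time, description) ==
           std::tie(o.osd_uuid, o.size, o.birth_time, o.description);
  }
};
```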
  ASSERT_EQ(0, r);
}

TEST_P(StoreTest, CorruptedLabelFsck) {
This is a very limited scenario to test the functionality. The provided overwrite causes a checksum mismatch in the first label only.
E.g. making some label undecodable, or making label comparison fail (with proper checksums), will cause completely different behavior: most likely an unrecoverable OSD state.
E.g. try to increase the bl size by calling:
bl.append_zero(100);
    return r;
  }
} else {
  int r = _read_bdev_label(cct, p, &label, true);
Please see the error conditions in this function. Once one of them is met, no recovery will happen, as you return a few lines below. Please also see my comment on the new test case in store_test.cc - it simply doesn't cover all the possible cases, hence you miss the error in this path.
int r = 0;
for (uint64_t offset : bdev_label_offsets) {
  r = bl.write_fd(fd, offset);
  if (r < 0) {
So failing to write out a specific label (e.g. due to a persistent disk block error) will cause an inability to "fix" an OSD and e.g. use it for data recovery. I don't think that is our expectation for this new feature.
string p = path + "/block";
bluestore_bdev_label_t label;
_read_bdev_label(cct, p, &label);
_replicate_bdev_label(p, label);
The DB is in read-write mode at this point, hence there is no guarantee that there were no allocations on the main device which could be corrupted by label replication. Additionally, this unconditional replication is in fact pretty error-prone: future changes to the mkfs procedure could cause the selected locations to be occupied, and this function would silently corrupt data there.
if (offset >= bdev->get_size()) {
  break;
}
_unalloc_bdev_label_offset(offset);
Similarly to mkfs - it's too late to tag label extents as allocated at this point. The DB has already been opened R/W and someone could have written some data over there. As a result labels will be corrupted, and one will need to repair them at the next OSD start - causing new corruption of DB data or something similar.
string value;
int r = read_meta("replicated_label", &value);
if (r == 0 && value == "yes") {
  int r = _replicate_bdev_label(p, label, false);
Please clarify which label we use here as the source for the recovery. Presumably the one we managed to read successfully above. But what if there exist two different but both valid labels, e.g. after an incomplete previous replication? Which one to choose? IMO that's another design flaw: we really need an epoch/timestamp/seqno in the label to distinguish the latest replica.
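The seqno idea could look roughly like this: every successful label rewrite bumps a sequence number, and recovery picks the readable replica with the highest one. `LabelCopy` and `pick_recovery_source` are hypothetical names for illustration only:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Per-replica state gathered while scanning the candidate offsets.
struct LabelCopy {
  bool crc_ok = false;  // did this replica pass its checksum?
  uint64_t seqno = 0;   // bumped on every label rewrite
};

// Returns the index of the replica to use as the recovery source, or
// nullopt when no replica is readable at all. An incomplete earlier
// replication leaves stale-but-valid copies behind; the highest seqno
// among the valid copies identifies the latest one.
std::optional<size_t> pick_recovery_source(const std::vector<LabelCopy>& copies) {
  std::optional<size_t> best;
  for (size_t i = 0; i < copies.size(); ++i) {
    if (!copies[i].crc_ok) continue;
    if (!best || copies[i].seqno > copies[best.value()].seqno)
      best = i;
  }
  return best;
}
```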
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
Currently this is WIP, as I'm working on replicating bluestore's label without disturbing existing allocations at the offsets where the replicated labels will be placed, in case of backports.
The idea is to replicate bluestore's label at different offsets: 0, 1 GiB, 100 GiB and 1 TiB. Currently only /block is replicated, but it is expected to extend this to bluefs if possible.
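For reference, the raw byte offsets hard-coded in the patch decode to exactly 1 GiB, 100 GiB and 1 TiB. A small sketch expressing them as shifted constants (constant names are illustrative, not from the patch):

```cpp
#include <cstdint>
#include <vector>

// Readable equivalents of the magic numbers 1073741824, 107374182400
// and 1099511627776 used in the patch's bdev_label_offsets.
constexpr uint64_t GiB = 1ull << 30;
constexpr uint64_t TiB = 1ull << 40;

const std::vector<uint64_t> bdev_label_offsets = {0, 1 * GiB, 100 * GiB, 1 * TiB};
```

Offsets beyond the device size are simply skipped at mount (see the `offset >= bdev->get_size()` check in the diff), so small devices keep fewer replicas.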