osd: flush operations for chunked objects #19294
Conversation
Force-pushed from 972d52f to 85898f7
retest this please
@liewegas In the previous discussion (#15482), we discussed the following two issues.
I think the two issues can be handled separately, so I am making this PR first (for flushing: how to handle clean and dirty chunks).
This PR is implemented as you commented. Can you take a look?
}
uint64_t tgt_length = iter->second.length;
uint64_t tgt_offset = iter->second.offset;
hobject_t tgt_soid = iter->second.oid;
I'm not understanding this operation. It seems like there are actually two states? In one case, we did the chunking and figured out the object name but we just haven't written it back yet. In the second case, we did that already and overwrote it again, so now it has new/different content. In that case, we don't actually want to write it back to the same object, do we? (Or is the content-based naming not addressed yet? Sorry, my memory is fuzzy on what has been done so far and what hasn't!)
There is one state or two, depending on the object size. For example:

1. Figure out the dirty chunks' total size (because we don't know the actual dirty chunks, this will be used to check flush completion)
2. Send write messages
3. Receive acks
   3.1. Check completion (whether all the dirty chunks we sent are completed or not)
   3.2. Do the remaining jobs
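The completion check in step 3.1 can be sketched as follows (a minimal illustration with hypothetical names, not the actual PrimaryLogPG code): record the total dirty size up front, subtract each chunk's length as its ack arrives, and declare the flush complete only when the outstanding byte count reaches zero.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical tracker for a chunked-object flush: the total number of
// dirty bytes is recorded before any writes are sent, and each ack
// subtracts its chunk's length.  The flush is complete when the
// outstanding byte count reaches zero.
struct FlushTracker {
  uint64_t outstanding = 0;               // dirty bytes not yet acked
  std::map<uint64_t, uint64_t> in_flight; // offset -> length of sent chunks

  void send_chunk(uint64_t offset, uint64_t length) {
    in_flight[offset] = length;
    outstanding += length;
  }

  // Called when the lower tier acks the write for one chunk.
  // Returns true when all dirty chunks have been flushed (step 3.1).
  bool on_ack(uint64_t offset) {
    auto it = in_flight.find(offset);
    if (it == in_flight.end())
      return false;                       // duplicate or unknown ack
    outstanding -= it->second;
    in_flight.erase(it);
    return outstanding == 0;
  }
};
```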
Regarding the remaining jobs, I thought we needed a way to avoid resource starvation. For example, manifest_flush() needs read (storage) -> write (network) operations, but if the object is huge, flush() repeats those operations without yielding resources.
So I added the following logic: if the object's dirty size (many dirty chunks) is larger than a threshold (for example, 4 MB), just stop sending data and wait for the previous flush requests to complete.
This flush() continues once the previous flush requests are complete.
The flush model sounds right.
I'm thinking of this scenario:
- write object
- turn it into many chunks
- chunks are flushed to lower tier
- part of object is overwritten in base, and some chunks are marked dirty
At this point, the implementation of flush above is assuming that the chunks are mutable, so just marking them dirty so they can be updated later is sufficient. This is fine for normal tiering, but won't work for dedup (where the chunk is immutable, and the manifest chunk isn't dirty; it's more like 'dangling'). Maybe that's fine for now, but what is the plan here? Will these chunks have a state bit like "immutable" or "cas" so that the base tier knows how to deal with them? Because I think the "mark chunks dirty when we overwrite parts of the object" code will need to change. This mid-point is a bit odd because I don't think you would actually chunk an object into mutable chunks... chunking is only really useful for the dedup case.
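One way such a state bit could look is sketched below (entirely hypothetical names; the PR does not define these flags): a per-chunk flag tells the base tier whether a dirty chunk can be rewritten in place (mutable tiering target) or whether its old content-addressed target is now dangling (immutable dedup target).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-chunk state.  FLAG_CAS would mark a chunk whose
// lower-tier target is content-addressed and immutable, so the base
// tier knows not to treat it as a rewrite-in-place tiering chunk.
enum ChunkFlags : uint8_t {
  FLAG_DIRTY = 1 << 0,  // modified in base, needs flush
  FLAG_CAS   = 1 << 1,  // target is immutable / content-addressed
};

// For a mutable tiering chunk, dirty means "rewrite the same target".
inline bool rewrite_in_place(uint8_t flags) {
  return (flags & FLAG_DIRTY) && !(flags & FLAG_CAS);
}

// For a CAS chunk, dirty means "the old target is dangling; a new
// content-addressed object must be written and the old reference dropped".
inline bool old_target_dangling(uint8_t flags) {
  return (flags & FLAG_DIRTY) && (flags & FLAG_CAS);
}
```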
I think the chunk (for both dedup and normal tiering) can be "mutable", because we can consider the following scenarios.

Chunk states for normal tiering (no problem):
missing -> (overwritten) -> dirty -> (flushed) -> clean or missing -> (overwritten) -> dirty -> (flushed) -> clean or missing ...

Chunk states for dedup tiering (no dangling):
missing -> (overwritten) -> dirty -> (flushed) -> clean or missing -> (overwritten) -> dirty -> (if the chunk was flushed before, we can see the old chunk's target_oid (== the old cas object id), so we just send a decrement-reference message to the old cas object and then flush()) -> (when the flush finishes, chunk.target_oid can be updated with the new value) -> clean or missing ...

We only need to check whether the chunk was already flushed and send the decrement message before the chunk is flushed to the lower tier. If the cas object's reference count reaches 0, it will be deleted.
Did I misunderstand your question, or are there any uncertainties in what I said?
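The dedup flush sequence above can be sketched as follows (hypothetical helper names and a toy reference-counting pool, not code from this PR): before flushing a previously-flushed dirty chunk, decrement the old CAS object's reference, then point the chunk at the new content-addressed id; a CAS object whose count reaches 0 is deleted.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical CAS pool: reference counts per content-addressed object.
// dec_ref() deletes the object when its count drops to zero, matching
// "if the cas object's reference is 0, this will be deleted".
struct CasPool {
  std::map<std::string, int> refs;

  void inc_ref(const std::string& oid) { ++refs[oid]; }

  // Returns true if the object was deleted.
  bool dec_ref(const std::string& oid) {
    auto it = refs.find(oid);
    if (it == refs.end() || --it->second > 0)
      return false;
    refs.erase(it);
    return true;
  }
};

// Flush one dirty dedup chunk: drop the reference on the old CAS object
// (if the chunk had been flushed before, i.e. target_oid is set), then
// point the chunk at the new content-addressed id.
void flush_dedup_chunk(CasPool& pool, std::string& target_oid,
                       const std::string& new_cas_oid) {
  if (!target_oid.empty())
    pool.dec_ref(target_oid);  // send decrement to the old cas object
  pool.inc_ref(new_cas_oid);   // the flush references the new object
  target_oid = new_cas_oid;    // chunk is clean again with a new target
}
```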
Force-pushed from 85898f7 to 1bc4410
Force-pushed from 8d7f35d to 9dabae8
retest this please
I added commits in order to fix a test failure (http://pulpito.ceph.com/yuriw-2018-01-04_20:43:14-rados-wip-yuri4-testing-2018-01-04-1750-distro-basic-smithi/2026920/). So a retest is needed.
@myoungwon ok will retest
Force-pushed from 5081568 to 3a56870
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
If all chunks are dirty, the chunked object will be flushed. Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
…test Signed-off-by: Myoungwon Oh <omwmw@sk.com>
This commit prevents a double free in finish_flush() (stop_block() -> cancel_flush()). Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
To avoid an ObjectContextRef leak, drop the ObjectContextRef before sending a flush request to the lower tier. Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@yuriw Sorry, this PR causes a bunch of test failures (*.set-chunk.yaml, http://pulpito.ceph.com/yuriw-2018-01-07_19:00:29-rados-wip-yuri4-testing-2018-01-06-1732-distro-basic-smithi/).
Therefore, I removed the commit. flush() will be handled without any OpRequestRef, as originally intended.
@liewegas @yuriw This is the test result. It seems that the failed cases are not related to this PR.
These commits are the second stage (chunked manifest) for deduplication
(http://pad.ceph.com/p/deduplication_how_do_we_store_chunk), and the rest of the previous work (#15482).
Signed-off-by: Myoungwon Oh omwmw@sk.com