osd: flush operations for chunked objects by myoungwon · Pull Request #19294 · ceph/ceph

myoungwon · 2017-12-02T13:18:34Z

These commits are the second stage (chunked manifest) of the deduplication work
(http://pad.ceph.com/p/deduplication_how_do_we_store_chunk).
They continue the previous work (#15482).

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

myoungwon · 2017-12-08T00:34:51Z

retest this please

myoungwon · 2017-12-08T10:58:58Z

@liewegas In the previous discussion (#15482), we discussed the following two issues:

how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)

if/how to handle a chunked object where some chunks are clean and others are stored on the local object.

I think these two issues can be handled separately, so I am submitting this PR first (for flushing: how to handle clean and dirty chunks).
In #15482 you described the following sequence:

no manifest (object is local)

we decide there are 2 chunks

we write the first chunk, new manifest written with a CLEAN and DIRTY chunk (or the CLEAN chunk is MISSING and we zero out that range of the object)

we write the second chunk, local object truncated to 0, new manifest written with two MISSING chunks
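
The sequence above can be read as a small state machine over the manifest. The following is a minimal sketch of the first variant (CLEAN plus DIRTY after the first write-back). The types `ChunkState`, `Chunk`, `LocalObject` and the helpers `make_chunks` and `chunk_flushed` are hypothetical names for illustration and are not taken from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical names for illustration; not the types used in PrimaryLogPG.
enum class ChunkState { Clean, Dirty, Missing };

struct Chunk {
  uint64_t length;
  ChunkState state;
};

struct LocalObject {
  uint64_t size = 0;                   // bytes still stored locally
  std::map<uint64_t, Chunk> manifest;  // offset -> chunk; empty means "no manifest"
};

// "We decide there are N chunks": everything is still local and unflushed.
inline void make_chunks(LocalObject& obj, uint64_t chunk_len) {
  for (uint64_t off = 0; off < obj.size; off += chunk_len) {
    uint64_t len = std::min(chunk_len, obj.size - off);
    obj.manifest[off] = Chunk{len, ChunkState::Dirty};
  }
}

// One chunk has been written to the lower tier, so it becomes Clean.
// Once no Dirty chunk is left, the local object is truncated to 0 and
// every chunk is recorded as Missing.
inline void chunk_flushed(LocalObject& obj, uint64_t offset) {
  obj.manifest.at(offset).state = ChunkState::Clean;
  for (const auto& kv : obj.manifest) {
    if (kv.second.state == ChunkState::Dirty) return;  // more chunks to write
  }
  obj.size = 0;
  for (auto& kv : obj.manifest) kv.second.state = ChunkState::Missing;
}
```

For an 8-byte object split into two 4-byte chunks, calling `chunk_flushed(obj, 0)` leaves one Clean and one Dirty chunk, and calling `chunk_flushed(obj, 4)` afterwards leaves two Missing chunks and a local size of 0.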

This PR is implemented as you described in your comments. Can you take a look?

liewegas · 2017-12-08T17:42:47Z

src/osd/PrimaryLogPG.cc

+    }
+    uint64_t tgt_length = iter->second.length;
+    uint64_t tgt_offset= iter->second.offset;
+    hobject_t tgt_soid = iter->second.oid;


I'm not understanding this operation. It seems like there are actually two states? In one case, we did the chunking and figured out the object name but we just haven't written it back yet. In the second case, we did that already, and overwrite it again, so now it has new/different content. In that case, we don't actually want to write it back to the same object, do we? (Or is the content-based naming not addressed yet? Sorry, my memory is fuzzy on what has been done so far and what hasn't!)

myoungwon · 2017-12-09

There are one or two states, depending on the object size. For example:

1. figure out the total size of the dirty chunks (because we don't know the actual dirty chunks,
this will be used to check the flush completion)

2. send write messages

3. receive acks
3.1. check completion (whether all the dirty chunks we sent have completed or not)
3.2. do the remaining jobs

Regarding the remaining jobs:
I thought we needed a way to avoid resource starvation. For example,
manifest_flush() needs read (storage) -> write (network) operations, but if the object is huge, flush() repeats these operations without yielding resources.
So I added the following logic: if the object (its dirty chunks) is larger than a threshold value (for example, 4MB), stop sending data and wait for the previous flush requests to complete.
The flush() continues once the previous flush requests have completed.
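
The throttled flush described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ThrottledFlush`, its fields, and the byte-count bookkeeping are hypothetical and do not reproduce the code in PrimaryLogPG.cc. The threshold is a constructor argument, so the 4MB figure is just one possible value.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical sketch; names are illustrative, not Ceph's.
struct ThrottledFlush {
  uint64_t threshold;            // stop sending once this many bytes are in flight
  std::deque<uint64_t> pending;  // lengths of dirty chunks not yet sent
  uint64_t dirty_total = 0;      // step 1: total dirty bytes, used to detect completion
  uint64_t in_flight = 0;        // bytes sent but not yet acked
  uint64_t acked = 0;            // bytes acked by the lower tier

  ThrottledFlush(uint64_t thr, std::deque<uint64_t> dirty)
      : threshold(thr), pending(std::move(dirty)) {
    for (uint64_t len : pending) dirty_total += len;
  }

  // Step 2: send chunks until the in-flight bytes reach the threshold.
  // Returns the number of chunks sent in this round.
  int send_round() {
    int sent = 0;
    while (!pending.empty() && in_flight < threshold) {
      in_flight += pending.front();
      pending.pop_front();
      ++sent;
    }
    return sent;
  }

  // Step 3: an ack arrived. 3.1 checks completion; 3.2 resumes sending
  // only once all previously sent requests have completed.
  bool on_ack(uint64_t len) {
    in_flight -= len;
    acked += len;
    if (acked == dirty_total) return true;  // flush finished
    if (in_flight == 0) send_round();       // continue with the remaining chunks
    return false;
  }
};
```

With a threshold of 4 and three dirty chunks of length 2, the first round sends two chunks, the third is sent only after both acks have arrived, and the third ack completes the flush.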

liewegas · 2017-12-09

The flush model sounds right.

I'm thinking of this scenario:

write object

turn it into many chunks

chunks are flushed to lower tier

part of object is overwritten in base, and some chunks are marked dirty

At this point, the implementation of flush above is assuming that the chunks are mutable, so just marking them dirty so they can be updated later is sufficient. This is fine for normal tiering, but won't work for dedup (where the chunk is immutable, and the manifest chunk isn't dirty, it's more like 'dangling'). Maybe that's fine for now, but what is the plan here? Will these chunks have a state bit like "immutable" or "cas" so that the base tier knows how to deal with them? Because I think the "mark chunks dirty when we overwrite parts of the object" code will need to change. This mid-point is a bit odd because I don't think you would actually chunk an object into mutable chunks... chunking is only really useful for the dedup case.

myoungwon · 2017-12-11

I think the chunk (for both dedup and normal tiering) can be "mutable", because we can consider the following scenarios.

Chunk states for normal tiering (no problem):
Missing -> (overwritten) -> dirty -> (flushed) -> clean or missing -> (overwritten) -> dirty
-> (flushed) -> clean or missing...

Chunk states for dedup tiering (no dangling):
Missing -> (overwritten) -> dirty -> (flushed) -> clean or missing -> (overwritten) -> dirty
-> (if the chunk was flushed before, we can see the old chunk.target_oid (== old CAS object id),
so just send a decrease-reference message to the old CAS object and then flush()) ->
(when the flush is finished, chunk.target_oid is updated with the new value) -> clean or missing ...

We only need to check whether the chunk was flushed before and, if so, send the decrease-reference message before the chunk is flushed to the lower tier. If the CAS object's reference count reaches 0, it will be deleted.
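
A minimal sketch of the dedup transition described above, assuming a reference-counted lower tier. `CasTier`, `DedupChunk`, and `flush_chunk` are hypothetical names introduced for illustration; the real reference counting was handled in later work and may look quite different.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the lower (CAS) tier: object id -> reference count.
struct CasTier {
  std::map<std::string, int> refs;

  void get_ref(const std::string& oid) { ++refs[oid]; }

  // Drop one reference; the CAS object is deleted when the count reaches 0.
  void put_ref(const std::string& oid) {
    auto it = refs.find(oid);
    if (it == refs.end()) return;
    if (--it->second == 0) refs.erase(it);
  }
};

struct DedupChunk {
  bool dirty = false;
  std::string target_oid;  // empty until the chunk has been flushed once
};

// Flush one dirty chunk whose new content maps to new_oid.
inline void flush_chunk(CasTier& cas, DedupChunk& c, const std::string& new_oid) {
  if (!c.dirty) return;
  if (!c.target_oid.empty()) {
    cas.put_ref(c.target_oid);  // flushed before: release the old CAS object
  }
  cas.get_ref(new_oid);         // the flush itself: reference the new content
  c.target_oid = new_oid;       // updated only once the flush has finished
  c.dirty = false;
}
```

In this model the chunk itself stays mutable in the base tier, while each CAS object is never modified in place: an overwrite followed by a flush only moves the chunk's reference from the old CAS object to the new one.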

Did I misunderstand your question or are there any uncertainties in what I said?

liewegas · 2017-12-11

That makes sense!

myoungwon · 2017-12-26T00:29:02Z

retest this please

yuriw · 2018-01-02T17:12:11Z

wip-yuri4-testing-2018-01-02-1711

myoungwon · 2018-01-05T09:14:45Z

I added commits in order to fix a test failure (http://pulpito.ceph.com/yuriw-2018-01-04_20:43:14-rados-wip-yuri4-testing-2018-01-04-1750-distro-basic-smithi/2026920/). So, retest is needed.

yuriw · 2018-01-06T00:59:00Z

@myoungwon ok will retest

yuriw · 2018-01-06T17:33:15Z

wip-yuri4-testing-2018-01-06-1732

myoungwon · 2018-01-08T05:48:53Z

@yuriw Sorry, this PR causes a bunch of test failures (*.set-chunk.yaml, http://pulpito.ceph.com/yuriw-2018-01-07_19:00:29-rados-wip-yuri4-testing-2018-01-06-1732-distro-basic-smithi/).
This is because a commit I added (write ops are handled only after flush() has finished) causes an ordering problem.
For example:

all chunks are dirty (need flush)
the OSD receives a write op (this op is delayed until flush() has finished)
the OSD receives a read op (added to blocked_ops)
the flush ops are cancelled
blocked_ops are enqueued
At this point, ceph_test_rados receives an ack for the read op instead of the write op.

Therefore, I removed the commit. flush() will be handled without any OpRequestRef, as originally planned.
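
The reordering can be reproduced with a toy model of the two queues. This is a sketch under assumptions: `FlushingObject` and its members are hypothetical, and the model assumes that the delayed write only runs after the requeued blocked_ops have been served.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of the queues involved; not the OSD's real data structures.
struct FlushingObject {
  std::deque<std::string> waiting_for_flush;  // writes parked by the removed commit
  std::deque<std::string> blocked_ops;        // ops blocked on the flushing object
  std::vector<std::string> acked;             // order in which the client sees acks

  void receive(const std::string& op, bool is_write) {
    if (is_write) {
      waiting_for_flush.push_back(op);  // delayed until flush() has finished
    } else {
      blocked_ops.push_back(op);
    }
  }

  void cancel_flush() {
    // blocked_ops are requeued and served first.
    while (!blocked_ops.empty()) {
      acked.push_back(blocked_ops.front());
      blocked_ops.pop_front();
    }
    // Assumption: the delayed write only runs afterwards.
    while (!waiting_for_flush.empty()) {
      acked.push_back(waiting_for_flush.front());
      waiting_for_flush.pop_front();
    }
  }
};
```

Submitting a write followed by a read and then cancelling the flush yields the ack order read, write, which is the opposite of the submission order that the test client expects.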

myoungwon · 2018-01-08T05:54:52Z

@liewegas @yuriw These are the test results.
http://pulpito.ceph.com/myoungwon-2018-01-07_15:08:41-rados-wip-manifest-ref-flush-distro-basic-smithi/
http://pulpito.ceph.com/myoungwon-2018-01-08_01:52:47-rados-wip-manifest-ref-flush-distro-basic-smithi/

It seems that the failed cases are not related to this PR.

myoungwon added core feature labels Dec 2, 2017

myoungwon force-pushed the wip-manifest-ref-flush branch 6 times, most recently from 972d52f to 85898f7 Compare December 7, 2017 07:47

liewegas reviewed Dec 8, 2017

liewegas added the needs-qa label Dec 11, 2017

myoungwon mentioned this pull request Dec 13, 2017

osd: fix unordered read bug (for chunked object) #19464

Merged

myoungwon force-pushed the wip-manifest-ref-flush branch from 85898f7 to 1bc4410 Compare December 13, 2017 06:36

myoungwon force-pushed the wip-manifest-ref-flush branch 3 times, most recently from 8d7f35d to 9dabae8 Compare December 22, 2017 11:12

yuriw added the wip-yuri4-testing label Jan 2, 2018

myoungwon changed the title ~~WIP: osd: ref cleanup and flush operations for chunked objects~~ osd: flush operations for chunked objects Jan 4, 2018

myoungwon force-pushed the wip-manifest-ref-flush branch 2 times, most recently from 5081568 to 3a56870 Compare January 7, 2018 13:38

myoungwon added 4 commits January 7, 2018 22:40

osd: set dirty flag if chunks are overwritten

437bb83

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

osd: add flush() for the chunked object.

2e3af00

If all chunks are dirty, the chunked object will be flushed

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

osd: add ordered flag if the object is flushing

e5cc463

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

src/test: add chunked object unit test

c97fc50

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

myoungwon added 4 commits January 7, 2018 22:41

src/test: remove version check and add data alignment for chunk_read …

fca74ef

…test

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

osd: use stop_block() if the object is blocked

6273c2f

This commit prevents double free in finish_flush() (stop_block() -> cancel_flush())

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

osd: fix updating wrong object size

085f1ca

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

osd: fix ObjectContextRef leak

3a56870

To avoid ObjectContextRef leak, drop ObjectContextRef before send a flush request to low tier

Signed-off-by: Myoungwon Oh <omwmw@sk.com>

myoungwon requested a review from liewegas January 8, 2018 11:05

liewegas approved these changes Jan 8, 2018

liewegas merged commit a913358 into ceph:master Jan 8, 2018

myoungwon mentioned this pull request Jan 12, 2018

osd: refcount for manifest object (redirect, chunked) #19935

Merged
