Skip to content

osd: flush operations for chunked objects#19294

Merged
liewegas merged 8 commits intoceph:masterfrom
myoungwon:wip-manifest-ref-flush
Jan 8, 2018
Merged

osd: flush operations for chunked objects#19294
liewegas merged 8 commits intoceph:masterfrom
myoungwon:wip-manifest-ref-flush

Conversation

@myoungwon
Copy link
Member

These commits are the second stage (chunked manifest) for deduplication
(http://pad.ceph.com/p/deduplication_how_do_we_store_chunk)
The rest of the previous work (#15482)

Signed-off-by: Myoungwon Oh omwmw@sk.com

@myoungwon myoungwon force-pushed the wip-manifest-ref-flush branch 6 times, most recently from 972d52f to 85898f7 Compare December 7, 2017 07:47
@myoungwon
Copy link
Member Author

retest this please

@myoungwon
Copy link
Member Author

myoungwon commented Dec 8, 2017

@liewegas In the previous discussion (#15482), we has discussed the following two issues.

  1. how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)
  2. if/how to handle a chunked object where some chunks are clean and others are stored on the local object.

I think two issues seem to be handled separately. so, I make this PR first (for flushing, how to handle clean and dirty chunks).
You mentioned at (#15482) as following.

  1. no manifest (object is local)
  2. we decide there are 2 chunks
  3. we write the first chunk, new manifest written with a CLEAN and DIRTY chunk (or the CLEAN chunk is MISSING and we zero out that range of the object)
  4. we write the second chunk, local object truncated to 0, new manifest written with two MISSING chunks

This PR is implemented as you comments. Can you take a look?

}
uint64_t tgt_length = iter->second.length;
uint64_t tgt_offset= iter->second.offset;
hobject_t tgt_soid = iter->second.oid;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding this operation. It seems like there are actually two states? In one case, we did the chunking and figured out the object name but we just haven't written it back yet. In teh second case, we did that already, and overwrite it again, so now it has new/different content. In that case, we don't actually want to write it back to the same object, do we? (Or is the content-based naming not addressed yet? Sorry, my memory is fuzzy on what has been done so far and what hasn't!)

Copy link
Member Author

@myoungwon myoungwon Dec 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one or two state depending on the object size. For example,

  1. figure out dirty chunks's size (because we don't know actual dirty chunks,
    this will be used to check the flush completion)
  2. send write messages
  3. receive acks
    3.1. check completion (whether all dirty chunk we sent are completed or not)
    3.2. do remaining jobs

Regarding remaining jobs,
I thought we needed a way to avoid resource starvation. For example,
manifest_flush() needs read(storage) ->write(network) operations. but, if the object size is huge, flush() repeats such operations without yielding resources.
So, I added the following logic. If the object size (many dirty chunks) is larger than the threshold value (for example, 4MB), just stop sending data and wait for the previous flush request to complete.
This flush() will continue when the previous requests (for flush()) are complete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flush model sounds right.

I'm thinking of this scenario:

  1. write object
  2. turn it into many chunks
  3. chunks are flushed to lower tier
  4. part of object is overwritten in base, and some chunks are marked dirty

At this point, the implemenation of flush above is assuming that the chunks are mutable, so just marking them dirty so they can be updated later is sufficient. This is fine for normal tiering, but won't work for dedup (where the chunk is immutable, and the manifest chunk isn't dirty, it's more like 'dangling'). Maybe that's fine for now, but what is the plan here? Will these chunks have a state bit like "immutable" or "cas" so that the base tier knows how to deal with them? Because I think the "mark chunks dirty when we overwrite parts of the object" code will need to change. This mid-point is a but odd because I don't think you would actually chunk an object into mutable chunks... chunking is only really useful for the dedup case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the chunk (both dedup and normal tier) can be "mutable" because we can consider following scenarios.

  • Chunk states for normal tiering (no problem):
    Missing -> (overwritten) -> dirty -> (flushed) ->clean or missing -> (overwritten) -> dirty
    -> (flushed) -> clean or missing...

  • Chunk states for dedup tiering (no dangling):
    Missing -> (overwritten) -> dirty -> (flushed) -> clean or missing -> (overwritten) -> dirty
    -> (if the chunk was flushed, we can see old chunk,target_oid (== old cas object id).
    So, just send decrease reference message to old cas object and then, flush()) ->
    (if flush is finished, chunk.target_oid can be updated by new value) -> clean or missing ...

We only need to check whether the chunk was flushed and send decrease message before the chunks is flushed to lower tier. If the cas object's reference is 0, this will be deleted.

Did I misunderstand your question or are there any uncertainties in what I said?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense!

@myoungwon myoungwon force-pushed the wip-manifest-ref-flush branch from 85898f7 to 1bc4410 Compare December 13, 2017 06:36
@myoungwon myoungwon force-pushed the wip-manifest-ref-flush branch 3 times, most recently from 8d7f35d to 9dabae8 Compare December 22, 2017 11:12
@myoungwon
Copy link
Member Author

retest this please

@yuriw
Copy link
Contributor

yuriw commented Jan 2, 2018

@myoungwon myoungwon changed the title WIP: osd: ref cleanup and flush operations for chunked objects osd: flush operations for chunked objects Jan 4, 2018
@myoungwon
Copy link
Member Author

I added commits in order to fix a test failure (http://pulpito.ceph.com/yuriw-2018-01-04_20:43:14-rados-wip-yuri4-testing-2018-01-04-1750-distro-basic-smithi/2026920/). So, retest is needed.

@yuriw
Copy link
Contributor

yuriw commented Jan 6, 2018

@myoungwon ok will retest

@yuriw
Copy link
Contributor

yuriw commented Jan 6, 2018

@myoungwon myoungwon force-pushed the wip-manifest-ref-flush branch 2 times, most recently from 5081568 to 3a56870 Compare January 7, 2018 13:38
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
If all chunks are dirty, the cheunked object will be flushed

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
…test

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
This commit prevents double free in finish_flush()
(stop_block() -> cancel_flush())

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
To avoid ObjectContextRef leak, drop ObjectContextRef
before send a flush request to low tier

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Copy link
Member Author

@yuriw Sorry, This PR causes bunch of test failures (*.set-chunk.yaml, http://pulpito.ceph.com/yuriw-2018-01-07_19:00:29-rados-wip-yuri4-testing-2018-01-06-1732-distro-basic-smithi/).
This is because a commit i added (write op will be handled after flush() is finished) causes an ordering problem.
For example,

  1. all chunks are dirty (need flush)
  2. OSD receive a write op (this op will be delayed after flush() is finished)
  3. OSD receive a read op (add blocked_ops).
  4. cancel flush ops
  5. enqueue blocked_ops
    At this time, ceph_test_rados receives ack for read op instead of write op.

Therefore, i remove the commit. flush() will be handled without any OpRequestRef as originally thought.

@myoungwon
Copy link
Member Author

myoungwon commented Jan 8, 2018

@myoungwon myoungwon requested a review from liewegas January 8, 2018 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants