os/bluestore: new write path (checksums and compression)#9228

Merged
liewegas merged 136 commits into ceph:master from
liewegas:wip-bluestore-write
Jun 1, 2016

@liewegas
Member

This branch completely replaces the read and write path with a new design.
The disk format has changed, and the data structures are very different.

Compression is only partially supported, but checksums work.

ceph_test_objectstore passes all tests.

OPTION(bluestore_min_alloc_size_ssd, OPT_U32, 4*1024)
OPTION(bluestore_compression, OPT_STR, "none") // force|aggressive|passive|none
OPTION(bluestore_compression_algorithm, OPT_STR, "snappy")
OPTION(bluestore_compression_min_blob_size, OPT_U32, 256*1024)
Contributor

Perhaps we should consider specifying blob size in allocation units rather than in absolute bytes. That way blobs will always be aligned with allocation units.
E.g.
bluestore_compression_min_blob_size = 8 // i.e., 8 * bluestore_min_alloc_size = 8 * 64 * 1024 bytes
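
The arithmetic behind this suggestion can be sketched as follows. This is only an illustration: `blob_size_bytes` and `round_up_to_alloc` are hypothetical helper names, not functions from this branch, and the 64 KiB allocation size is taken from the example above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the PR): converts a blob-size setting
// expressed in allocation units into bytes. The result is a multiple of
// min_alloc_size by construction.
constexpr uint64_t blob_size_bytes(uint64_t alloc_units, uint64_t min_alloc_size) {
  return alloc_units * min_alloc_size;
}

// For comparison: an absolute byte value has to be rounded up explicitly
// to reach the next allocation-unit boundary.
constexpr uint64_t round_up_to_alloc(uint64_t bytes, uint64_t min_alloc_size) {
  return (bytes + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
}
```

With a setting of 8 units and a 64 KiB allocation size, `blob_size_bytes(8, 64 * 1024)` gives 512 KiB, whereas an arbitrary byte value such as 100000 would need `round_up_to_alloc` to land on a boundary.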

Member Author

@markhpc
Member

markhpc commented May 25, 2016

I am marking this as performance because I want to track both the performance characteristics of the compression support and any other impact on the write path.

alg = "snappy";
} else if (g_conf->bluestore_compression_algorithm == "zlib") {
alg = "zlib";
} else if (g_conf->bluestore_compression_algorithm.length()) {
Contributor

@ifed01 ifed01 May 26, 2016

We can leave the check for a known compression algorithm to Compressor instead of doing it here. That way, adding a new algorithm becomes transparent to BlueStore. In fact, Compressor already has that check.
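
A minimal sketch of that idea, under stated assumptions: `SketchCompressor`, `create_compressor`, and `select_compressor` are hypothetical names invented for illustration and are not Ceph's actual Compressor interface. The point is only that the factory owns the list of known algorithms, so the caller has no per-algorithm branches.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a compressor interface (not Ceph's Compressor).
struct SketchCompressor {
  virtual ~SketchCompressor() = default;
  virtual std::string name() const = 0;
};

struct SnappySketch : SketchCompressor {
  std::string name() const override { return "snappy"; }
};

struct ZlibSketch : SketchCompressor {
  std::string name() const override { return "zlib"; }
};

// The factory is the only place that knows which algorithms exist.
// It returns nullptr for an unknown or empty algorithm name.
std::unique_ptr<SketchCompressor> create_compressor(const std::string& alg) {
  if (alg == "snappy") return std::unique_ptr<SketchCompressor>(new SnappySketch);
  if (alg == "zlib") return std::unique_ptr<SketchCompressor>(new ZlibSketch);
  return nullptr;
}

// Caller side: pass the configured string through and test the result.
// Adding a new algorithm to the factory needs no change here.
bool select_compressor(const std::string& configured,
                       std::unique_ptr<SketchCompressor>& out) {
  out = create_compressor(configured);
  return out != nullptr;
}
```

In this arrangement the caller only has to handle the "unknown algorithm" case once, for example by logging the configured string and falling back to no compression.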

@liewegas liewegas force-pushed the wip-bluestore-write branch 2 times, most recently from 9b8e7d0 to ba69bc0 Compare May 31, 2016 18:34
@liewegas liewegas force-pushed the wip-bluestore-write branch from f314c23 to e854964 Compare June 1, 2016 15:27
Igor Fedotov and others added 10 commits June 1, 2016 11:38
…tion. Read handler prototype implementation.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This lets us track which portions of a blob are still in use.  In some
cases, we may be able to split the blob to deallocate a portion of it.
In other cases, we will want this information to know whether to
recompress the blob (or whatever).

Signed-off-by: Sage Weil <sage@redhat.com>
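
One way to picture the tracking described in the commit message above is a reference count per fixed-size unit of a blob. The sketch below is an assumption made for illustration: `BlobUsageSketch` and its methods are hypothetical and do not reflect the data structure this branch actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one reference count per fixed-size unit of a blob.
class BlobUsageSketch {
 public:
  BlobUsageSketch(uint64_t blob_len, uint64_t unit)
      : unit_(unit), refs_((blob_len + unit - 1) / unit, 0) {}

  // Add a reference to every unit touched by [off, off + len).
  void get(uint64_t off, uint64_t len) { adjust(off, len, +1); }

  // Drop a reference from every unit touched by [off, off + len).
  void put(uint64_t off, uint64_t len) { adjust(off, len, -1); }

  // Bytes (in whole units) at the end of the blob that nothing references;
  // a candidate range for splitting the blob and deallocating the tail.
  uint64_t unused_tail_bytes() const {
    uint64_t n = 0;
    for (auto it = refs_.rbegin(); it != refs_.rend() && *it == 0; ++it) {
      n += unit_;
    }
    return n;
  }

  // True when no unit of the blob is referenced any more.
  bool unused() const {
    for (int r : refs_) {
      if (r != 0) return false;
    }
    return true;
  }

 private:
  void adjust(uint64_t off, uint64_t len, int delta) {
    if (len == 0) return;
    const uint64_t last = (off + len - 1) / unit_;
    for (uint64_t i = off / unit_; i <= last && i < refs_.size(); ++i) {
      refs_[i] += delta;
    }
  }

  uint64_t unit_;
  std::vector<int> refs_;
};
```

For a 0x10000-byte blob tracked in 0x1000-byte units, referencing only the first 0x4000 bytes leaves `unused_tail_bytes()` at 0xc000, which is the portion a split could release.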
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Trying to remove the old extent_t

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We'll remove the old ref_map once the users go away.

Signed-off-by: Sage Weil <sage@redhat.com>
liewegas and others added 27 commits June 1, 2016 11:40
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ressed data length.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
e.g., 0x432da000~1000 instead of 0x432da000~0x1000

I think it's sufficiently clear that the value after ~ should have the same
base as the first bit, and it's easier to read.  And less text.

Signed-off-by: Sage Weil <sage@redhat.com>
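
The output format described in the commit message above can be reproduced with standard stream manipulators. This is a sketch only; `format_extent` is a hypothetical name and not the operator the branch changes.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical formatter: prints offset~length in hex with a single
// "0x" prefix on the offset. std::hex stays in effect for the length,
// so it is printed in the same base without its own prefix.
std::string format_extent(uint64_t offset, uint64_t length) {
  std::ostringstream out;
  out << "0x" << std::hex << offset << "~" << length;
  return out.str();
}
```

Calling `format_extent(0x432da000, 0x1000)` yields `0x432da000~1000`, matching the example in the commit message.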
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
_wctx_finish callers always write the onode; we only need to worry about
our changes to the bnode.

Signed-off-by: Sage Weil <sage@redhat.com>
Also include b_off in there.

Signed-off-by: Sage Weil <sage@redhat.com>
Kill some mostly-duplicated code

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…rite happens for neighboring csum blocks to verify for potential alignment issue

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…egacy Bnode::ref_map

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…tion (incomplete)

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
encode/decode of vector<char> is not optimized.  Bufferptr is a more
natural type here anyway.

Signed-off-by: Sage Weil <sage@redhat.com>
Size these using a global config.  This is only a starting point--we'll
obviously have to rework this to share memory across collections.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Add a Buffer flag to mark that a buffer should not be cached once it is
stable.

Signed-off-by: Sage Weil <sage@redhat.com>
The checks are the same (or should be--we had missed a few).

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas force-pushed the wip-bluestore-write branch from e854964 to 1c2c6cc Compare June 1, 2016 15:41
@liewegas liewegas merged commit 9c8b085 into ceph:master Jun 1, 2016
@liewegas liewegas deleted the wip-bluestore-write branch August 12, 2016 21:19