os/bluestore: new write path (checksums and compression)#9228
liewegas merged 136 commits into ceph:master
OPTION(bluestore_min_alloc_size_ssd, OPT_U32, 4*1024)
OPTION(bluestore_compression, OPT_STR, "none") // force|aggressive|passive|none
OPTION(bluestore_compression_algorithm, OPT_STR, "snappy")
OPTION(bluestore_compression_min_blob_size, OPT_U32, 256*1024)
Perhaps we should consider specifying the blob size in allocation units rather than in absolute bytes. That way blobs will always be aligned with allocation units.
E.g.
bluestore_compression_min_blob_size = 8 // i.e. 8 * bluestore_min_alloc_size = 8 * 64 * 1024 bytes
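The arithmetic behind this suggestion can be sketched as follows. This is only an illustration under stated assumptions: `blob_size_bytes` and `is_alloc_aligned` are hypothetical helper names, not functions from the BlueStore code, and the allocation unit is passed in as a plain integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of BlueStore): convert a blob size that is
// configured in allocation units into a size in bytes.
inline uint64_t blob_size_bytes(uint32_t blob_size_units, uint64_t min_alloc_size) {
  assert(min_alloc_size > 0);
  return static_cast<uint64_t>(blob_size_units) * min_alloc_size;
}

// Hypothetical helper: true if len is a whole multiple of the allocation unit.
inline bool is_alloc_aligned(uint64_t len, uint64_t min_alloc_size) {
  assert(min_alloc_size > 0);
  return len % min_alloc_size == 0;
}
```

A size derived this way is a multiple of the allocation unit by construction, whereas an absolute byte value such as 256*1024 is aligned only if the configured allocation unit happens to divide it.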
I am marking this as performance because I want to track both the performance characteristics of compression support and the other potential impacts on the write path.
alg = "snappy";
} else if (g_conf->bluestore_compression_algorithm == "zlib") {
alg = "zlib";
} else if (g_conf->bluestore_compression_algorithm.length()) {
We can burden Compressor with the check for a known compression algorithm instead of doing it here. That way, adding a new algorithm becomes transparent to BlueStore. In fact, Compressor already has that check.
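A minimal sketch of the shape this comment proposes, assuming a factory that returns null for unknown names. `SketchCompressor`, `create_compressor`, and `select_compressor` are hypothetical stand-ins written for illustration; they are not Ceph's actual Compressor interface.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for a compressor plugin; not Ceph's Compressor class.
struct SketchCompressor {
  std::string name;
};

// Hypothetical factory: the only place that knows which algorithms exist.
// Returns nullptr for an unknown name.
inline std::unique_ptr<SketchCompressor> create_compressor(const std::string& alg) {
  static const std::set<std::string> known = {"snappy", "zlib"};
  if (known.count(alg) == 0) {
    return nullptr;
  }
  return std::make_unique<SketchCompressor>(SketchCompressor{alg});
}

// Caller side: no per-algorithm branches, so supporting a new algorithm
// only requires a change inside the factory.
inline std::unique_ptr<SketchCompressor> select_compressor(const std::string& configured) {
  if (configured.empty()) {
    return nullptr;  // no algorithm configured
  }
  auto c = create_compressor(configured);
  assert(!c || c->name == configured);  // factory returns what was asked for, or nothing
  return c;
}
```

With this split, the caller only has to handle a null result (for example by logging the unknown name), rather than enumerating algorithm strings itself.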
…tion. Read handler prototype implementation. Signed-off-by: Igor Fedotov <ifedotov@mirantis.com> Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This lets us track which portions of a blob are still in use. In some cases, we may be able to split the blob to deallocate a portion of it. In other cases, we will want this information to know whether to recompress the blob (or whatever). Signed-off-by: Sage Weil <sage@redhat.com>
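One way to picture this kind of usage tracking is a reference count per allocation unit. The sketch below is a simplified illustration, not the structure this commit adds: `BlobUsageSketch` and its methods are hypothetical names, and it assumes fixed-size allocation units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-blob usage tracker: one reference count per allocation unit.
class BlobUsageSketch {
 public:
  BlobUsageSketch(uint64_t blob_len, uint64_t au_size)
      : au_size_(au_size), refs_(count_units(blob_len, au_size), 0) {}

  // Add one reference to every unit overlapped by [off, off + len).
  void get(uint64_t off, uint64_t len) { adjust(off, len, +1); }

  // Drop one reference from every unit overlapped by [off, off + len).
  void put(uint64_t off, uint64_t len) { adjust(off, len, -1); }

  // Indices of units with no references left; these are candidates for
  // deallocation (or a signal that the blob may be worth rewriting).
  std::vector<std::size_t> unused_units() const {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < refs_.size(); ++i) {
      if (refs_[i] == 0) out.push_back(i);
    }
    return out;
  }

 private:
  static std::size_t count_units(uint64_t blob_len, uint64_t au_size) {
    assert(au_size > 0);
    return static_cast<std::size_t>((blob_len + au_size - 1) / au_size);
  }

  void adjust(uint64_t off, uint64_t len, int delta) {
    if (len == 0) return;
    std::size_t first = static_cast<std::size_t>(off / au_size_);
    std::size_t last = static_cast<std::size_t>((off + len - 1) / au_size_);
    assert(last < refs_.size());
    for (std::size_t i = first; i <= last; ++i) {
      assert(delta > 0 || refs_[i] > 0);  // never drop below zero
      refs_[i] += delta;
    }
  }

  uint64_t au_size_;
  std::vector<int> refs_;
};
```

For instance, after referencing a whole four-unit blob and then releasing the middle two units, `unused_units()` reports units 1 and 2, which is the information needed to decide whether part of the blob can be deallocated.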
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Trying to remove the old extent_t Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We'll remove the old ref_map once the users go away. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ressed data length. Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
e.g., 0x432da000~1000 instead of 0x432da000~0x1000. I think it's sufficiently clear that the value after ~ has the same base as the first value, and it's easier to read. And less text. Signed-off-by: Sage Weil <sage@redhat.com>
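The format described in this commit message can be reproduced with standard stream manipulators. The following is a sketch only; `format_extent` is a hypothetical name, not the function the commit changes.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical formatter: print an extent as offset~length, both in hex,
// with the 0x prefix written only once, in front of the offset.
inline std::string format_extent(uint64_t offset, uint64_t length) {
  std::ostringstream out;
  // std::hex stays in effect for the rest of the stream, so the length is
  // also printed in hex without repeating the prefix.
  out << "0x" << std::hex << offset << "~" << length;
  return out.str();
}
```

Calling `format_extent(0x432da000, 0x1000)` yields `0x432da000~1000`, matching the shorter form above.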
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
_wctx_finish callers always write the onode; we only need to worry about our changes to the bnode. Signed-off-by: Sage Weil <sage@redhat.com>
Also include b_off in there. Signed-off-by: Sage Weil <sage@redhat.com>
Kill some mostly-duplicated code Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…rite happens for neighboring csum blocks to verify for potential alignment issue Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…egacy Bnode::ref_map Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…tion (incomplete) Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
encode/decode of vector<char> is not optimized. Bufferptr is a more natural type here anyway. Signed-off-by: Sage Weil <sage@redhat.com>
Size these using a global config. This is only a starting point--we'll obviously have to rework this to share memory across collections. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Add a Buffer flag to mark that a buffer should not be cached once it is stable. Signed-off-by: Sage Weil <sage@redhat.com>
The checks are the same (or should be--we had missed a few). Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This branch completely replaces the read and write path with a new design.
The disk format has changed, and the data structures are very different.
Compression is only partially supported, but checksums work.
ceph_test_objectstore passes all tests.