Add pipelined & parallel compression optimization #6262
cp5555 wants to merge 7 commits into facebook:master
Conversation
include/rocksdb/advanced_options.h
Please rename the option name to something more specific.
Changed this option to parallel_threads.
include/rocksdb/advanced_options.h
Do we need this when we already have the previous option?
Removed this since parallel_threads is sufficient.
util/work_queue.h
Please copy & paste the header from another file.
Copied RocksDB header for work_queue.h.
util/work_queue.h
Is this source code copied from somewhere else? If not, I suggest you keep the comment format similar to other code: "//" for each line.
Can you put this information in a code comment? Maybe after line 27.
Added. Also added similar information in work_queue_test.cc.
util/work_queue.h
Was the class copied from somewhere? If it is from a proven library then fine. If not, we probably should at least add some unit tests. It's also worth thinking about whether we can simplify it while still satisfying our performance requirement.
Added a unit test for work_queue.h, since there is an out-of-the-box unit test from facebook/zstd. Currently all methods in the class are used except waitUntilFinished, so maybe we could keep the class WorkQueue as it was in the zstd repo.
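For reference, a minimal sketch of what such a bounded, thread-safe work queue can look like. This is an illustrative simplification, not the actual zstd/RocksDB WorkQueue API; names and the exact blocking semantics are assumptions.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Minimal bounded MPMC work queue sketch. Producers block when the queue
// is full; consumers block when it is empty; finish() unblocks everyone.
template <typename T>
class SimpleWorkQueue {
 public:
  // max_size == 0 means unbounded.
  explicit SimpleWorkQueue(std::size_t max_size = 0) : max_size_(max_size) {}

  // Blocks while the queue is full; returns false once finish() was called.
  bool push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] {
      return done_ || max_size_ == 0 || queue_.size() < max_size_;
    });
    if (done_) return false;
    queue_.push(std::move(item));
    not_empty_.notify_one();
    return true;
  }

  // Blocks while the queue is empty; returns false when finished and drained.
  bool pop(T* out) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return done_ || !queue_.empty(); });
    if (queue_.empty()) return false;
    *out = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return true;
  }

  // Signals that no more items will arrive; consumers drain remaining items.
  void finish() {
    std::lock_guard<std::mutex> lock(mutex_);
    done_ = true;
    not_empty_.notify_all();
    not_full_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
  std::queue<T> queue_;
  std::size_t max_size_;
  bool done_ = false;
};
```

The key design point echoed in the review below is the bound (max_size): without it, a fast producer can flood the queue with raw blocks.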
Do we need to keep it a subclass of BlockBasedTableBuilder, or can we define a separate class for it? It feels that the relationship between the two classes is quite loose.
ParallelCompressionRep was designed as a helper class for parallel compression in block-based tables only. ParallelCompressionRep will only be used by BlockBasedTableBuilder, so I made it a private inner class so that it's not visible to other code, including code that includes block_based_table_builder.h.
Can you keep the coding convention? You can run "make format" to reformat it. I tried it and it also works with the Ubuntu subsystem on Windows. Let me know if you want me to run it for you and give you a patch for formatting.
I've checked the make format results in latest commits. Sorry for the inconvenience.
Force-pushed from 287046f to 0b02f05
ping @siying
siying
left a comment
Awesome! I don't have major comments in the main logic. More unit tests might be needed though.
  // Get blocks from mem-table walking thread, compress them and
  // pass them to the write thread. Used in parallel compression mode only
  void WriteBlocks(CompressionContext& compression_ctx,
If my understanding is correct, this function is used by the compression threads. I think we should try to think of a better name. Right now, it's a little bit hard for me to imagine that it is a long-running function that keeps taking work items from a queue and processing them until it is signaled to finish. Maybe include something like "thread", or keep the convention of some parts of the code and prefix "Bg" in the function name.
Renamed WriteBlocks to BGWorkCompression.
                   CompressionType& result_compression_type);

  // Get compressed blocks from WriteBlocks and write them into SST
  void WriteRawBlocks();
Similar to WriteBlocks(). If my understanding is correct, this is the function used by the block writing thread. Can you try to think of a better name? Something like including "thread" in the function name.
Renamed WriteRawBlocks to BGWorkWriteRawBlock. Dropped the "s" suffix to stay aligned with the original WriteRawBlock.
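To visualize the two renamed roles, here is a hypothetical, heavily simplified sketch of the pipeline shape: several BGWorkCompression-style workers pull raw blocks from a queue, and a single BGWorkWriteRawBlock-style thread consumes the results. The Channel, Block, FakeCompress, and BuildPipelined names are stand-ins for illustration, not RocksDB code.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Simplified stand-in for a block flowing through the pipeline.
struct Block {
  std::size_t seq;      // position in the file, used to restore write order
  std::string payload;  // raw bytes in, "compressed" bytes out
};

// Tiny unbounded thread-safe queue, standing in for the real WorkQueue.
class Channel {
 public:
  void Push(Block b) {
    std::lock_guard<std::mutex> l(m_);
    q_.push_back(std::move(b));
    cv_.notify_one();
  }
  // Returns false once the channel is finished and drained.
  bool Pop(Block* out) {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return done_ || !q_.empty(); });
    if (q_.empty()) return false;
    *out = std::move(q_.front());
    q_.pop_front();
    return true;
  }
  void Finish() {
    std::lock_guard<std::mutex> l(m_);
    done_ = true;
    cv_.notify_all();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<Block> q_;
  bool done_ = false;
};

// Stand-in for real compression.
static std::string FakeCompress(const std::string& raw) { return "Z:" + raw; }

// "Compresses" blocks with n_threads BGWorkCompression-style workers and one
// BGWorkWriteRawBlock-style writer that restores file order via seq.
std::vector<std::string> BuildPipelined(const std::vector<std::string>& raws,
                                        int n_threads) {
  Channel compress_q, write_q;
  std::vector<std::thread> workers;
  for (int i = 0; i < n_threads; ++i) {
    workers.emplace_back([&compress_q, &write_q] {
      Block b;
      while (compress_q.Pop(&b)) {     // keep taking work items...
        b.payload = FakeCompress(b.payload);
        write_q.Push(std::move(b));    // ...and hand results to the writer
      }
    });
  }
  std::vector<std::string> out(raws.size());
  std::thread writer([&write_q, &out] {
    Block b;
    while (write_q.Pop(&b)) out[b.seq] = std::move(b.payload);
  });
  for (std::size_t i = 0; i < raws.size(); ++i) compress_q.Push({i, raws[i]});
  compress_q.Finish();                 // signal workers: no more raw blocks
  for (auto& t : workers) t.join();
  write_q.Finish();                    // all results pushed; release writer
  writer.join();
  return out;
}
```

In the real builder the writer appends blocks strictly in order as they become ready; indexing by `seq` here is just a simplification of that ordering constraint.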
@@ -0,0 +1,254 @@
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
I only see a unit test for the work queue, but I didn't see a unit test that builds SST files with parallel compression. Maybe I missed something. Also, it is preferred to have at least one unit test that covers the whole flow: setting the option, generating SST files, and reading them back to check that the values are correct.
Sorry, I missed unit tests in former versions.
I leveraged the existing RandomizedHarnessTest in table/table_test.cc as basic unit tests. They now also check cases where parallel compression is enabled.
I also enabled DBBasicTestWithParallelIO in db/db_basic_test.cc and DBTest2::CompressionOptions in db/db_test2.cc to check parallel compression cases.
These tests should all cover the whole flow.
Force-pushed from 00eb742 to b321bb9
siying
left a comment
Sorry, I gave it another pass and have more comments. I should have been more careful in the first round of reviews.
Again, thank you for working on it; I believe it is a very cool project.
Here is a problem: without parallel compression, inside this function call, rep_.data_begin_offset is updated, so that in BlockBasedTableBuilder::Add() we can determine that the size of the file has reached the limit and the file can be terminated. But now, we don't know it until the background threads have finished compressing the blocks.
I don't know a good way to solve the problem. Can we estimate the size and terminate the file in BlockBasedTableBuilder::Add()?
Also, can the compression queue be a pointer that points to the object in the pool instead?
> Here is a problem: without parallel compression, inside this function call, rep_.data_begin_offset is updated, so that in BlockBasedTableBuilder::Add() we can determine that the size of the file has reached the limit and the file can be terminated. But now, we don't know it until the background threads have finished compressing the blocks. I don't know a good way to solve the problem. Can we estimate the size and terminate the file in BlockBasedTableBuilder::Add()?

rep_.data_begin_offset will only be increased in the kBuffered state, where the parallel compression code path is not involved. I think that variable tracks raw size instead of compressed size, in a synchronized and single-threaded way. Only when the state is changed to kUnbuffered is code related to parallel compression executed. As a result, code related to rep_.data_begin_offset should work well with parallel compression enabled.
There is such a problem in ProcessKeyValueCompaction in compaction_job.cc though, where current_output_file_size is updated by the builder's FileSize(). However, the maximum number of blocks in flight is bounded by the number of compression threads (I'll explain that in another comment). As a result, the file size with parallel compression is bounded by original_file_size + compressed_block_size * number_of_compression_threads. I think that's acceptable in most cases.

> Also, can the compression queue be a pointer that points to the object in the pool instead?

I made all WorkQueues hold pointers into the pool. Now block_rep_pool_, compress_queue_ and write_queue_ contain references (pointers) to data in block_rep_buf_.
Added SST size estimation based on historical_compression_ratio * bytes_under_compression.
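The estimation idea above can be sketched as follows. This is a hypothetical illustration of the formula (bytes already written plus in-flight raw bytes scaled by the compression ratio observed so far); the names and exact bookkeeping in the actual change differ.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: estimate the final SST size while blocks are still being
// compressed in the background. Field names are illustrative.
struct FileSizeEstimator {
  uint64_t bytes_written = 0;         // blocks already flushed to the file
  uint64_t raw_bytes_compressed = 0;  // raw input of finished compressions
  uint64_t compressed_bytes = 0;      // output of finished compressions
  uint64_t raw_bytes_inflight = 0;    // raw bytes still under compression

  // Compression ratio observed so far; assume 1.0 before any data point.
  double HistoricalRatio() const {
    return raw_bytes_compressed == 0
               ? 1.0
               : static_cast<double>(compressed_bytes) / raw_bytes_compressed;
  }

  // Written bytes plus the expected compressed size of in-flight blocks.
  uint64_t EstimatedFileSize() const {
    return bytes_written +
           static_cast<uint64_t>(HistoricalRatio() * raw_bytes_inflight);
  }
};
```

Because in-flight raw bytes are bounded by the number of compression threads (see the discussion above), the estimation error is bounded as well.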
We follow the Google C++ Style Guide, and the class member naming convention is keys_ptr_: https://google.github.io/styleguide/cppguide.html#Variable_Names
util/work_queue.h
Maybe I missed something, but I didn't see maxSize_ ever set in the queues used. Should we set something? Because the writer thread can be significantly faster than the background compression threads, without a limit on the queue size we can end up with unlimited raw blocks in the queue, which consumes memory and makes it harder to estimate file size, but does not help with anything.
The maximum size of the queue, i.e. the number of in-flight blocks, was implicitly ensured because we have a fixed number of BlockReps, equal to the number of compression threads. Each time we want to emit a block for compression, we have to fetch a BlockRep from its pool. I made this more explicit by adding setMaxSize calls in the ParallelCompressionRep constructor.
Changed setMaxSize to initialization in the initializer list to keep the same convention as BlockBasedTableBuilder::Rep.
Is it possible to make it a std::unique_ptr so that we don't have to do the cleanup with delete?
Can you explain why the data structure is a WorkQueue and not a normal vector or deque?
The BlockRep pool will be pushed to by the writer thread and popped by the block-building thread concurrently. This is to reuse memory and keep a fixed number of in-flight compression payloads. As a result, the pool has to be thread-safe. More comments were added around the BlockRep definition for this.
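The pool-based bound described here can be sketched as follows. This is a hypothetical simplification (the real code uses a WorkQueue rather than a deque, and BlockRep carries real buffers): pre-fill a pool with a fixed number of slots; the producer must take a slot before emitting a block, and the writer returns it afterwards, so at most `capacity` blocks are ever in flight.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

struct BlockRepSlot { /* reusable per-block buffers would live here */ };

// Thread-safe pool of reusable slots; taking a slot when none is free
// blocks the producer, which implicitly bounds the in-flight block count.
class BlockRepPool {
 public:
  explicit BlockRepPool(std::size_t capacity) {
    for (std::size_t i = 0; i < capacity; ++i)
      free_.push_back(new BlockRepSlot());
  }
  // Sketch assumes all slots have been returned before destruction.
  ~BlockRepPool() {
    for (auto* s : free_) delete s;
  }

  BlockRepSlot* Take() {  // blocks until a slot is free
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return !free_.empty(); });
    BlockRepSlot* s = free_.front();
    free_.pop_front();
    return s;
  }
  void Return(BlockRepSlot* s) {  // writer gives the slot back for reuse
    std::lock_guard<std::mutex> l(m_);
    free_.push_back(s);
    cv_.notify_one();
  }
  std::size_t FreeCount() {
    std::lock_guard<std::mutex> l(m_);
    return free_.size();
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<BlockRepSlot*> free_;
};
```

The pool keeps ownership of the slot memory throughout; the queues only ever see borrowed pointers, which is the ownership model discussed in the comments below.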
I don't think we need ptr in those variable names. It's clear they are pointers because their types are pointers.
Rather than delete them, can we make them unique_ptr?
I'm a little bit confused here. We moved block_rep.slot_ptr, but we still seem to continue using and cleaning up this pointer. That seems contradictory. Do we need std::move() here?
WorkQueue's push method only accepts r-values by design. We have to use std::move to wrap the pointer, but the data referenced by the pointer is not moved; only the pointer value, i.e. a scalar, is "moved". Actually, the pointer value is just copied into the queue.
slot_ptr is now made a unique_ptr, and unique_ptr::get() returns an r-value itself, so std::move is not necessary. But std::move for the block_reps is still needed, because they are l-values.
This std::move() is confusing to me too. I'm not sure about the behavior we want for block_rep after the move.
My understanding is that, we want to reuse those allocated memory for keys, strings, first_key_in_next_block_ptr, etc. If that is the case, can we be more explicit here? If block_rep_pool keeps holding the ownership to all those objects pointed by those pointers, I don't think we should do std::move() here.
Either way, please add comments somewhere to explain the ownership for those objects.
When block_rep is a struct, std::move will only move its members (including pointers), but not the data they reference. Now that block_rep is a pointer, std::move will likewise only move its value. I've added comments on object ownership around the BlockRep definition and the WorkQueue variable definitions.
Moving a pointer is confusing too. My vote would be to move away from this move.
Modified the WorkQueue design so it copies elements instead of moving them. As long as we avoid passing large elements directly to WorkQueue in the future, this should be fine. We can always pass large objects by pointer.
This behavior is more similar to an STL queue, and maybe less misleading.
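The point being debated above can be shown in a few lines: std::move on a raw pointer only transfers the pointer value (effectively a scalar copy); the pointed-to object is untouched and ownership does not change hands. The Payload and PushByMove names below are hypothetical, purely for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Payload {
  std::string data;
};

// "Moving" a raw pointer into a container copies the address; the original
// pointer variable is left unchanged, and the pointee is not touched.
inline Payload* PushByMove(std::vector<Payload*>& queue, Payload* p) {
  queue.push_back(std::move(p));  // same effect as a plain copy of `p`
  return p;                       // `p` still holds the same address
}
```

This is why moving BlockRep* into the queue looked confusing to the reviewer, and why the final design simply copies elements into the queue.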
@siying Sorry I'm busy writing my master's thesis these days. I'll look into your comments by the end of this week.

Hi @siying , sorry for the late reply. I've fixed the code according to your comments. Besides, another unit test for parallel compression was added to the DBBasicTestWithTimestampCompressionSettings test in db/db_with_timestamp_basic_test.cc. PTAL, thanks!
include/rocksdb/advanced_options.h
More explanation about SST file inflation added.
Force-pushed from 4f8d702 to 7e390de
Hi @siying , Again thanks for your comments! Several updates since last push:
PTAL. Thanks!
siying
left a comment
Thank you for making the change. I don't have major comments anymore.
Please update the summary of the pull request to be clearer. Consider removing "Add Feature -" from the PR title to be more concise. Also add an entry to HISTORY.md to explain this new feature.
db/db_test2.cc
Ideally we need a unit test that validates the data in the database, i.e. that the keys are as expected and not lost during the compaction process.
Added a consistency check between the data written and the data in the database. This should benefit the original DBTest2::CompressionOptions test as well.
include/rocksdb/advanced_options.h
This is a public header, so mentioning an internal function like BlockBasedTableBuilder::EstimatedFileSize() is not recommended. Imagine that the readers of header files under include/rocksdb/ are RocksDB users who don't read the source code. I think the mention of internal functions here can be removed.
Rewrote the option documentation without code details.
We follow Google C++ Style, which bans C-style casts: https://google.github.io/styleguide/cppguide.html#Casting. Try to use a C++-style cast instead.
Hi @siying , I've finished addressing the latest comments; main changes include:
PTAL. Thanks!
facebook-github-bot
left a comment
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
siying
left a comment
I think it mostly looks good to me. Sorry that I have a late comment.
We also need to add an option to the stress test tool. It can be done in a follow-up pull request, but it needs to be done. Start with adding an option in db_stress to cover it. You can start by looking at how DB options are set up in db_stress_tool/db_stress_test_base.cc and add an option for the parallelism level. After doing that, set it up with some probability in tools/db_crashtest.py.
Thanks again for working on such a complicated feature. We are really close to landing it.
I think if r->status is OK, we should exit the loop and avoid calling WriteRawBlock() again. I believe inside WriteRawBlock() we only assert that the status is OK, and in release mode we would just override the status, which can cause problems. I think it's safer to just exit the loop based on r->status.
Ideally the failure case is tested.
Sorry for this late comment. I should have noticed it earlier. But this is serious and we have to fix it before committing the feature.
Sure. Did you mean exiting the loop when r->status is NOT OK?
Also, there seems to be a data race between the main thread and the block writer thread on rep_->status().
The variables added to estimate file size might also have a data race.
For the variables used to estimate file size:
raw_bytes_submitted will only be updated and accessed in the block-building thread, so it should be safe;
raw_bytes_compressed and curr_compression_ratio will be updated in the writer thread and accessed in the block-building thread. Their updates are single-threaded. There might be a case where r->offset is already updated but raw_bytes_compressed and curr_compression_ratio are not updated yet, causing an estimation error of one compressed block size. Shall we make these updates atomic, or do you think it's acceptable?
I noticed that, in the original BlockBasedTableBuilder, when compression is aborted, rep_->status is set to Status::Corruption, but WriteRawBlock is still called. Is WriteRawBlock meant to write uncompressed data here, or should we return before WriteRawBlock is called?
Regarding the data race for variables related to file size estimation, making the variables used by BlockBasedTableBuilder::EstimatedFileSize atomic should be good enough for me.
The compression validation failure seems to be a bug. We don't have to fix the bug in the non-parallel case, but if it is fixed in the parallel writer case, that is great.
Variables used by BlockBasedTableBuilder::EstimatedFileSize are all protected by estimation_mutex. This should lead to a more accurate (and predictable) estimation than separate atomic variables.
I've added an if (!ok()) check for compression in both parallel and non-parallel cases. A fake faulting compressor will be needed in the future for thorough testing, though.
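The synchronization choice described here can be sketched as follows: guard all estimation inputs with a single mutex so a reader never observes a half-applied update (e.g. the offset advanced but the ratio not yet refreshed), which is the 1-block estimation error discussed above. Names are illustrative, not the actual RocksDB members.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

class EstimationState {
 public:
  // Writer thread: account for one finished block atomically with
  // respect to EstimatedFileSize().
  void OnBlockWritten(uint64_t raw_size, uint64_t compressed_size) {
    std::lock_guard<std::mutex> l(mu_);
    offset_ += compressed_size;
    raw_bytes_compressed_ += raw_size;
    ratio_ = static_cast<double>(offset_) / raw_bytes_compressed_;
  }

  // Any thread: a consistent snapshot of the estimated file size.
  uint64_t EstimatedFileSize(uint64_t raw_bytes_inflight) {
    std::lock_guard<std::mutex> l(mu_);
    return offset_ + static_cast<uint64_t>(ratio_ * raw_bytes_inflight);
  }

 private:
  std::mutex mu_;
  uint64_t offset_ = 0;                // compressed bytes already written
  uint64_t raw_bytes_compressed_ = 0;  // raw input of finished compressions
  double ratio_ = 1.0;                 // observed compression ratio so far
};
```

With separate atomics, the offset and ratio could be observed mid-update; the single mutex trades a little contention for a consistent snapshot.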
@cp5555 has updated the pull request. Re-import the pull request
Hi @siying , Again thanks for your continuous efforts on code review. It has been really helpful because I'm not very familiar with RocksDB design principles, and I've learned a lot during the code revision. For the problems with error checking and data races, here are the updates:
PTAL. Thanks! Ziyue
Summary: facebook#6262 causes CLANG analyze to complain. Add an assertion to suppress the warning. Test Plan: Run "clang analyze" and make sure it passes.
@yzygitzh When I re-read some of the code, I had some ideas for improving the newly added code. Would you have time to address them as a follow-up? A general suggestion would be to separate the logic of generating the objects put into queues into separate functions. If possible, encapsulate some logic as member functions of the classes. Ideally, the logic of serializing and deserializing queue items should not be mixed with the logic of processing them. Another suggestion is that the logic of estimating the file offset can be separated out into different functions or even classes. We can discuss this through Messenger. Thanks again for making the contribution.
Summary: `HarnessTest` in `table_test.cc` currently tests many parameter combinations sequentially in a loop. This is problematic from a testing perspective, since if the test fails, we have no way of knowing how many/which combinations have failed. It can also cause timeouts on our test system due to the sheer number of combinations tested. (Specifically, the parallel compression threads parameter added by #6262 seems to have been the last straw.) There is some DIY code there that splits the load among eight test cases, but that does not appear to be sufficient anymore. Instead, the patch turns `HarnessTest` into a parameterized test, so all the parameter combinations can be tested separately and potentially concurrently. It also cleans up the tests a little, fixes `RandomizedLongDB`, which did not get updated when the parallel compression threads parameter was added, and turns `FooterTests` into a standalone test case (since it does not actually need a fixture class).
Pull Request resolved: #6974
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D22029572
Pulled By: ltamasi
fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41
This PR adds support for a pipelined & parallel compression optimization for BlockBasedTableBuilder. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set CompressionOptions::parallel_threads greater than 1 to enable compression parallelism.
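As a usage sketch (assuming a standard RocksDB setup with the headers available; this fragment is illustrative and not taken verbatim from the PR), enabling the feature looks roughly like this:

```cpp
#include "rocksdb/options.h"

// Sketch: enable pipelined & parallel compression with 4 compression
// threads. Values greater than 1 enable the parallel code path.
rocksdb::Options MakeParallelCompressionOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;           // any compression type works
  options.compression_opts.parallel_threads = 4;  // > 1 enables the pipeline
  return options;
}
```

Note that, per the review discussion above, file sizes under parallel compression are estimated rather than exact, and may exceed the target by up to roughly one compressed block per compression thread.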