
Conversation

@zuston
Member

@zuston zuston commented Oct 8, 2022

What changes were proposed in this pull request?

  1. Introduce the ZSTD compression
  2. Introduce the abstract interface of codec
  3. Recycle the buffer to optimize the performance

Why are the changes needed?

ZSTD offers a good tradeoff between compression ratio and compression/decompression speed. To reduce the size of stored shuffle data, it is worth supporting this algorithm.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Manual tests and UTs
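The three changes above can be sketched together as a minimal codec abstraction. This is a hypothetical sketch: `Codec` and `NoOpCodec` appear in this PR's diff, but the signatures below are assumptions for illustration, not the merged code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical shape of the abstract codec interface this PR introduces.
abstract class Codec {
    // Compress src and return the compressed bytes.
    public abstract byte[] compress(byte[] src);

    // Decompress src into dest, writing uncompressedLen bytes at destOffset.
    public abstract void decompress(ByteBuffer src, int uncompressedLen,
                                    ByteBuffer dest, int destOffset);
}

// Pass-through codec: a useful baseline for tests and benchmarks.
class NoOpCodec extends Codec {
    @Override
    public byte[] compress(byte[] src) {
        return src; // no compression applied
    }

    @Override
    public void decompress(ByteBuffer src, int uncompressedLen,
                           ByteBuffer dest, int destOffset) {
        byte[] tmp = new byte[uncompressedLen];
        src.duplicate().get(tmp); // duplicate() leaves src's position untouched
        dest.position(destOffset);
        dest.put(tmp);
    }
}
```

A ZSTD implementation would subclass `Codec` the same way, delegating to the zstd library.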

@zuston
Member Author

zuston commented Oct 8, 2022

Terasort Compression Benchmark

100GB terasort

(benchmark result screenshots c1–c4)

@zuston zuston requested a review from jerqi October 8, 2022 07:49
@zuston
Member Author

zuston commented Oct 8, 2022

PTAL @jerqi

@codecov-commenter

codecov-commenter commented Oct 8, 2022

Codecov Report

Merging #254 (4557866) into master (47effb2) will decrease coverage by 0.14%.
The diff coverage is 63.54%.

@@             Coverage Diff              @@
##             master     #254      +/-   ##
============================================
- Coverage     59.71%   59.56%   -0.15%     
- Complexity     1377     1381       +4     
============================================
  Files           166      171       +5     
  Lines          8918     8983      +65     
  Branches        853      859       +6     
============================================
+ Hits           5325     5351      +26     
- Misses         3318     3353      +35     
- Partials        275      279       +4     
Impacted Files Coverage Δ
...rg/apache/hadoop/mapred/RssMapOutputCollector.java 0.00% <0.00%> (ø)
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java 23.07% <0.00%> (-51.93%) ⬇️
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java 0.00% <0.00%> (ø)
.../java/org/apache/spark/shuffle/RssSparkConfig.java 90.90% <0.00%> (-5.87%) ⬇️
...ava/org/apache/uniffle/common/RssShuffleUtils.java 0.00% <ø> (-95.66%) ⬇️
...g/apache/uniffle/common/compression/NoOpCodec.java 0.00% <0.00%> (ø)
...g/apache/uniffle/common/compression/ZstdCodec.java 72.22% <72.22%> (ø)
...a/org/apache/uniffle/common/compression/Codec.java 80.00% <80.00%> (ø)
...e/spark/shuffle/reader/RssShuffleDataIterator.java 90.54% <84.61%> (+1.80%) ⬆️
...rg/apache/uniffle/common/config/RssClientConf.java 90.90% <90.90%> (ø)
... and 5 more


@zuston zuston requested a review from jerqi October 9, 2022 02:02

int uncompressedLen = compressedBlock.getUncompressLength();
if (uncompressedData == null || uncompressedData.capacity() < uncompressedLen) {
uncompressedData = ByteBuffer.allocate(uncompressedLen);
Member Author

In the original implementation, the ByteBuffer was destroyed and recreated on every decompression, so it used an off-heap ByteBuffer to avoid frequent GC.

In this PR we recycle the ByteBuffer, so I think there is no need for off-heap memory now. Maybe we should add off-heap support in the next PR.
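The recycling pattern described here can be sketched as follows. The class and method names are illustrative, not from the PR; the reuse condition mirrors the diff context above.

```java
import java.nio.ByteBuffer;

// Keep one on-heap buffer per reader and only reallocate when a block
// needs more capacity than the current buffer holds, avoiding the
// per-block allocate/GC churn of the original implementation.
class RecyclingBuffer {
    private ByteBuffer uncompressedData;

    ByteBuffer obtain(int uncompressedLen) {
        if (uncompressedData == null || uncompressedData.capacity() < uncompressedLen) {
            uncompressedData = ByteBuffer.allocate(uncompressedLen);
        }
        uncompressedData.clear(); // reset position/limit for reuse
        return uncompressedData;
    }
}
```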

Member Author

PTAL @jerqi. This is the difference from the original implementation.

Contributor

It's ok for me.

@jerqi
Contributor

jerqi commented Oct 9, 2022

Should we add the document?

@zuston
Member Author

zuston commented Oct 9, 2022

Should we add the document?

Done

@zuston zuston requested a review from jerqi October 9, 2022 05:39


@zuston zuston requested a review from jerqi October 10, 2022 03:35
@jerqi jerqi requested a review from frankliee October 10, 2022 03:58
@frankliee
Contributor

frankliee commented Oct 10, 2022

Can you provide the test report of JVM memory usage?
If the new compressor uses much more memory, it will increase the risk of OOM.

return builder.stringConf();
}

public static RssConf toRssConf(SparkConf sparkConf) {
Contributor

Why change conf design in this ZSTD PR?

Member Author

I want to make the compressorFactory accessible to both MR and Spark so it can create the concrete codec initialized from the specified conf. That leaves two choices:

  1. Use a shareable RssConf, like this PR does
  2. Introduce an extra compression config bean (I think there is no need to do so)

Besides, I want to refactor the MR/Spark client conf entry code; this PR does part of that work. Please refer to #200

@zuston
Member Author

zuston commented Oct 10, 2022

Can you provide the test report of JVM memory usage? If the new compressor uses much more memory, it will increase the risk of OOM.

The monitoring screenshots are as follows; I don't see any obvious difference.

(memory usage screenshots x1 and x2)

@zuston zuston requested a review from frankliee October 10, 2022 08:46
@zuston
Member Author

zuston commented Oct 10, 2022

Updated @frankliee

@zuston
Member Author

zuston commented Oct 12, 2022

Gentle ping @frankliee @jerqi

@zuston
Member Author

zuston commented Oct 13, 2022

Do you have any other concerns? Please let me know @jerqi @frankliee


package org.apache.uniffle.common.compression;

public interface Compressor {
Contributor

Do we need to merge Compressor and Decompressor into one Codec interface, like Hadoop does?
It is more concise and avoids mixing different pairs of Compressor and Decompressor.

@jerqi @zuston

Member Author

Let me do a simple review about hadoop codec.

Contributor

I mean that compress/decompress could share the same interface from the user's point of view.
For example, Hadoop's CompressionCodec has createOutputStream (for compression) and createInputStream (for decompression).

Member Author

@zuston zuston Oct 17, 2022

So you mean that I should create similar Zstd/LZ4CompressionCodec classes that implement the Compressor and Decompressor interfaces?

If so, it will be hard to initialize the fields specific to a compressor or decompressor, like this.lz4Factory = LZ4Factory.fastestInstance();.

Please let me know if I'm wrong.

Contributor

You could provide only a Codec instead of a CompressionFactory, hiding the inner compressor and decompressor.
The user can then use the Codec directly to compress or decompress data, without touching the compressor and decompressor themselves.

Member Author

I pushed a new commit based on your idea: 866b642

Do I get your point? @frankliee

Contributor

I prefer this style

abstract class Codec {
  private static class Compressor {}
  private static class Decompressor {}

  private getCompressor()    // for lazy init
  private getDecompressor()

  public compress()
  public decompress()
}

ZSTDCodec extends Codec {}
LZ4Codec extends Codec {}
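A concrete, hedged rendering of this lazy-init style, using the JDK's Deflater/Inflater as stand-in compressor/decompressor (the PR itself wires in ZSTD and LZ4; the class below is illustrative, not project code):

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

abstract class Codec {
    public abstract byte[] compress(byte[] src);
    public abstract byte[] decompress(byte[] src, int uncompressedLen);
}

class DeflateCodec extends Codec {
    // Lazily initialized delegates, mirroring getCompressor()/getDecompressor().
    private Deflater compressor;
    private Inflater decompressor;

    private Deflater getCompressor() {
        if (compressor == null) {
            compressor = new Deflater();
        }
        return compressor;
    }

    private Inflater getDecompressor() {
        if (decompressor == null) {
            decompressor = new Inflater();
        }
        return decompressor;
    }

    @Override
    public byte[] compress(byte[] src) {
        Deflater d = getCompressor();
        d.reset();
        d.setInput(src);
        d.finish();
        // Generous buffer: deflate can expand tiny or incompressible inputs.
        byte[] buf = new byte[src.length + 64];
        int n = d.deflate(buf);
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    @Override
    public byte[] decompress(byte[] src, int uncompressedLen) {
        Inflater i = getDecompressor();
        i.reset();
        i.setInput(src);
        byte[] out = new byte[uncompressedLen];
        try {
            i.inflate(out);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed block", e);
        }
        return out;
    }
}
```

Callers only see `compress`/`decompress`; the paired compressor and decompressor never leak out of the codec.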

Member Author

Emm.... OK. I will follow this project style.

@zuston zuston requested a review from jerqi October 18, 2022 08:26
@zuston
Member Author

zuston commented Oct 18, 2022

PTAL @jerqi

@zuston
Member Author

zuston commented Oct 21, 2022

Bug: I found that zstd has no decompressByteArray method in zstd version 1.3.2-2, which ships with Spark 2.4.6.

To be compatible with the older version, I think I should use reflection to check for it.

This problem will be fixed in the next PR. We could merge this one first.

I have updated the latest commit; could you help review @frankliee @jerqi? If there is any problem, I think I can do a quick fix this weekend.

Latest commit changelog:

  1. Introduce the abstract class Codec
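The reflection check mentioned above could look like the sketch below. The probe itself is a hypothetical helper; the zstd class and method names in the usage note come from this discussion and are not verified here.

```java
// Reflection-based capability probe: report whether a class exposes a
// method with the given signature, returning false (instead of throwing)
// when the class or method is absent so a caller can fall back to an
// older code path.
final class MethodProbe {
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class.forName(className).getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }
}
```

A caller could then probe something like `MethodProbe.hasMethod("com.github.luben.zstd.Zstd", "decompressByteArray", ...)` once at startup and pick the compatible decompression path.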

@zuston
Member Author

zuston commented Oct 24, 2022

Gentle ping @jerqi @frankliee

@zuston
Member Author

zuston commented Oct 26, 2022

Updated @frankliee . Could you help review again?

@zuston zuston requested a review from frankliee October 26, 2022 06:57
@frankliee
Contributor

LGTM, thanks for your contributions.

@frankliee frankliee merged commit 01def93 into apache:master Oct 26, 2022
jerqi pushed a commit that referenced this pull request Nov 6, 2022
### What changes were proposed in this pull request?
This PR adds the support of snappy compression/decompression based on the example of #254. 

### Why are the changes needed?
Add a new feature.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT