LUCENE-8739: custom codec providing Zstandard compression/decompression#439
LUCENE-8739: custom codec providing Zstandard compression/decompression#439praveennish wants to merge 13 commits intoapache:mainfrom
Conversation
…on with dictionary
| /** Stored field format used by plugaable codec */ | ||
| public class Lucene90CustomCompressionStoredFieldsFormat extends StoredFieldsFormat { | ||
| // chunk size for zstandard | ||
| private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024; |
There was a problem hiding this comment.
Let's use bigger blocks, like for BEST_COMPRESSION.
| private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024; | |
| private static final int ZSTD_BLOCK_LENGTH = 10 * 48 * 1024; |
| private static final int NUM_SUB_BLOCKS = 10; | ||
| private final int level; | ||
| public static final int defaultLevel = 3; | ||
| private static final int DICT_SIZE_FACTOR = 2; |
There was a problem hiding this comment.
Let's do 6 like for BEST_COMPRESSION?
| private static final int DICT_SIZE_FACTOR = 2; | |
| private static final int DICT_SIZE_FACTOR = 6; |
| switch (mode) { | ||
| case ZSTD: | ||
| return new Lucene90CompressingStoredFieldsFormat( | ||
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); |
There was a problem hiding this comment.
4096 docs per chunk at most, like for BEST_COMPRESSION
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); | |
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 4096, 10); |
|
|
||
| case ZSTD_DICT: | ||
| return new Lucene90CompressingStoredFieldsFormat( | ||
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); |
There was a problem hiding this comment.
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); | |
| "CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 4096, 10); |
|
|
||
| final Lucene90CustomCompressionCodec.Mode mode; | ||
|
|
||
| private static int compressionLevel; |
|
|
||
| implementation project(':lucene:core') | ||
| moduleImplementation project(':lucene:core') | ||
| moduleImplementation 'com.github.luben:zstd-jni:1.5.0-4' |
There was a problem hiding this comment.
Just minor suggestion:
current version is 1.5.1-1
could bring another few percent compression speed
https://github.com/facebook/zstd/releases/tag/v1.5.1
There was a problem hiding this comment.
Hi Tobias,
I have tested 1.5.0-X and the data looks to be very promising with the current version. I have not tested 1.5.1-X JNI version but https://github.com/facebook/zstd/releases/tag/v1.5.1 looks good and can be picked up in the future since Zstd is very dynamic and we may see a lot of improvements over the current version!
Thanks!!
|
Is this PR going to be merged? What release is this planned for? |
|
See discussion on LUCENE-8739, this PR is unlikely going to get merged. |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
|
||
| // dictionary compression first | ||
| doCompress(bytes, off, dictLength, cctx, out); | ||
| cctx.loadDict(new ZstdDictCompress(bytes, off, dictLength, this.compressionLevel)); |
There was a problem hiding this comment.
You should close this ZstdDictCompress object, or it will cause a memory leak. I have used your code in Elasticsearch and found it to be useful, but this is a minor issue that needs to be fixed.
| // decompress dictionary first | ||
| doDecompress(in, dctx, bytes, dictLength); | ||
|
|
||
| dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength)); |
There was a problem hiding this comment.
You should close this ZstdDictDecompress object too.
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
#9784
Lucene currently supports LZ4 and Zlib compression/decompression for StoredFieldsFormat. We propose Zstandard (https://facebook.github.io/zstd/) compression/decompression for StoredFieldsFormat for the following reasons:
(https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/).
Solution
Tests
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.