Skip to content

LUCENE-8739: Add ZSTD support.#174

Closed
jpountz wants to merge 1 commit intoapache:mainfrom
jpountz:zstd
Closed

LUCENE-8739: Add ZSTD support.#174
jpountz wants to merge 1 commit intoapache:mainfrom
jpountz:zstd

Conversation

@jpountz
Copy link
Copy Markdown
Contributor

@jpountz jpountz commented Jun 7, 2021

This is a new stored fields format that leverages ZSTD via JNA.

JNA doesn't have a NOTICE file on their repository, so I copied the bottom of the README that seems to cover what the NOTICE file usually does.

This is a new stored fields format that leverages ZSTD via JNA.
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Jun 7, 2021

What is the improvement? Especially interested in performance versus a "good" zlib: e.g. your comment here: https://issues.apache.org/jira/browse/LUCENE-8739?focusedCommentId=17322015&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322015

I'm just curious because it definitely adds complexity by adding a third-party dependency/native code to lucene-codecs. JNA is complicated (e.g. .so hell with various "system versions" of it, doesn't work out of box with non-executable /tmp, doesn't support all architectures, etc)

If the improvement can be done in an easier way (e.g. instructions to luser on how to swap in better zlib), that seems like a better way to go?

@jpountz
Copy link
Copy Markdown
Contributor Author

jpountz commented Jun 8, 2021

Right, zstd definitely outperforms zlib on large files, but when I tested with Lucene stored fields, the difference wasn't very significant, which makes me wonder that most of the benefit of zstd comes from its very large window. One could increase the block size for stored fields to try to leverage this, but it would slow down document retrieval.

I wasn't aware of these limitations with JNA. I guess I'll close the PR, it should already make it easier for folks who are curious about zstd to test it with Lucene.

@jpountz jpountz closed this Jun 8, 2021
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Jun 8, 2021

@jpountz another alternative would be to build against the new JDK foreign api? If we need to call out to some native code, maybe we should look at the best available mechanism. From what I looked at before, I feel like it would have less hassles than JNA or JNI and give safer native code, but never tried it out.

@uschindler
Copy link
Copy Markdown
Contributor

With foreign API you can directly wrap arrays to compress and wrap as MemorySegment and pass to foreign function (native MethodHandle). Jextract parses header file and creates MethodHandlers out of it.
It is also safer than JNI, because user has to opt in with a JVM parameter and you can't crush your jvm without explicitly allowing it. By default this codec won't work.

So yes, that's the way to go in future...

See also MmapDirectory v2 #177 for JDK 17.

@believezzd
Copy link
Copy Markdown

@jpountz

Description

  • There are two place code that i can't understand in this commit, could you please give me help.

First one

  • To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(124)
  • why it has to skip blocks. As I know, in the compress, the compressed sub_block appends to output one by one. it it just decompress from offset to offset+length. Just As DeflateDecompressor.decompress

Second one

  • To ZstdStoredFieldsFormat#ZstdDecompressor.decompress(145)
  • why it reset bytes.offset and bytes.length to other value in the end. As I can see, in the end, bytes.offset should be zero and bytes.length should be originalLength. Same as DeflateDecompressor.decompress in the end.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants