Skip to content

add compression method for files: Xz#16578

Merged
alexey-milovidov merged 24 commits intoClickHouse:masterfrom
fibersel:issue-8828
Nov 11, 2020
Merged

add compression method for files: Xz#16578
alexey-milovidov merged 24 commits intoClickHouse:masterfrom
fibersel:issue-8828

Conversation

@fibersel
Copy link
Copy Markdown
Contributor

@fibersel fibersel commented Nov 1, 2020

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
add *.xz compression/decompression support.It enables using *.xz in file() function.This closes #8828

@azat
Copy link
Copy Markdown
Member

azat commented Nov 1, 2020

Where can I find tests for ZlibInflatingReadBuffer.h?I grepped tests/ folder and found nothing

src/IO/tests/zlib_buffers.cpp:#include <IO/ZlibInflatingReadBuffer.h>
src/IO/tests/zlib_buffers.cpp:        DB::ZlibInflatingReadBuffer inflating_buf(std::move(buf), DB::CompressionMethod::Gzip);

Copy link
Copy Markdown
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments until there is no can-be-tested label

return;
//std::cout << ret << " RET FIN" << std::endl;

if (ret == LZMA_STREAM_END) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like indent is wrong

throw Exception(std::string("lzma stream encoding failed: ") + "; lzma version: " + LZMA_VERSION_STRING, ErrorCodes::LZMA_STREAM_ENCODER_FAILED);

std::cout << lstr.avail_in << std::endl;
//std::cout << lstr.avail_in << " " << lstr.avail_out << std::endl;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to remove? (and in a few other places)


//std::cout << ret << " RET IMPL" << std::endl;

if (ret == LZMA_STREAM_END) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding style: curly bracket is always on a new line (and not only here but in a few other places) (except for } while (true))

Comment on lines +67 to +69
throw Exception(
std::string("lzma decoder finished, but stream is still alive: error code: ") + std::to_string(ret) + "; lzma version: " + LZMA_VERSION_STRING,
ErrorCodes::LZMA_STREAM_DECODER_FAILED);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can use fmt-like constructor

Suggested change
throw Exception(
std::string("lzma decoder finished, but stream is still alive: error code: ") + std::to_string(ret) + "; lzma version: " + LZMA_VERSION_STRING,
ErrorCodes::LZMA_STREAM_DECODER_FAILED);
throw Exception(ErrorCodes::LZMA_STREAM_DECODER_FAILED, "lzma decoder finished, but stream is still alive: error code: {}; lzma version: {}", ret, LZMA_VERSION_STRING);

@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Nov 2, 2020
//std::cout << ret << " RET FIN" << std::endl;

if (ret == LZMA_STREAM_END) {
if (ret == LZMA_STREAM_END)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing whitespace, and incorrect indent below (tabs instead of spaces, and tabs also not only here)

And by the way it is not only here, it is pretty easy to fix just using git, but please note that this will override the history (and that said that if you don't know git I guess it will be better if you will fix this in your editor)

git rebase --whitespace=fix upstream/master

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I just format it with clang-format and options from .clang-format?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess (I've never used it)

add_subdirectory (replxx-cmake)
add_subdirectory (ryu-cmake)
add_subdirectory (unixodbc-cmake)
add_subdirectory (xz)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you've forget to add submodule

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, this is just to know where to fetch the submodule from.

You need to run smth like this:

git submodule add https://github.com/xz-mirror/xz contrib/xz

And this will create a special file contrib/xz in the git index (but in the filesystem you will see it as a content of the repository), that will contain the HEAD for the submodule

Copy link
Copy Markdown
Member

@azat azat Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to make fasttest update your submodule you need to add contrib/xz into

SUBMODULES_TO_UPDATE=(contrib/boost contrib/zlib-ng contrib/libxml2 contrib/poco contrib/libunwind contrib/ryu contrib/fmtlib contrib/base64 contrib/cctz contrib/libcpuid contrib/double-conversion contrib/libcxx contrib/libcxxabi contrib/libc-headers contrib/lz4 contrib/zstd contrib/fastops contrib/rapidjson contrib/re2 contrib/sparsehash-c11 contrib/croaring)

Without fasttest other tests won't be run

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it's possible to disable xz in fasttest.

@fibersel
Copy link
Copy Markdown
Contributor Author

fibersel commented Nov 2, 2020

I kind of fixed it with clang-format.I manually look for tabs with vim and don't find it.Looks like all whitespaces consist of single spaces now.

Comment on lines +332 to +333
set (LZMA_LIBRARY liblzma)
set (LZMA_INCLUDE_DIR ${ClickHouse_SOURCE_DIR}/contrib/xz/src/liblzma/api)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I guess it is better to move this out into separate contrib/xz-cmake/CMakeLists.txt, and add an interface library.
One of similar files is contrib/protobuf-cmake/CMakeLists.txt (protobuf has it's own cmake rules, while in clickhouse there is just a wrapper that install some options)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should I do to pass description check?
I've added record to CHANCHELOG.md

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should I do to pass description check?

Update description of PR and include changelog entry using specified format.
There is template for PRs - https://raw.githubusercontent.com/ClickHouse/ClickHouse/master/.github/PULL_REQUEST_TEMPLATE.md

Just copy it and modify the template, but leave the format.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's much better to provide our own CMakeLists.

CMake often cannot be safely reused (without building unneeded targets, polluting build options).

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Nov 2, 2020
@fibersel
Copy link
Copy Markdown
Contributor Author

fibersel commented Nov 2, 2020

Ok, it seems, that I should write a wrapper for CMake

@fibersel
Copy link
Copy Markdown
Contributor Author

fibersel commented Nov 2, 2020

It seems like tuklib_cpucores calls additional function that sets TUKLIB_CPUCORES_DEFINITIONS to TUKLIB_CPUCORES_SCHED_GETAFFINITY, but after that check still fails.
https://github.com/xz-mirror/xz/blob/master/cmake/tuklib_cpucores.cmake#L39

@azat
Copy link
Copy Markdown
Member

azat commented Nov 3, 2020

It seems like tuklib_cpucores calls additional function that sets TUKLIB_CPUCORES_DEFINITIONS to TUKLIB_CPUCORES_SCHED_GETAFFINITY, but after that check still fails.

The reason is that if (DEFINED CACHED{VAR}) is added only in 3.14

You can try upgrade fasttest image to 20.04 (there are few images that already uses it anyway, so I guess it is fine, @alesapin ?) and it will contain cmake 3.16

P.S. you can also send a patch to the uptream to bump cmake requirements:
https://github.com/xz-mirror/xz/blob/869b9d1b4edd6df07f819d360d306251f8147353/CMakeLists.txt#L50

@alexey-milovidov
Copy link
Copy Markdown
Member

We usually write our own CMakeLists for libraries. Because original CMakeLists are intended to build a library itself, not to build it for another project.

@fibersel
Copy link
Copy Markdown
Contributor Author

fibersel commented Nov 3, 2020

2020-11-04 01:46:00 -- Using CMake version 3.13.4
Is seems like after updating base image to version 20.04 cmake is not being updated to 3.16

@azat
Copy link
Copy Markdown
Member

azat commented Nov 4, 2020

Is seems like after updating base image to version 20.04 cmake is not being updated to 3.16

Yeah, this is because fasttest image is not updated on per-PR bases (as most of the images)

std::string("lzma preset failed: ") + "; lzma version: " + LZMA_VERSION_STRING, ErrorCodes::LZMA_STREAM_ENCODER_FAILED);


lzma_filter filters[] = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comments (explanations), the code is unclear to me.

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a functional test.

grep 'Content-Encoding'
in test/queries/0_stateless

We need to make sure that xz tool can read our xz-compressed data
and we can read data compressed with the xz tool.

#include <IO/WriteHelpers.h>
#include <IO/ReadHelpers.h>

int main(int, char **)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you run it? (And what are the results in comparison with xz?)

Copy link
Copy Markdown
Contributor Author

@fibersel fibersel Nov 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I ran. But I suspect, that my implementation is not completely optimal:

(base) [centos@clickhouse build]$ ./src/IO/tests/lzma_buffers
Writing done. Elapsed: 12.78 s., 6.17 MB/s
Reading done. Elapsed: 0.79 s., 100.24 MB/s

(base) [centos@clickhouse build]$ ./src/IO/tests/zlib_buffers
Writing done. Elapsed: 8.04 s., 110.55 MB/s
Reading done. Elapsed: 4.44 s., 200.08 MB/s

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov Nov 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to say... Better to compare with xz tool, and we expect that performance should not be significantly lower.

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, only some small changes required.

@alexey-milovidov alexey-milovidov self-assigned this Nov 7, 2020
@alexey-milovidov
Copy link
Copy Markdown
Member

Maybe we can also add a test for file table function?

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Nov 7, 2020

You can find similar tests in the following way:

  • look at the git blame for the code related to CompressionMethod;
  • copy sha of the commit;
  • search the commit on GitHub;
  • open "issues" - there are also pull requests;
  • now you have found pull requests where the feature was added;
  • there should be the related tests.

@fibersel
Copy link
Copy Markdown
Contributor Author

fibersel commented Nov 11, 2020

HTTPHandlers.cpp has too many changes because I did clang-format -i src/Server/HTTPHandler.cpp before commiting.

return CompressionMethod::Zlib;
if (*method_str == "brotli" || *method_str == "br")
return CompressionMethod::Brotli;
if (*method_str == "LZMA" || *method_str == "xz")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small note... xz is actually LZMA2. But why do we need this name? Maybe just leave xz only.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there the only place to remove LZMA?
Is it ok to leave all buffer filenames with LZMA* prefix?

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

There are some leftover comments that are related to another xz implementation... we can try it in the next PR.

@fibersel
Copy link
Copy Markdown
Contributor Author

LGTM.

There are some leftover comments that are related to another xz implementation... we can try it in the next PR.

I had sent mail to contributor of fast-lzma2 maintainer, but he has not responded yet.I am not sure that fast-lzma2 is compatible with classic *.xz format and it's repository does not contain CMake file, so we need to adapt their implementation.I have faced some problems with that.Can we merge this implementation and keep in mind existence of another implementation?

@alexey-milovidov
Copy link
Copy Markdown
Member

Yes, we can merge before trying fast-lzma2.

@alexey-milovidov
Copy link
Copy Markdown
Member

You can remove commented code before we'll merge.

@alexey-milovidov
Copy link
Copy Markdown
Member

I've tested decompression performance, it's the same as for xz tool.

@alexey-milovidov alexey-milovidov merged commit 34b2a46 into ClickHouse:master Nov 11, 2020
@alexey-milovidov
Copy link
Copy Markdown
Member

Maybe you will be also interested in the internals of pixz tool.

@fibersel
Copy link
Copy Markdown
Contributor Author

Maybe you will be also interested in the internals of pixz tool.

I will keep it in mind, but now I am going to deal with main task.

@akuzm
Copy link
Copy Markdown
Contributor

akuzm commented Nov 12, 2020

This PR broke the Fast Test.

#16908

I'll merge the fix for now, so that we have at least some tests, but it's not entirely correct. CMake should be able to finish when the xz submodule is not cloned -- it's not an essential one.

path = contrib/miniselect
url = https://github.com/danlark1/miniselect
[submodule "contrib/xz"]
path = contrib/xz
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uses spaces for indent, while all other lines uses tabs (since git submodule add uses them)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generic compression methods for files: add xz.

5 participants