core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.#417
Merged
Conflicts: components/core/src/clp/GlobalMySQLMetadataDB.cpp
7d848a4 to
425377b
Compare
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
LinZhihao-723
previously approved these changes
Jun 5, 2024
LinZhihao-723
left a comment
Member
Commit message suggestion:
core-clp: Add support for decompressing an IR from a specific file split from a clp archive.
I've done my parts of review, maybe you can take it over from here @kirkrodrigues
kirkrodrigues
requested changes
Jun 7, 2024
Member
Forgot to mention, since we now have a log event serializer, can we add some unit tests to test serialization + deserialization?
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
haiqi96
commented
Jun 7, 2024
{
    SPDLOG_ERROR(
        "Failed to create directory structure {}, errno={}",
        output_dir.c_str(),
Contributor
Author
Suggested change
- output_dir.c_str(),
+ temp_output_dir.c_str(),
LinZhihao-723
requested changes
Jun 7, 2024
Comment on lines +59 to +69
if (false == res) {
    close_writer();
    return true;
}

m_is_open = true;

// Flush the preamble
flush();

return false;
Member
Suggested change
- if (false == res) {
-     close_writer();
-     return true;
- }
- m_is_open = true;
- // Flush the preamble
- flush();
- return false;
+ if (false == res) {
+     close_writer();
+     return false;
+ }
+ m_is_open = true;
+ // Flush the preamble
+ flush();
+ return true;
We should return false to indicate an error, right?
LinZhihao-723
requested changes
Jun 7, 2024
}
begin_message_ix = end_message_ix;

if (auto const error_code = ir_serializer.open(temp_ir_path.string());

LogEventSerializer<four_byte_encoded_variable_t> ir_serializer;
// Open output IR file
if (auto const error_code = ir_serializer.open(temp_ir_path.string());
LinZhihao-723
previously approved these changes
Jun 7, 2024
kirkrodrigues
approved these changes
Jun 7, 2024
kirkrodrigues
left a comment
Member
For the PR title, how about:
core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…clp archive into one or more IR files. (y-scope#417)
References
Description
The change is motivated by the need to support the log viewer, which
This PR introduces a new decompression interface, decompress_ir, that decompresses a file split into one or more IRs. The function takes an original file ID, a specific message index, and a threshold.
It first finds the file split that contains the message index, then decompresses the split into one or more IRs; the function creates a new IR whenever the current IR's raw size (i.e., not zstd-compressed) is greater than the given threshold.
Each IR follows the IRv1 format, meaning it has a complete preamble and an EoF byte, and can be deserialized individually.
The generated IRs use the naming format: <FILE_ORIG_ID><begin_message_ix><end_message_ix>.clp.zst. Since the IRv1 preamble doesn't contain any log event index information, this name is essential for the consumer of an IR to know which range of log event indices it contains.
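The splitting and naming rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `IrChunk`, `split_by_raw_size`, and `make_ir_name` are hypothetical names, the raw size is approximated by the message length in bytes, the end index is treated as exclusive, and the `_` separator between the name fields is an assumption (the description shows the three fields without a visible separator).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of the message index range covered by one IR.
struct IrChunk {
    size_t begin_message_ix;
    size_t end_message_ix;  // exclusive (assumption)
};

// Walks the messages of one file split and closes the current chunk
// whenever its accumulated raw size becomes greater than `threshold`.
std::vector<IrChunk> split_by_raw_size(
        std::vector<std::string> const& messages,
        size_t begin_message_ix,
        size_t threshold
) {
    std::vector<IrChunk> chunks;
    size_t chunk_begin = begin_message_ix;
    size_t raw_size = 0;
    for (size_t i = 0; i < messages.size(); ++i) {
        raw_size += messages[i].size();
        if (raw_size > threshold) {
            size_t const chunk_end = begin_message_ix + i + 1;
            chunks.push_back({chunk_begin, chunk_end});
            chunk_begin = chunk_end;
            raw_size = 0;
        }
    }
    // Emit the trailing chunk if any messages remain.
    size_t const end_ix = begin_message_ix + messages.size();
    if (chunk_begin < end_ix) {
        chunks.push_back({chunk_begin, end_ix});
    }
    return chunks;
}

// Builds a file name that encodes the index range; the "_" separator
// is an assumption made for readability.
std::string make_ir_name(std::string const& file_orig_id, IrChunk const& chunk) {
    assert(chunk.begin_message_ix < chunk.end_message_ix);
    return file_orig_id + "_" + std::to_string(chunk.begin_message_ix) + "_"
           + std::to_string(chunk.end_message_ix) + ".clp.zst";
}
```

Because every chunk carries its own index range in its name, a consumer can locate the IR containing a given message index without opening any of the files.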
The PR also introduces a new class, LogEventSerializer, that serializes a plain-text message into the IR format. Due to the limitations of our current IR-related encoding APIs, the function is designed with two inefficiencies.
We agreed that these two are acceptable, as properly supporting the flow will take more thought on reworking the encoder interface.
Validation performed
The validation was not performed directly on this PR, but on a follow-up PR that adds decompress_ir to the execution path of the clp executable. To validate the functionality, we compressed a 64MB file into archive(s). We then decompressed it into multiple IRs, decoded and concatenated them, and did a binary comparison with the original file.
We used two configurations to cover all the possible cases:
Compressed a 64MB hadoop log using a smaller encoded file size and archive size, such that the original file was split into 3 splits across 2 archives. We then decompressed all 3 IRs by running clp 3 times, each with a different message index.
Compressed the 64MB hadoop log using default settings, so only one file and one archive were generated. We then decompressed the IR using a 32MB threshold, generating 3 IRs on disk.
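The final comparison step can be sketched as below. It assumes the decoded IR contents are already in memory as strings and are ordered by message index; `chunks_match_original` is a hypothetical helper, not part of clp. It checks each decoded chunk against the matching slice of the original bytes, which is equivalent to concatenating the chunks and doing a binary comparison.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Returns true only if the decoded chunks, taken in order, reproduce
// `original` byte for byte with nothing missing and nothing extra.
bool chunks_match_original(
        std::vector<std::string> const& decoded_chunks,
        std::string_view original
) {
    size_t offset = 0;
    for (auto const& chunk : decoded_chunks) {
        // substr clamps to the end of `original`, so an overlong chunk
        // yields a shorter slice and the comparison fails.
        if (original.substr(offset, chunk.size()) != std::string_view{chunk}) {
            return false;
        }
        offset += chunk.size();
    }
    // All of the original must have been covered.
    return offset == original.size();
}
```

For a 64MB input, a streaming comparison over file handles would avoid holding everything in memory; the in-memory form is used here only to keep the sketch short.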