Skip to content

core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.#417

Merged
kirkrodrigues merged 57 commits into
y-scope:mainfrom
haiqi96:ArchiveToIR
Jun 8, 2024
Merged

core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.#417
kirkrodrigues merged 57 commits into
y-scope:mainfrom
haiqi96:ArchiveToIR

Conversation

@haiqi96

@haiqi96 haiqi96 commented May 28, 2024

Copy link
Copy Markdown
Contributor

References

Description

The change is motivated by the need to support log viewer, which

  1. Requires the input to be in the IR format.
  2. Requires the IR to be reasonably small to not overflow the memory.

This PR introduces a new decompression interface decompress_ir that decompress a file split into one or multiple IRs.

The function takes in an original file ID, a specific message index and a threshold.
It first find the file split which contains the message index, and decompress the split into one or more IR; the function creates a new IR whenever the current IR's raw size (i.e. not zstd compressed) is greater than the given threshold.
Each IR follows the IRv1 format, meaning it has the complete preamble and and EoF byte, and can be deserialized individually.

The generated IR use the naming format: <FILE_ORIG_ID><begin_message_ix><end_message_ix>.clp.zst. Since the preamble of the IRv1 doesn't contain any log event index information, this name is essential for the user of the IR to know what's the range of log index the IR contains.

The PR also introduces a new class LogEventSerializer.cpp that serialized a plain text message into the IR format.

Due to the limitation of our current IR related encoding APIs, the function is designed with two inefficienies

  1. The serialized data is first written into an internal buffer, instead of directly writing to the on-disk file.
  2. The serializer takes in a plain text message, meaning that the encoded message in the archive needs to be first decompressed into plain text (which introduce extra overhead) before being serialized.

We agreed that these two are acceptables as properly supporting the flow will take more thoughts on reworking the encoder interface.

Validation performed

The validation is not directly performed on this PR, but on a following PR which adds the decompress_ir in to the execution path of clp executable.

To validate the functionality, we compressed a 64MB file into archive(s). We then decompressed it into mulitple IRs, decoded and concatnate them, and did a binary comparison with the original file.

We used two configuration to cover all the possible cases:

  1. Compressed a 64MB hadoop log using smaller encoded file size and archive size, such that it splits the original file into 3 splits across 2 archives. We then decompressed all 3 IRs by running clp 3 times, using different message index

  2. Compressed the 64MB hadoop log using default settings, so only one file and archive was generated. We then decompressed the IR using a 32MB threshold, generating 3 IRs on disk.

@haiqi96 haiqi96 changed the title Archive to ir Draft: Archive to IR May 28, 2024
@haiqi96 haiqi96 force-pushed the ArchiveToIR branch 3 times, most recently from 7d848a4 to 425377b Compare May 29, 2024 14:34
haiqi96 and others added 4 commits June 5, 2024 10:48
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
@haiqi96 haiqi96 changed the title clp-core: Add support for decompressing an IR from a specific file split in the clp-archive core-clp: Add support for decompressing an IR from a specific file split in the clp-archive Jun 5, 2024
LinZhihao-723
LinZhihao-723 previously approved these changes Jun 5, 2024

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message suggestion:
core-clp: Add support for decompressing an IR from a specific file split from a clp archive.
I've done my parts of review, maybe you can take it over from here @kirkrodrigues

Comment thread components/core/src/clp/ir/Constant.hpp
Comment thread components/core/src/clp/ir/Constant.hpp Outdated
Comment thread components/core/src/clp/ir/LogEventSerializer.hpp Outdated
Comment thread components/core/src/clp/ir/LogEventSerializer.hpp Outdated
Comment thread components/core/src/clp/ir/LogEventSerializer.hpp Outdated
Comment thread components/core/src/clp/clp/FileDecompressor.cpp Outdated
Comment thread components/core/src/clp/clp/FileDecompressor.cpp Outdated
Comment thread components/core/src/clp/clp/FileDecompressor.cpp Outdated
Comment thread components/core/src/clp/streaming_archive/reader/Archive.hpp Outdated
Comment thread components/core/src/clp/streaming_archive/reader/Archive.hpp Outdated
@kirkrodrigues

Copy link
Copy Markdown
Member

Forgot to mention, since we now have a log event serializer, can we add some unit tests to test serialization + deserialization?

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
haiqi96 and others added 2 commits June 7, 2024 10:19
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
{
SPDLOG_ERROR(
"Failed to create directory structure {}, errno={}",
output_dir.c_str(),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output_dir.c_str(),
temp_output_dir.c_str(),

Comment on lines +59 to +69
if (false == res) {
close_writer();
return true;
}

m_is_open = true;

// Flush the preamble
flush();

return false;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (false == res) {
close_writer();
return true;
}
m_is_open = true;
// Flush the preamble
flush();
return false;
if (false == res) {
close_writer();
return false;
}
m_is_open = true;
// Flush the preamble
flush();
return true;

we should return false to indicate error right?

}
begin_message_ix = end_message_ix;

if (auto const error_code = ir_serializer.open(temp_ir_path.string());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not error_code


LogEventSerializer<four_byte_encoded_variable_t> ir_serializer;
// Open output IR file
if (auto const error_code = ir_serializer.open(temp_ir_path.string());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not error_code

LinZhihao-723
LinZhihao-723 previously approved these changes Jun 7, 2024

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.

@haiqi96 haiqi96 changed the title core-clp: Add support for decompressing an IR from a specific file split in the clp-archive core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files. Jun 8, 2024
@kirkrodrigues kirkrodrigues merged commit 040eb51 into y-scope:main Jun 8, 2024
@haiqi96 haiqi96 deleted the ArchiveToIR branch June 28, 2024 14:43
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants