clp-s: Add support for chunking output into different files during timestamp-ordered decompression#451

Merged
gibber9809 merged 7 commits into
y-scope:mainfrom
gibber9809:chunked-decompression
Jun 25, 2024
Conversation

@gibber9809

Contributor

Description

This PR adds support for chunking the output of timestamp-ordered decompression into several files, where each file holds at most the number of records given by a command-line argument. To trigger this feature, pass --ordered-chunk-split-threshold <value> together with the --ordered argument during decompression.
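
The chunking rule described above can be sketched in isolation. This is only an illustration and not the code from this PR: `split_into_chunks` is a hypothetical helper, records are modeled as plain strings, and the chunks are kept in memory instead of being written to files.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of clp-s): groups records that are already in
// ascending timestamp order into chunks of at most `threshold` records each.
std::vector<std::vector<std::string>> split_into_chunks(
        std::vector<std::string> const& ordered_records,
        std::size_t threshold
) {
    // Assumption of this sketch: a threshold of zero is not meaningful.
    assert(threshold > 0);

    std::vector<std::vector<std::string>> chunks;
    for (auto const& record : ordered_records) {
        // Start a new chunk when there is none yet or the current one is full.
        if (chunks.empty() || chunks.back().size() >= threshold) {
            chunks.emplace_back();
        }
        chunks.back().push_back(record);
    }
    return chunks;
}
```

With five records and a threshold of two, this sketch yields chunks of sizes 2, 2, and 1; only the last chunk can be smaller than the threshold.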

Validation performed

  • Tested edge case where every record ends up in the same chunk
  • Tested edge case where num_records % chunk_size == 0
  • Tested case where num_records % chunk_size > 0
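
The three cases above map onto simple arithmetic. The sketch below shows how many output files one would expect in each case, assuming an empty input produces no file at all (an assumption of this sketch; the PR does not state that behavior). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of chunks needed for `num_records` records when
// each chunk holds at most `chunk_size` records.
std::size_t expected_chunk_count(std::size_t num_records, std::size_t chunk_size) {
    assert(chunk_size > 0);
    // Full chunks, plus one partial chunk if there is a remainder.
    return num_records / chunk_size + (num_records % chunk_size != 0 ? 1 : 0);
}

// Hypothetical helper: number of records in the final chunk (0 if no chunks).
std::size_t last_chunk_size(std::size_t num_records, std::size_t chunk_size) {
    assert(chunk_size > 0);
    if (num_records == 0) {
        return 0;
    }
    std::size_t const remainder = num_records % chunk_size;
    return remainder == 0 ? chunk_size : remainder;
}
```

For example, with a chunk size of 10: 5 records fit in one chunk, 20 records fill exactly two chunks, and 25 records need three chunks with 5 records in the last one.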

@gibber9809 gibber9809 requested a review from wraymo June 18, 2024 18:58

@wraymo wraymo left a comment

Contributor

Great work! Most of the comments are about the style changes.

Comment thread components/core/src/clp_s/CommandLineArguments.cpp Outdated
Comment thread components/core/src/clp_s/CommandLineArguments.cpp Outdated
Comment on lines +84 to +98
auto finish_chunk = [&](bool open_new_writer) {
    writer.close();
    std::string new_file_name = std::string(src_path) + "_" + std::to_string(first_timestamp)
                                + "_" + std::to_string(last_timestamp) + ".jsonl";
    auto new_file_path = std::filesystem::path(new_file_name);
    std::error_code ec;
    std::filesystem::rename(src_path, new_file_path, ec);
    if (ec) {
        throw OperationFailed(ErrorCodeFailure, __FILE__, __LINE__, ec.message());
    }

    if (open_new_writer) {
        writer.open(src_path, FileWriter::OpenMode::CreateForWriting);
    }
};

Contributor

Do you have any concerns about making it a private method?

Contributor Author

I think it's easier to read this way, but if you prefer a private method I can change it.
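
For readers weighing the two options, here is a rough sketch of what the lambda's rename step might look like as free-standing functions. It uses only the standard library; `make_chunk_path` and `finalize_chunk` are hypothetical names, the `std::int64_t` timestamp type is an assumption, and the clp-s `FileWriter` and `OperationFailed` types are deliberately left out so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: builds "<src_path>_<first>_<last>.jsonl", the same
// naming scheme as the lambda under review.
std::filesystem::path make_chunk_path(
        std::filesystem::path const& src_path,
        std::int64_t first_timestamp,
        std::int64_t last_timestamp
) {
    // Assumption of this sketch: records arrive in ascending timestamp order.
    assert(first_timestamp <= last_timestamp);
    return std::filesystem::path(
            src_path.string() + "_" + std::to_string(first_timestamp) + "_"
            + std::to_string(last_timestamp) + ".jsonl"
    );
}

// Hypothetical helper: renames the finished chunk and reports any failure via
// the returned error code instead of throwing. The caller is expected to have
// closed the file before calling this.
std::error_code finalize_chunk(
        std::filesystem::path const& src_path,
        std::int64_t first_timestamp,
        std::int64_t last_timestamp
) {
    std::error_code ec;
    std::filesystem::rename(
            src_path,
            make_chunk_path(src_path, first_timestamp, last_timestamp),
            ec
    );
    return ec;
}
```

The trade-off discussed in this thread is visible here: helper functions need the path and both timestamps passed explicitly, whereas the lambda captures them by reference from the enclosing scope.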

Comment thread components/core/src/clp_s/JsonConstructor.cpp Outdated
Comment thread components/core/src/clp_s/JsonConstructor.cpp Outdated
Comment thread components/core/src/clp_s/JsonConstructor.cpp Outdated
Comment thread components/core/src/clp_s/CommandLineArguments.hpp Outdated
po::value<size_t>(&m_ordered_chunk_split_threshold)
        ->default_value(m_ordered_chunk_split_threshold),
"Number of records to include in each output chunk when decompressing records "
"in order"

Contributor

Suggested change
- "in order"
+ "in timestamp ascending order"

Contributor Author

I replaced it with "in ascending timestamp order" instead.

Comment thread components/core/src/clp_s/CommandLineArguments.cpp Outdated
gibber9809 and others added 2 commits June 20, 2024 10:53
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@gibber9809 gibber9809 requested a review from wraymo June 20, 2024 15:21
po::bool_switch(&m_ordered_decompression),
"Enable decompression in ascending timestamp order for this archive"
)(
"ordered-chunk-split-threshold",

Contributor

Do you want to update the name of this argument?

@gibber9809 gibber9809 requested a review from wraymo June 21, 2024 16:05

@wraymo wraymo left a comment

Contributor

For the PR title, what about "clp-s: Add support for chunking output into different files during timestamp-ordered decompression"?

@gibber9809 gibber9809 changed the title from "clp-s: Support chunking output into different files during timestamp-ordered decompression" to "clp-s: Add support for chunking output into different files during timestamp-ordered decompression" Jun 25, 2024
@gibber9809 gibber9809 merged commit 01d5737 into y-scope:main Jun 25, 2024
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
…mestamp-ordered decompression (y-scope#451)

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@gibber9809 gibber9809 deleted the chunked-decompression branch January 29, 2025 15:51
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…mestamp-ordered decompression (y-scope#451)

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

2 participants