Skip to content

clp-s: Add support for decompression in ascending timestamp order.#440

Merged
gibber9809 merged 8 commits into
y-scope:mainfrom
gibber9809:timestamp-order-decompression
Jun 14, 2024
Merged

clp-s: Add support for decompression in ascending timestamp order.#440
gibber9809 merged 8 commits into
y-scope:mainfrom
gibber9809:timestamp-order-decompression

Conversation

@gibber9809

Copy link
Copy Markdown
Contributor

Description

This PR adds support to decompress a clp-s archive in timestamp order with the --ordered command line option. This is implemented by loading all tables into memory, putting them into a min-heap ordered by the timestamp of the next record in each table, and popping from the heap until all records have been decompressed. The timestamp is pulled from the authoritative timestamp column specified at compression time, and if a table does not contain such a column the timestamp is 0 for ordering purposes.

Note: we don't sort by timestamp within each table, so the logs can end up slightly out of timestamp order (this is probably preferred anyway, since it brings us closer to log order).

This seems to add minimal decompression speed overhead (about 2.5%) for fairly typical archives, though would likely have higher overhead for large archives or archives with many tables (due to extra memory allocations compared to decompressing one table at a time). Similarly there can be significant memory overhead, particularly for large archives.

Validation performed

  • Validated that records end up sorted in timestamp order after decompression
  • Validated that in order decompression results in identical set of records compared to out of order decompression

@gibber9809 gibber9809 requested a review from wraymo June 13, 2024 19:24
decompression_options.add_options()(
"ordered",
po::bool_switch(&m_ordered_decompression),
"Enable in-order decompression for this archive"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make it clear that "order" is based on timestamp?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had some vague idea that we'd transparently use timestamp or MOT depending on what was available in the archive, but being explicit is best for now.

*/
epochtime_t get_next_timestamp() const { return m_get_timestamp(); }

bool done() const { return m_cur_message >= m_num_messages; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a description for this method?

auto& schema_reader
= create_schema_reader(schema_id, should_extract_timestamp, should_marshal_records);
create_schema_reader(
m_schema_reader,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to rename it to initialize_schema_reader?

return m_schema_reader;
}

std::vector<std::shared_ptr<SchemaReader>> ArchiveReader::load_all_tables() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to rename it to read_all_tables (we have read_table above)?

Comment thread components/core/src/clp_s/ArchiveReader.cpp Outdated
void store();

private:
void construct_in_order(FileWriter& writer);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a description for this method?

@gibber9809 gibber9809 requested a review from wraymo June 13, 2024 23:54

@wraymo wraymo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Since we only support ascending order for timestamps, should we explicitly state this in the command-line description and the commit message?

@gibber9809 gibber9809 changed the title clp-s: Implement decompression in timestamp order. clp-s: Implement decompression in ascending timestamp order. Jun 14, 2024
@gibber9809 gibber9809 requested a review from wraymo June 14, 2024 15:16

@wraymo wraymo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the commit message, what about "clp-s: Add support for decompression in ascending timestamp order."?

@gibber9809 gibber9809 changed the title clp-s: Implement decompression in ascending timestamp order. clp-s: Add support for decompression in ascending timestamp order. Jun 14, 2024
@gibber9809 gibber9809 merged commit 8cfa96c into y-scope:main Jun 14, 2024
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
…-scope#440)

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@gibber9809 gibber9809 deleted the timestamp-order-decompression branch January 29, 2025 15:50
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…-scope#440)

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants