clp-s: Add support for decompression in ascending timestamp order.#440
Conversation
| decompression_options.add_options()( | ||
| "ordered", | ||
| po::bool_switch(&m_ordered_decompression), | ||
| "Enable in-order decompression for this archive" |
There was a problem hiding this comment.
Do we want to make it clear that "order" is based on timestamp?
There was a problem hiding this comment.
Yeah, I had some vague idea that we'd transparently use timestamp or MOT depending on what was available in the archive, but being explicit is best for now.
| */ | ||
| epochtime_t get_next_timestamp() const { return m_get_timestamp(); } | ||
|
|
||
| bool done() const { return m_cur_message >= m_num_messages; } |
There was a problem hiding this comment.
Do we want to add a description for this method?
| auto& schema_reader | ||
| = create_schema_reader(schema_id, should_extract_timestamp, should_marshal_records); | ||
| create_schema_reader( | ||
| m_schema_reader, |
There was a problem hiding this comment.
Do you want to rename it to initialize_schema_reader?
| return m_schema_reader; | ||
| } | ||
|
|
||
| std::vector<std::shared_ptr<SchemaReader>> ArchiveReader::load_all_tables() { |
There was a problem hiding this comment.
Do you want to rename it to read_all_tables (we have read_table above)?
| void store(); | ||
|
|
||
| private: | ||
| void construct_in_order(FileWriter& writer); |
There was a problem hiding this comment.
Can you add a description for this method?
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
wraymo
left a comment
There was a problem hiding this comment.
Great work! Since we only support ascending order for timestamps, should we explicitly state this in the command-line description and the commit message?
wraymo
left a comment
There was a problem hiding this comment.
For the commit message, what about "clp-s: Add support for decompression in ascending timestamp order."?
…-scope#440) Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
…-scope#440) Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
Description
This PR adds support to decompress a clp-s archive in timestamp order with the
--orderedcommand line option. This is implemented by loading all tables into memory, putting them into a min-heap ordered by the timestamp of the next record in each table, and popping from the heap until all records have been decompressed. The timestamp is pulled from the authoritative timestamp column specified at compression time, and if a table does not contain such a column the timestamp is 0 for ordering purposes.Note: we don't sort by timestamp within each table, so the logs can end up slightly out of timestamp order (this is probably preferred anyway, since it brings us closer to log order).
This seems to add minimal decompression speed overhead (about 2.5%) for fairly typical archives, though would likely have higher overhead for large archives or archives with many tables (due to extra memory allocations compared to decompressing one table at a time). Similarly there can be significant memory overhead, particularly for large archives.
Validation performed