KV pair ir stream (IR_v2) --> clp_s archive format #543
Closed
AVMatthews wants to merge 18 commits into
Description
This PR:
Validation performed
./clp-s r elasticsearch_ir elasticsearch/
./clp-s i elasticsearch_archive elasticsearch_ir/
./clp-s x elasticsearch_archive elasticsearch_out/
./clp-s c elasticsearch_clp-s_archive elasticsearch/
./clp-s x elasticsearch_clp-s_archive elasticsearch_clp-s_out/
jq -S -c '.' elasticsearch_out/original | sort > elasticsearch_sorted.json
jq -S -c '.' elasticsearch_clp-s_out/original | sort > elasticsearch_clp-s_sorted.json
diff elasticsearch_clp-s_sorted.json elasticsearch_sorted.json | diffstat
Benchmarking Info
ElasticSearch: ~1.6x longer than clp-s
$ time ./clp-s c elasticsearch_clp-s_archive elasticsearch
real 18m4.676s
user 18m1.334s
sys 0m3.140s
$ time ./clp-s i elasticsearch_archive elasticsearch_ir/
real 29m31.376s
user 29m26.950s
sys 0m4.224s
PostgreSQL: ~1.6x longer than clp-s
$ time ./clp-s c postgresql_clp-s_archive postgresql
real 1m37.820s
user 1m37.445s
sys 0m0.232s
$ time ./clp-s i postgresql_archive postgresql_ir/
real 2m41.273s
user 2m40.924s
sys 0m0.172s
Spark: ~2.2x longer than clp-s
$ time ./clp-s c spark_archive spark-event-logs
real 4m18.178s
user 4m16.497s
sys 0m1.444s
$ time ./clp-s i spark_archive spark_ir/
real 9m38.949s
user 9m36.601s
sys 0m2.148s
Cockroach: ~1.45x longer than clp-s
$ time ./clp-s c cockroachdb_clp-s_archive cockroachdb
real 34m42.541s
user 33m27.539s
sys 0m11.856s
$ time ./clp-s i cockroachdb_archive cockroachdb_ir/
real 50m30.246s
user 50m20.945s
sys 0m8.311s
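The slowdown factors quoted for each dataset can be re-derived from the `real` times above. The following is a minimal sketch for doing so; the helper names `to_seconds` and `slowdown` and the `REAL_TIMES` table are illustrative and not part of clp-s, and the durations are copied by hand from the runs listed in this description.

```python
import re


def to_seconds(real: str) -> float:
    """Convert a `time`-style duration such as '18m4.676s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", real)
    if match is None:
        raise ValueError(f"unrecognized duration: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (wall time of `clp-s c`, wall time of `clp-s i`), copied from the runs above.
REAL_TIMES = {
    "elasticsearch": ("18m4.676s", "29m31.376s"),
    "postgresql": ("1m37.820s", "2m41.273s"),
    "spark": ("4m18.178s", "9m38.949s"),
    "cockroachdb": ("34m42.541s", "50m30.246s"),
}


def slowdown(dataset: str) -> float:
    """Ratio of IR-ingestion wall time to direct clp-s compression wall time."""
    compress, ingest = REAL_TIMES[dataset]
    return to_seconds(ingest) / to_seconds(compress)


if __name__ == "__main__":
    for name in REAL_TIMES:
        print(f"{name}: {slowdown(name):.2f}x")
```

Run as a script, this prints one ratio per dataset; the values line up with the approximate factors quoted above (about 1.6x for ElasticSearch and PostgreSQL, 2.2x for Spark, and just over 1.45x for Cockroach). Only wall-clock time is compared, which is reasonable here because `user` time dominates in every run.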
Postgres Perf Breakdown
CLP-S
Total: 1m 37s
64% in parse_line() - 1m 2s
18% in m_archive_writer->append_message() - 17.5s
5.2% JSON I/O - 5s
IRV2 -> Archive
Total: 2m 41s
53% in parse_kv_log_event()... includes m_archive_writer->append_message() - 1m 25s
35% deserializing IR (equivalent to the JSON I/O) - 56.5s
Summary: The deserialization process adds significantly more overhead than the JSON I/O does (56.5s vs. 5s in the Postgres breakdown). We are essentially reconstructing the information twice: once back into the format that was written out to the IR file, and then again into the archive format by walking over the IRV2 structures.
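As a rough cross-check of the Postgres breakdown, the per-phase shares and the portion of the end-to-end gap attributable to input handling can be recomputed from the figures above. This is only a sketch: the dictionaries and function names are illustrative, the totals are the rounded values from the breakdown (1m 37s and 2m 41s), and the comparison is approximate because `parse_kv_log_event()` already includes `append_message()` while the clp-s phases are listed separately.

```python
# Phase timings in seconds, copied from the Postgres perf breakdown above.
CLP_S_TOTAL = 97.0   # "Total: 1m 37s"
IRV2_TOTAL = 161.0   # "Total: 2m 41s"

CLP_S_PHASES = {
    "parse_line()": 62.0,
    "m_archive_writer->append_message()": 17.5,
    "JSON I/O": 5.0,
}
IRV2_PHASES = {
    "parse_kv_log_event() incl. append_message()": 85.0,
    "deserializing IR": 56.5,
}


def shares(phases: dict, total: float) -> dict:
    """Return each phase's share of the total run time, in percent."""
    return {name: 100.0 * secs / total for name, secs in phases.items()}


def io_gap_fraction() -> float:
    """Fraction of the total slowdown explained by the extra input-handling time."""
    extra_io = IRV2_PHASES["deserializing IR"] - CLP_S_PHASES["JSON I/O"]
    return extra_io / (IRV2_TOTAL - CLP_S_TOTAL)


if __name__ == "__main__":
    for label, phases, total in (
        ("CLP-S", CLP_S_PHASES, CLP_S_TOTAL),
        ("IRV2 -> Archive", IRV2_PHASES, IRV2_TOTAL),
    ):
        for name, pct in shares(phases, total).items():
            print(f"{label}: {pct:.1f}% in {name}")
    print(f"share of gap from input handling: {io_gap_fraction():.0%}")
```

Under these assumptions, the extra 51.5s spent deserializing IR (relative to JSON I/O) covers most of the 64s difference between the two runs, which is consistent with the summary's point that deserialization is the main source of overhead.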
Summary by CodeRabbit
New Features
Bug Fixes
Documentation