importccl: skip detour through CSV when importing from workload #34892
craig[bot] merged 2 commits into cockroachdb:master
Conversation
Force-pushed from 3664784 to aba3941
danhhz
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):
// Copyright 2018 The Cockroach Authors.
nit: 2019
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
// this tries to fast-path a few workload-generated things to dodge the parse
// TODO(dt): have workload's Batch func return Datums directly as the fast-path.
Agree with the sentiment, but I'd love to stay away from Datums in workload: too many allocations. Jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation-healthy), so my current plan is just to wait for that.
pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):
return tree.NewDInt(tree.DInt(t)), nil
case string:
	return tree.ParseDatumStringAs(hint, t, evalCtx)
It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums, and I'd love to get this merged for the better progress tracking, so I'm happy to do this myself as a followup.
Force-pushed from a574aaa to 89eec68
dt
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz)
pkg/ccl/importccl/read_import_workload.go, line 1 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: 2019
Done.
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Agree with the sentiment, but I'd love to stay away from Datums in workload: too many allocations. Jordan is pressing pretty hard on figuring out our in-memory columnar data representation (which will be much more allocation-healthy), so my current plan is just to wait for that.
A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right?
They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in a followup.
pkg/ccl/importccl/read_import_workload.go, line 66 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
It's probably not hard to figure out the various cases and avoid ParseDatumStringAs. I'd also love to go in here and reuse some datums, and I'd love to get this merged for the better progress tracking, so I'm happy to do this myself as a followup.
Yeah, my plan was to keep it as a fallback but try to identify the hottest paths via profiles of the workloads that we care about and hit those, e.g. just matching the ints appeared to get it from ~30% of CPU down to ~15% in one tpcc profile.
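The int fast-path discussed here can be sketched roughly as a type switch that handles the common concrete types directly and only falls back to string parsing for everything else. This is a hypothetical, self-contained simplification: `Datum`, `DInt`, `DString`, and `makeDatum` below are stand-ins for the `tree` types and `ParseDatumStringAs`, not the actual importccl code.

```go
package main

import (
	"fmt"
	"strconv"
)

// Datum is a simplified stand-in for tree.Datum.
type Datum interface{ String() string }

type DInt int64

func (d DInt) String() string { return strconv.FormatInt(int64(d), 10) }

type DString string

func (d DString) String() string { return string(d) }

// makeDatum fast-paths the common concrete types produced by workload
// generators (ints, here) and only falls back to string parsing for
// everything else, mirroring the type switch discussed above.
func makeDatum(v interface{}) (Datum, error) {
	switch t := v.(type) {
	case int:
		return DInt(t), nil // hot path: no string round-trip
	case int64:
		return DInt(t), nil
	case string:
		// Fallback: parse the string form (stand-in for ParseDatumStringAs).
		if n, err := strconv.ParseInt(t, 10, 64); err == nil {
			return DInt(n), nil
		}
		return DString(t), nil
	default:
		return nil, fmt.Errorf("unhandled type %T", t)
	}
}

func main() {
	d, _ := makeDatum(42)
	fmt.Println(d) // int fast path, no parsing
	s, _ := makeDatum("hello")
	fmt.Println(s) // string fallback
}
```

The point of the profile numbers above is that the `case int` branch alone covers enough of the hot rows (e.g. tpcc's many integer columns) to roughly halve the conversion cost before any other cases are specialized.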
This is a minor refactor that just inverts the calling pattern of a common helper. Recognizing that most of the file-reading import frontends needed to do the same work of opening a file, setting up decompression and progress tracking, etc., we added a common helper that took one of the frontend's raw file-handling funcs and called it after doing the common setup. This simply inverts that: the frontends get the files to convert, then can (and do) choose to call the helper internally and pass it their own per-file func. This paves the way for frontends that might want to do something different.

Release note: none.
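The inversion this commit describes amounts to something like the following sketch. All names here are hypothetical simplifications, not the importccl ones; the point is only who owns the loop over files.

```go
package main

import "fmt"

// Before the refactor, a common helper owned the loop over files and called
// a per-file func supplied by each frontend. After it, each frontend owns
// the loop and opts in to the helper per file, so a frontend is free to do
// something entirely different with its files.

// readFileWithCommonSetup is the shared per-file helper: open the file,
// set up decompression and progress tracking, then hand the contents to fn.
func readFileWithCommonSetup(name string, fn func(contents string) error) error {
	contents := "<data from " + name + ">" // pretend open/decompress happened
	fmt.Println("progress: reading", name)
	return fn(contents)
}

// csvFrontend now drives the loop itself and chooses to call the helper,
// passing its own per-file conversion func.
func csvFrontend(files []string) error {
	for _, f := range files {
		if err := readFileWithCommonSetup(f, func(c string) error {
			fmt.Println("csv: converting", c)
			return nil
		}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = csvFrontend([]string{"a.csv", "b.csv"})
}
```

With the loop in the frontend's hands, a frontend like the workload reader in the next commit can skip `readFileWithCommonSetup` entirely and interpret its "files" however it likes.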
Based on suggestions from Dan, and in an attempt to fix job progress reporting for workload imports (which previously generated CSV data on the fly for the CSV reader to consume, but did not know how to estimate a size and thus did not get incremental progress reporting), this is a prototype of using a dedicated IMPORT frontend for producing workload KV data. This is just a first pass -- mostly to prove the progress reporting would work -- and does not even begin to exploit all the possible optimizations that using a separate frontend would make not only possible but even pretty easy. E.g. this still uses strings and parsing for non-int types.

Release note: none.
Force-pushed from 89eec68 to 4c1a5b1
I changed up the datum converter a tad last week, so this is ready for another look.
danhhz
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/importccl/read_import_workload.go, line 53 at r2 (raw file):
Previously, dt (David Taylor) wrote…
A workload could return its strings tagged with the type aliases, like tree.DString instead of string, without any additional allocs, right? They wouldn't implement tree.Datum as non-pointers, but they would carry the additional info about what type workload actually wanted along to whatever is charged with transforming them into KVs (i.e. this func for now). Anyway, happy to revisit in followup.
Oh, the type aliases are an interesting idea. We're definitely agreed that this would be a followup.
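The type-alias idea floated here can be illustrated with a small sketch. `taggedString` below is a hypothetical stand-in for returning `tree.DString` instead of `string`: converting a `string` to a defined string type reuses the same backing data (no copy), which mirrors dt's "no additional allocs" claim for the conversion itself, while a downstream type switch can recover the intended SQL type without a parse hint.

```go
package main

import "fmt"

// taggedString is a defined string type: same in-memory representation as a
// plain string, but distinguishable in a type switch, so it can carry "this
// was meant to be a SQL string" along to whatever builds KVs from it.
type taggedString string

// describe shows how a converter could branch on the tag instead of needing
// an external type hint for every plain string it receives.
func describe(v interface{}) string {
	switch v.(type) {
	case taggedString:
		return "string column (type known, no parse hint needed)"
	case string:
		return "untyped string (needs a parse hint)"
	default:
		return fmt.Sprintf("other: %T", v)
	}
}

func main() {
	fmt.Println(describe(taggedString("a")))
	fmt.Println(describe("a"))
}
```

As dt notes, such values wouldn't implement tree.Datum, so a conversion step would still be needed; the win is only that the intended type travels with the value for free.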
bors r+
34892: importccl: skip detour through CSV when importing from workload r=dt a=dt

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard. Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`. However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases.

This is just a first pass at this -- mostly to prove it works and solves my original progress concern -- and this initial implementation does not even begin to exploit all the possible optimizations. However, it likely makes testing some of them much easier.

The specialized frontend is automatically used by the reader process when it is configured to read CSV data and all file URIs point to workload-generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row-converter that the CSV reader would have used -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the `COCKROACH_IMPORT_WORKLOAD_FASTER` env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
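The progress-reporting fix described above boils down to switching the progress fraction's denominator from bytes (unknown up front for on-the-fly CSV) to rows (known up front by the generator). A hypothetical sketch, not the importccl implementation:

```go
package main

import "fmt"

// progressFraction reports completed work as rowsEmitted/totalRows. A
// workload generator knows totalRows before it starts, unlike the byte-based
// num_bytes_read/num_bytes scheme, which needs the total byte size up front
// and so can't report progress for data generated on the fly.
func progressFraction(rowsEmitted, totalRows int64) float64 {
	if totalRows <= 0 {
		return 0 // no row count available; report no progress rather than divide by zero
	}
	return float64(rowsEmitted) / float64(totalRows)
}

func main() {
	// e.g. 250 of 1000 rows generated so far.
	fmt.Printf("%.2f\n", progressFraction(250, 1000)) // 0.25
}
```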
Build succeeded